[
https://issues.apache.org/jira/browse/PIG-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816848#comment-13816848
]
Rohini Palaniswamy commented on PIG-3564:
-----------------------------------------
A few notes:
- Noticed that the HDFS delegation token is fetched multiple times, even though
it should not be fetched when the Credentials already has a token for that
service. This happens with mapreduce as well even now. It looks like a Hadoop bug.
Will look at it later.
- [~sseth] is looking at the failure due to security when a Vertex tries to
talk to the AM (TEZ-606) and has an initial patch to try out. If that does not
work, this will have to continue later, as he is out for 10 days starting
tomorrow and Hitesh is out as well.
- I moved split calculation from the AM to the client itself, similar to
mapreduce, because FileInputFormat fetches delegation tokens. Other input
formats like HCatInputFormat do that as well. There are three ways to get the
split information with Tez; I got the rationale behind the different methods
after talking to Sidd. This patch goes with the third approach for now.
1) Let the AM calculate the splits
- This is the best option, but it will not work with security because
some InputFormats fetch tokens and add them to the job, which can be done only
on the client, since only the client has the Kerberos credentials. Sidd said the
Hive team saw huge gains with this when there were map-side joins with multiple
loads and the splits were calculated in parallel.
2) Get the splits and write to a DFS file.
- This is similar to what mapreduce does now. The split files are
localized to each task node, and each task indexes into the file to read its
split. The disadvantage is that the whole split file has to be distributed to
all task nodes even though each reads only a part of it. It also prevents
container reuse if the localized resources differ because of different split
files for the vertices.
3) Get the splits in memory and add them to the user payload.
- This serializes the splits into the user payload itself and avoids
having to distribute a separate splits file. The configuration and the
relevant portion of the split information are sent to the task node from the AM
through RPC, which has proved to be faster. The problem is that with a huge
number of splits, the AM will require a lot of memory and there will be a lot
of over-the-wire transfer.
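To make the trade-off of the third approach concrete, here is a minimal sketch of packing splits into an in-memory payload. It is not the code in the patch and does not use the Tez or Hadoop APIs: SplitPayloadSketch, FileRange, makeSplits and the 128 MB split size are hypothetical stand-ins chosen for illustration. It only shows that the payload, and hence the AM memory and wire transfer, grows with the number of splits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SplitPayloadSketch {
    // Assumed split size, for illustration only.
    public static final long SPLIT_BYTES = 128L * 1024 * 1024;

    /** Hypothetical stand-in for an input split: a path plus a byte range. */
    public static final class FileRange {
        public final String path;
        public final long start;
        public final long length;

        public FileRange(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /** Client side: build `count` consecutive splits over one file. */
    public static List<FileRange> makeSplits(String path, int count) {
        List<FileRange> splits = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            splits.add(new FileRange(path, i * SPLIT_BYTES, SPLIT_BYTES));
        }
        return splits;
    }

    /** Client side: pack all splits into one byte[] that travels with the job. */
    public static byte[] serialize(List<FileRange> splits) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(splits.size());
            for (FileRange s : splits) {
                out.writeUTF(s.path);
                out.writeLong(s.start);
                out.writeLong(s.length);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Receiving side: unpack the payload; every split is held in memory. */
    public static List<FileRange> deserialize(byte[] payload) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(payload))) {
            int count = in.readInt();
            List<FileRange> splits = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                splits.add(new FileRange(in.readUTF(), in.readLong(), in.readLong()));
            }
            return splits;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Print how the payload size grows with the number of splits.
        for (int count : new int[] {10, 1000, 100000}) {
            byte[] payload = serialize(makeSplits("/data/part-00000", count));
            System.out.println(count + " splits -> " + payload.length + " payload bytes");
        }
    }
}
```

Running main prints the payload size for a few split counts. In this sketch the size is linear in the number of splits, which is the memory and transfer concern noted for approach 3.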
> Make Tez work with security
> ---------------------------
>
> Key: PIG-3564
> URL: https://issues.apache.org/jira/browse/PIG-3564
> Project: Pig
> Issue Type: Sub-task
> Components: tez
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: tez-branch
>
> Attachments: PIG-3564-inital.patch
>
>