[ 
https://issues.apache.org/jira/browse/PIG-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816848#comment-13816848
 ] 

Rohini Palaniswamy commented on PIG-3564:
-----------------------------------------

A few notes:
   - Noticed that the HDFS delegation token is fetched multiple times, even 
though it should not be fetched when the Credentials already has a token for 
that service. This happens with mapreduce as well even now, so it is probably 
a Hadoop bug. Will look at it later.
   - [~sseth] is looking at the security-related failure in Vertex trying to 
talk to the AM (TEZ-606) and has an initial patch to try out. If that does not 
work, will continue later, as he is out for 10 days starting tomorrow and 
Hitesh is out as well.
   
   - I moved split calculation from the AM to the client itself, similar to 
mapreduce, because FileInputFormat fetches delegation tokens. Other input 
formats such as HCatInputFormat do that as well. There are three ways to get 
the split information with Tez, and I got the rationale behind the different 
methods after talking to Sidd. This patch goes with the third approach for now.
   1) Let the AM calculate the splits.
         - This is the best option, but it does not work with security, 
because some InputFormats fetch tokens and add them to the job, and that can 
be done only on the client, which has the Kerberos credentials. Sidd said the 
Hive team saw huge gains with this when there were map-side joins with 
multiple loads and the splits were calculated in parallel.
   2) Get the splits and write them to a DFS file.
        - This is similar to what mapreduce does now. The split files are 
localized to each task node, and each task indexes into the file and reads its 
part. The disadvantage is that the whole split file has to be distributed to 
all task nodes even though each reads only a part of it. It also prevents 
container reuse if the localized resources differ because the vertices have 
different split files.
   3) Get the splits in memory and add them to the user payload.
        - This serializes the splits into the user payload itself and avoids 
having to distribute a separate splits file. Also, the configuration and the 
relevant portion of the split information are sent to the task node from the 
AM through RPC, which has proved to be faster. The problem is that with a huge 
number of splits, the AM will require a lot of memory and there will be a lot 
of over-the-wire transfer.
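
To illustrate the third approach, here is a minimal sketch in plain Java. It 
is not the code in the patch and does not use the Tez or Hadoop APIs. The 
class and method names (SplitPayloadSketch, Split, toPayload, fromPayload, 
splitForTask) are hypothetical, and the split is assumed to be just a path 
plus a byte range. The client packs all splits into one byte array, and the 
AM side decodes it and hands each task only its own split.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not the Tez or Pig API.
final class SplitPayloadSketch {

    // Stand-in for a file split: a path plus a byte range.
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    // Client side: the splits are computed where the Kerberos credentials
    // are available, then packed into one in-memory payload.
    static byte[] toPayload(List<Split> splits) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(splits.size());
            for (Split s : splits) {
                out.writeUTF(s.path);
                out.writeLong(s.start);
                out.writeLong(s.length);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // AM side: decode the payload. Every split is held in memory here.
    static List<Split> fromPayload(byte[] payload) {
        try {
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(payload));
            int count = in.readInt();
            List<Split> splits = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                splits.add(new Split(in.readUTF(), in.readLong(), in.readLong()));
            }
            return splits;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // AM side: pick only the split that belongs to one task, which is the
    // part that would be sent over RPC instead of the whole list.
    static Split splitForTask(byte[] payload, int taskIndex) {
        return fromPayload(payload).get(taskIndex);
    }
}
```

In this sketch the payload grows linearly with the number of splits, which 
is the memory and transfer cost mentioned for the third approach.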

> Make Tez work with security
> ---------------------------
>
>                 Key: PIG-3564
>                 URL: https://issues.apache.org/jira/browse/PIG-3564
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: tez-branch
>
>         Attachments: PIG-3564-inital.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.1#6144)
