[
https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263866#comment-13263866
]
Rohini Palaniswamy commented on PIG-2672:
-----------------------------------------
Proposed Solution:
* For each user, create a .pig directory, for example /user/rohini/.pig. Copy the
pig libraries to /user/rohini/.pig/piglib/pig-[version]/ and then add them to
the distributed cache. If the jars are already present in hdfs, just add them to
the distributed cache.
* Copy the user libraries to
/user/rohini/.pig/userlib/jarname-[checksum|filesize].jar and then add them to
the distributed cache. If a jar with the same checksum is already present in hdfs,
just add it to the distributed cache.
* This will allow shipping jars/udfs only once to the cluster and prevent
multiple copies in different locations on the tasktrackers.
* The reasoning for including the checksum or filesize in the jar name is to
avoid job failures due to jars being overwritten. For example, suppose a user
jar is copied as part of one pig job, and the user runs another pig job with a
modified version of the same jar while the old job is still running; without
distinct names there would be a conflict. The cleanup task checks whether the
files in the distributed cache have the same timestamp as the original hdfs
file and fails the job if they do not. So even if the old job's map/reduce
tasks completed successfully, the job would fail in cleanup.
* This solution can be controlled by a configuration setting. If turned off,
pig reverts to the old behaviour.
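The naming scheme in the second bullet can be sketched as follows. This is a
minimal illustration, not the actual patch: it assumes an MD5 checksum over the
jar contents, and the class and method names here are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the proposed user-jar naming scheme: jarname-[checksum].jar.
// Two jars with the same name but different contents map to distinct hdfs
// paths, so a running job's cached copy is never overwritten.
public class CachedJarName {

    // Hex-encoded MD5 of the jar contents (the checksum choice is an
    // assumption; the proposal also allows filesize instead).
    static String checksum(byte[] contents) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(contents)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Builds /user/<user>/.pig/userlib/<basename>-<checksum>.jar
    static String userlibPath(String user, String jarName, byte[] contents)
            throws NoSuchAlgorithmException {
        String base = jarName.endsWith(".jar")
                ? jarName.substring(0, jarName.length() - 4)
                : jarName;
        return "/user/" + user + "/.pig/userlib/"
                + base + "-" + checksum(contents) + ".jar";
    }

    public static void main(String[] args) throws Exception {
        byte[] oldJar = "old udf bytes".getBytes();
        byte[] newJar = "new udf bytes".getBytes();
        // The modified jar maps to a different path, so both versions
        // can coexist in hdfs while the old job is still running.
        System.out.println(userlibPath("rohini", "myudfs.jar", oldJar));
        System.out.println(userlibPath("rohini", "myudfs.jar", newJar));
    }
}
```

Before uploading, pig would check whether a file already exists at the computed
path and, if so, skip the copy and only register the existing file with the
distributed cache.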
We have used this approach for our dataloading application, which runs more
than 50K jobs every day and shipped around 5 jars per job, and it improved job
launch performance quite a bit. With the larger number of jars in pig it should
show even more improvement; currently pig takes a relatively long time to
launch a job.
> Optimize the use of DistributedCache
> ------------------------------------
>
> Key: PIG-2672
> URL: https://issues.apache.org/jira/browse/PIG-2672
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
>
> Pig currently copies jar files to a temporary location in hdfs and then adds
> them to DistributedCache for each job launched. This is inefficient in terms
> of
> * Space - The jars are distributed to the task trackers for every job, taking
> up a lot of local temporary space on the tasktrackers.
> * Performance - The jar distribution impacts the job launch time.