[
https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263866#comment-13263866
]
Rohini Palaniswamy commented on PIG-2672:
-----------------------------------------
Proposed Solution:
* For each user, create a .pig directory, for example /user/rohini/.pig. Copy the
pig libraries to /user/rohini/.pig/piglib/pig-[version]/ and then add them to
the distributed cache. If the jars are already present in hdfs, just add them to
the distributed cache.
* Copy the user libraries to
/user/rohini/.pig/userlib/jarname-[checksum|filesize].jar and then add them to
the distributed cache. If a jar with the same checksum is already present in hdfs,
just add it to the distributed cache.
* This will allow shipping jars/udfs only once to the cluster and prevent
multiple copies in different locations on the tasktrackers.
* The reasoning for including the checksum or filesize in the jar name is to
avoid job failures due to jars being overwritten. For example, suppose a user
jar is copied as part of one pig job, and the user runs another pig job with a
modified version of the same jar while the old job is still running; without
distinct names there would be a conflict. The cleanup task checks whether the
files in the distributed cache have the same timestamp as the original hdfs
file and fails the job if they do not. So even if the old job's map/reduce
tasks completed successfully, the job would fail in cleanup.
* This solution can be controlled by a configuration setting. If turned off,
pig reverts to the old behaviour.
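The naming scheme in the second bullet can be sketched as follows. This is a
minimal illustration, not the actual patch: it assumes an MD5 checksum over the
jar contents, and the class and method names here are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the proposed user-jar naming scheme: jarname-[checksum].jar.
// Two jars with the same name but different contents map to distinct hdfs
// paths, so a running job's cached copy is never overwritten.
public class CachedJarName {

    // Hex-encoded MD5 of the jar contents (the checksum choice is an
    // assumption; the proposal also allows filesize instead).
    static String checksum(byte[] contents) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(contents)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Builds /user/<user>/.pig/userlib/<basename>-<checksum>.jar
    static String userlibPath(String user, String jarName, byte[] contents)
            throws NoSuchAlgorithmException {
        String base = jarName.endsWith(".jar")
                ? jarName.substring(0, jarName.length() - 4)
                : jarName;
        return "/user/" + user + "/.pig/userlib/"
                + base + "-" + checksum(contents) + ".jar";
    }

    public static void main(String[] args) throws Exception {
        byte[] oldJar = "old udf bytes".getBytes();
        byte[] newJar = "new udf bytes".getBytes();
        // The modified jar maps to a different path, so both versions
        // can coexist in hdfs while the old job is still running.
        System.out.println(userlibPath("rohini", "myudfs.jar", oldJar));
        System.out.println(userlibPath("rohini", "myudfs.jar", newJar));
    }
}
```

Before uploading, pig would check whether a file already exists at the computed
path and, if so, skip the copy and only register the existing file with the
distributed cache.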
We have used this approach for our dataloading application, which runs more
than 50K jobs every day and shipped around 5 jars per job, and it improved job
launch performance quite a bit. With the larger number of jars in pig it should
show even more improvement; currently pig takes a relatively long time to
launch a job.
> Optimize the use of DistributedCache
> ------------------------------------
>
> Key: PIG-2672
> URL: https://issues.apache.org/jira/browse/PIG-2672
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
>
> Pig currently copies jar files to a temporary location in hdfs and then adds
> them to DistributedCache for each job launched. This is inefficient in terms
> of
> * Space - The jars are distributed to the task trackers for every job, taking
> up a lot of local temporary space on the tasktrackers.
> * Performance - The jar distribution impacts the job launch time.