Pradeep Kamath commented on PIG-872:

The distributed cache can be used when the "replicated" input to the 
fragment-replicate join is a file already present on DFS at query start time. 
When the replicated input is an intermediate output of the query (say, the 
output of a filter), the client side will know the filename of that 
intermediate output; however, we need to verify that Hadoop honors the 
distributed cache property only when launching the job corresponding to the 
FR join (so that the file is present on DFS at that time). If this is not the 
case, the approach may not work for intermediate-output scenarios.
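A minimal sketch of the frontend registration step this timing concern is about, assuming Hadoop's org.apache.hadoop.filecache.DistributedCache API; the class name, method name, and path here are illustrative, not from Pig's code:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

// Hypothetical frontend hook: invoked only when the FR-join job itself is
// about to be submitted, so that a replicated input which is an intermediate
// output (e.g. produced by an earlier filter job) already exists on DFS.
public class FRJoinCacheSetup {
    static void registerReplicatedInput(Configuration conf, String dfsPath)
            throws Exception {
        // dfsPath is a placeholder for the DFS location of the replicated
        // input, known to the client at plan-compilation time.
        DistributedCache.addCacheFile(new URI(dfsPath), conf);
    }
}
```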

> use distributed cache for the replicated data set in FR join
> ------------------------------------------------------------
>                 Key: PIG-872
>                 URL: https://issues.apache.org/jira/browse/PIG-872
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
> Currently, the replicated file is read directly from DFS by all maps. If the 
> number of concurrent maps is huge, we can overwhelm the NameNode with open 
> calls.
> Using the distributed cache will address the issue and might also give a 
> performance boost, since the file will be copied locally once and then 
> reused by all tasks running on the same machine.
> The basic approach would be to use cacheArchive to place the file into the 
> cache on the frontend; on the backend, the tasks would refer to the data 
> using its path in the cache.
> Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
> us right now as we don't use it.)
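The backend half of the approach described above, where tasks resolve the node-local copy instead of opening the DFS file, could look roughly like this sketch against Hadoop's DistributedCache API (class and method names other than the DistributedCache calls are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;

// Hypothetical backend helper: inside a map task, locate the local copy of
// the replicated input that the TaskTracker has already pulled from DFS,
// avoiding a per-task open call against the NameNode.
public class FRJoinCacheLookup {
    static Path localReplicatedInput(Configuration conf) throws Exception {
        Path[] local = DistributedCache.getLocalCacheFiles(conf);
        // This sketch assumes a single cached file: the replicated join input.
        return local[0];
    }
}
```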

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
