[
https://issues.apache.org/jira/browse/PIG-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183432#comment-14183432
]
Daniel Dai commented on PIG-4241:
---------------------------------
Do all non-file-based jobs get converted to auto local mode? If so, let's put the
fix in 0.14 as well. Can you add a comment explaining the max parameter of
getPathLength and getTotalInputFileSize?
> Auto local mode mistakenly converts large jobs to local mode when used with
> Hive tables
> ----------------------------------------------------------------------------------------
>
> Key: PIG-4241
> URL: https://issues.apache.org/jira/browse/PIG-4241
> Project: Pig
> Issue Type: Bug
> Components: impl
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4241-1.patch, PIG-4241-2.patch
>
>
> The current implementation of auto local mode has two severe problems:
> # It assumes file-based inputs, and it always converts jobs with
> non-file-based inputs to local mode unless
> {{LoadMetadata.getStatistics().getSizeInBytes()}} returns more than 100M. This
> is particularly problematic when using Pig with Hive tables through custom
> LoadFuncs that do not implement the LoadMetadata interface.
> # It lists every input file to compute the total size and only then compares
> the total against the configured max bytes. This is very time-consuming when
> a Pig job loads a large number of files. Instead, we should stop summing the
> input sizes as soon as the running total exceeds the max bytes. The current
> code:
> {code:title=JobControlCompiler.java}
> long totalInputFileSize =
>     InputSizeReducerEstimator.getTotalInputFileSize(conf, lds, job); // THIS IS BAD!
> long inputByteMax =
>     conf.getLong(PigConfiguration.PIG_AUTO_LOCAL_INPUT_MAXBYTES, 100*1000*1000L);
> log.info("Size of input: " + totalInputFileSize + " bytes. Small job threshold: " + inputByteMax);
> if (totalInputFileSize < 0 || totalInputFileSize > inputByteMax) {
>     return false;
> }
> {code}
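
The two fixes proposed in the description can be sketched together as follows. This is a minimal, self-contained illustration and not the code from the attached patches: the class name AutoLocalSketch, the method isSmallJob, and the convention that a negative size means "unknown" are assumptions made for the example. The input sizes are passed as an iterator so that listing can stop early, and a job with any input of unknown size is never treated as small.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical sketch; names and conventions here are not taken from Pig.
final class AutoLocalSketch {

    /**
     * Returns true only if every input size is known (non-negative) and the
     * sum of all sizes does not exceed max.
     *
     * Stops consuming the iterator as soon as the running total would exceed
     * max, so a lazily produced listing is not walked to the end.
     * Assumes max is non-negative.
     */
    static boolean isSmallJob(Iterator<Long> inputSizes, long max) {
        long remaining = max; // budget left before the job counts as large
        while (inputSizes.hasNext()) {
            long size = inputSizes.next();
            if (size < 0) {
                // Unknown size (e.g. a non-file-based input): do not
                // assume the job is small.
                return false;
            }
            if (size > remaining) {
                // Running total would exceed max: stop listing here.
                return false;
            }
            remaining -= size; // cannot overflow, since 0 <= size <= remaining
        }
        return true;
    }

    public static void main(String[] args) {
        long max = 100L * 1000 * 1000;
        System.out.println(isSmallJob(Arrays.asList(10L, 20L).iterator(), max));
        System.out.println(isSmallJob(Arrays.asList(-1L).iterator(), max));
    }
}
```

Tracking the remaining budget instead of a running total keeps the comparison overflow-safe, and it matches the check quoted above in that a total exactly equal to the threshold still counts as small.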