[
https://issues.apache.org/jira/browse/PIG-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cheolsoo Park updated PIG-4241:
-------------------------------
Attachment: PIG-4241-1.patch
The attached patch includes the following changes:
# When a Hive table name is interpreted as an HDFS file, {{globStatus()}} returns
null. In this case, {{InputSizeReducerEstimator.getTotalInputFileSize()}}
now returns -1, so big jobs with Hive table input are no longer converted
to local mode.
# A max parameter is added to
{{InputSizeReducerEstimator.getTotalInputFileSize()}}. When it computes the
total input size recursively, it now exits as soon as it reaches the max. This
avoids listing all the files just to determine whether the job can be converted
to local mode.
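The two changes can be illustrated with a minimal, self-contained sketch. This is not the code in the patch: the class and method names ({{BoundedInputSize}}, {{totalSize}}, {{isSmallJob}}) are hypothetical, and it walks the local file system with {{java.nio.file}} instead of the Hadoop {{FileSystem}} API. It assumes that a missing path stands in for an input whose size cannot be determined, and it stops once the running total exceeds the max so that an input of exactly max bytes still counts as small.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical illustration only; not the InputSizeReducerEstimator code.
class BoundedInputSize {

    /**
     * Sums the sizes of all files under the given inputs.
     * Returns -1 if any input does not exist (size unknown).
     * Stops descending as soon as the running total exceeds max.
     */
    public static long totalSize(List<Path> inputs, long max) throws IOException {
        long total = 0;
        for (Path input : inputs) {
            if (!Files.exists(input)) {
                return -1; // unknown size: the caller must not treat the job as small
            }
            total = addSize(input, total, max);
            if (total > max) {
                return total; // early exit: remaining inputs are never listed
            }
        }
        return total;
    }

    // Adds the size of path (recursively for directories) to the running total.
    private static long addSize(Path path, long total, long max) throws IOException {
        if (Files.isRegularFile(path)) {
            return total + Files.size(path);
        }
        if (Files.isDirectory(path)) {
            try (Stream<Path> children = Files.list(path)) {
                for (Path child : (Iterable<Path>) children::iterator) {
                    total = addSize(child, total, max);
                    if (total > max) {
                        break; // stop listing siblings once over the threshold
                    }
                }
            }
        }
        return total;
    }

    /** Mirrors the check in the issue: negative (unknown) or oversized input is not small. */
    public static boolean isSmallJob(long totalSize, long max) {
        return totalSize >= 0 && totalSize <= max;
    }
}
```

With this shape, the value returned after an early exit is only a lower bound on the real input size, which is enough for the local-mode decision but not for anything that needs the exact total.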
> Auto local mode mistakenly converts large jobs to local mode when using with
> Hive tables
> ----------------------------------------------------------------------------------------
>
> Key: PIG-4241
> URL: https://issues.apache.org/jira/browse/PIG-4241
> Project: Pig
> Issue Type: Bug
> Components: impl
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
> Fix For: 0.15.0
>
> Attachments: PIG-4241-1.patch
>
>
> The current implementation of auto local mode has two severe problems:
> # It assumes file-based inputs, and it always converts jobs with
> non-file-based inputs into local mode unless
> {{LoadMetadata.getStatistics().getSizeInBytes()}} returns >100M. This is
> particularly problematic when using Pig with Hive tables through custom
> LoadFuncs that do not implement the LoadMetadata interface.
> # It lists all the files to compute the total size. The algorithm first
> computes the total size and only then compares it against the
> configured max bytes. This is very time-consuming when a Pig job loads a large
> number of files, because every file is listed just to compute the total.
> Instead, we should stop summing input sizes as soon as the sum
> reaches the max bytes:
> {code:title=JobControlCompiler.java}
> long totalInputFileSize =
>     InputSizeReducerEstimator.getTotalInputFileSize(conf, lds, job); // THIS IS BAD!
> long inputByteMax =
>     conf.getLong(PigConfiguration.PIG_AUTO_LOCAL_INPUT_MAXBYTES, 100*1000*1000l);
> log.info("Size of input: " + totalInputFileSize + " bytes. Small job threshold: " + inputByteMax);
> if (totalInputFileSize < 0 || totalInputFileSize > inputByteMax) {
>     return false;
> }
> {code}