Cheolsoo Park created PIG-4241:
----------------------------------
Summary: Auto local mode mistakenly converts large jobs to local
mode when using with Hive tables
Key: PIG-4241
URL: https://issues.apache.org/jira/browse/PIG-4241
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Fix For: 0.15.0
The current implementation of auto local mode has two severe problems-
# It assumes file-based inputs, and it always converts jobs with non-file-based
inputs into local mode unless the
{{LoadMetadata.getStatistics().getSizeInBytes()}} returns >100M. This is
particularly problematic when using Pig with Hive tables with custom LoadFuncs
that did not implement LoadMetadata interface.
# It lists all the files to compute the total size. The algorithm is like this.
First, compute the total size. Second, compare it against the configured max
bytes. This is very time-consuming when Pig job loads a large number of files.
It will list all the files only to compute the total size. Instead, we should
stop computing the sum of input sizes as soon as it becomes the max bytes-
{code:title=JobControlCompiler.java}
long totalInputFileSize = InputSizeReducerEstimator.getTotalInputFileSize(conf,
lds, job); // THIS IS BAD!
long inputByteMax =
conf.getLong(PigConfiguration.PIG_AUTO_LOCAL_INPUT_MAXBYTES, 100*1000*1000l);
log.info("Size of input: " + totalInputFileSize +" bytes. Small job threshold:
" + inputByteMax );
if (totalInputFileSize < 0 || totalInputFileSize > inputByteMax) {
return false;
}
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)