Cheolsoo Park created PIG-4241:
----------------------------------

             Summary: Auto local mode mistakenly converts large jobs to local 
mode when used with Hive tables
                 Key: PIG-4241
                 URL: https://issues.apache.org/jira/browse/PIG-4241
             Project: Pig
          Issue Type: Bug
          Components: impl
            Reporter: Cheolsoo Park
            Assignee: Cheolsoo Park
             Fix For: 0.15.0


The current implementation of auto local mode has two severe problems:
# It assumes file-based inputs, and it always converts jobs with non-file-based 
inputs into local mode unless 
{{LoadMetadata.getStatistics().getSizeInBytes()}} returns more than the 
configured max bytes (100 MB by default). This is particularly problematic when 
using Pig with Hive tables through custom LoadFuncs that do not implement the 
LoadMetadata interface: in that case even large jobs are converted to local mode.
# It lists all the input files to compute the total size. The algorithm first 
computes the total size and then compares it against the configured max bytes. 
This is very time-consuming when a Pig job loads a large number of files, 
because every file is listed only to compute the total. Instead, we should 
stop summing the input sizes as soon as the running total exceeds the max bytes:
{code:title=JobControlCompiler.java}
// THIS IS BAD: it lists every input file up front just to compute the total size.
long totalInputFileSize =
        InputSizeReducerEstimator.getTotalInputFileSize(conf, lds, job);
long inputByteMax =
        conf.getLong(PigConfiguration.PIG_AUTO_LOCAL_INPUT_MAXBYTES, 100*1000*1000L);
log.info("Size of input: " + totalInputFileSize
        + " bytes. Small job threshold: " + inputByteMax);
if (totalInputFileSize < 0 || totalInputFileSize > inputByteMax) {
        return false;
}
{code}
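
One possible shape for the check proposed above, as a minimal self-contained sketch rather than the actual patch for this issue. It does not use any Pig or Hadoop API: the class SmallJobCheck, the method isSmallJob, and the UNKNOWN sentinel are hypothetical names chosen for illustration, and each input is modeled as a LongSupplier that reports its size only when asked. The sketch assumes that an input with an unknown size should keep the job out of local mode, and it stops asking for sizes once the running total passes the threshold.

```java
import java.util.function.LongSupplier;

// Hypothetical helper for illustration only; not part of Pig.
class SmallJobCheck {
    // Assumed convention for this sketch: a negative size means "unknown".
    static final long UNKNOWN = -1L;

    /**
     * Returns true only if every input reports a known size and the running
     * total never exceeds maxBytes. Sizes are requested lazily, so inputs
     * after the one that crosses the threshold are never examined.
     */
    static boolean isSmallJob(Iterable<LongSupplier> inputSizes, long maxBytes) {
        long total = 0L;
        for (LongSupplier input : inputSizes) {
            long bytes = input.getAsLong();
            if (bytes < 0) {
                // Unknown size (e.g. no statistics available): do not assume
                // the job is small.
                return false;
            }
            // Written as a subtraction so the comparison cannot overflow.
            if (bytes > maxBytes - total) {
                return false; // threshold exceeded: stop summing here
            }
            total += bytes;
        }
        return true;
    }
}
```

With a threshold of 100 bytes and three inputs of 60 bytes each, the loop returns false after asking the second input for its size and never touches the third.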



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
