[
https://issues.apache.org/jira/browse/PIG-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227280#comment-13227280
]
Dmitriy V. Ryaboy commented on PIG-2573:
----------------------------------------
Nice work, Travis.
Tests pass.
I think we might as well take PigStorageWithStatistics and move it into
PigStorage. It might also make sense to keep a transient LRU map of
location->stats in case getStats gets called several times, so we don't hit the
NameNode for file sizes over and over.
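Something along these lines could work. This is only a sketch, not anything in the patch: StatsCache, fetchSizeFromNameNode, and the eviction bound are all made-up names/values, and the real cached value would be a ResourceStatistics rather than a plain size.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a transient location->stats LRU cache. All names here are
// illustrative; the real lookup would go through FileSystem (e.g.
// getContentSummary) and cache ResourceStatistics objects.
public class StatsCache {
    private static final int MAX_ENTRIES = 100;
    private int fetches = 0;

    // An access-ordered LinkedHashMap evicts the least-recently-used
    // entry once the map grows past MAX_ENTRIES.
    private final transient Map<String, Long> sizeByLocation =
        new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > MAX_ENTRIES;
            }
        };

    public synchronized long getSize(String location) {
        Long cached = sizeByLocation.get(location);
        if (cached == null) {
            cached = fetchSizeFromNameNode(location); // one NN round trip
            sizeByLocation.put(location, cached);
        }
        return cached;
    }

    public synchronized int getFetchCount() {
        return fetches;
    }

    // Placeholder for the real HDFS call; returns a dummy size so the
    // sketch is self-contained.
    private long fetchSizeFromNameNode(String location) {
        fetches++;
        return location.length();
    }
}
```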
It looks like the current behavior is that if you have multiple POLoads and
only a subset of them implement LoadMetadata and return a non-zero size, the
filesystem is used for all of them. In other words, if I have two loaders and
one reports a size while the other does not, the first loader's reported size
is ignored. That's acceptable (better than what we have now!) but not ideal.
Perhaps we can move the metadata-vs-filesystem check into the inner loop of
getInputSizeFromLoadMetadata?
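Roughly like this, as a self-contained sketch of the per-input fallback; none of these names (Input, totalInputSize, the sentinel -1) come from the patch, and the stand-in Input class replaces the actual POLoad/LoadMetadata plumbing:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of per-input fallback: consult each loader's metadata first,
// and hit the filesystem only for inputs whose loader reports no size.
// All names here are illustrative, not Pig's.
public class InputSizeSketch {

    // Stand-in for a POLoad: metadataSize < 0 means "loader has no stats".
    static class Input {
        final long metadataSize;
        final long fsSize;
        Input(long metadataSize, long fsSize) {
            this.metadataSize = metadataSize;
            this.fsSize = fsSize;
        }
    }

    static long totalInputSize(List<Input> inputs) {
        long total = 0;
        for (Input in : inputs) {
            long size = in.metadataSize;  // try LoadMetadata stats first
            if (size < 0) {
                size = in.fsSize;         // fall back to the FS per input
            }
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        // One loader reports 100 bytes via metadata; the other has no
        // stats and falls back to a 50-byte filesystem size.
        List<Input> inputs =
            Arrays.asList(new Input(100, 100), new Input(-1, 50));
        System.out.println(totalInputSize(inputs)); // 150
    }
}
```

This way one metadata-less loader no longer discards the sizes the other loaders did report.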
Also, just stylistically, let's rename that method to getInputSizeFromMetadata
(more readable, and it refers to the concept rather than a specific interface).
> Automagically setting parallelism based on input file size does not work with
> HCatalog
> --------------------------------------------------------------------------------------
>
> Key: PIG-2573
> URL: https://issues.apache.org/jira/browse/PIG-2573
> Project: Pig
> Issue Type: Bug
> Reporter: Travis Crawford
> Assignee: Travis Crawford
> Attachments: PIG-2573_get_size_from_stats_if_possible.diff,
> PIG-2573_move_getinputbytes_to_loadfunc.diff
>
>
> PIG-2334 was helpful in understanding this issue. The short version is that
> the input file size is only computed if the path begins with a whitelisted
> prefix, currently:
> * /
> * hdfs:
> * file:
> * s3n:
> As HCatalog locations use the form {{dbname.tablename}}, the input file size
> is not computed, and the size-based parallelism optimization breaks.
> DETAILS:
> I discovered this issue by comparing two runs of the same script, one loading
> regular HDFS paths and one using HCatalog db.table names. I happened to
> notice the difference in the "Setting number of reducers" line.
> {code:title=Loading HDFS files reducers is set to 99}
> 2012-03-08 01:33:56,522 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=98406674162
> 2012-03-08 01:33:56,522 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - Neither PARALLEL nor default parallelism is set for this job. Setting
> number of reducers to 99
> {code}
> {code:title=Loading with an HCatalog db.table name}
> 2012-03-08 01:06:02,283 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=0
> 2012-03-08 01:06:02,283 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - Neither PARALLEL nor default parallelism is set for this job. Setting
> number of reducers to 1
> {code}
> Possible fix: Pig should just ask the loader for the size of its inputs
> rather than special-casing certain location types.