[ 
https://issues.apache.org/jira/browse/PIG-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227962#comment-13227962
 ] 

Travis Crawford commented on PIG-2573:
--------------------------------------

PIGSTORAGEWITHSTATISTICS COMMENT:

Originally I did something similar to what you suggested, but after a bit more 
thought I kept PigStorage unchanged and used a test-specific loader. Since we 
fall back to the existing "get size from supported filesystems" lookup, 
PigStorage already has this feature for most users. JobControlCompiler and 
PigStorage would call the same utility method to report size, so I think 
updating PigStorage would actually make the code more complex.

The goal here is to let a loader report the size of its input for 
non-filesystem sources (HCatalog db.table names; rows from HBase/Vertica/MySQL/...), 
or when doing something fancy with files on a filesystem (indexed files where 
blocks/splits are pre-filtered). If you're doing something fancy, you probably 
have a fancy loader too.
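As a sketch of the loader-side reporting described above (the interface and class names here are illustrative stand-ins, not Pig's actual LoadFunc API), a loader for a catalog-backed source could report its size directly from table statistics:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a loader that can report its input size directly, so
// non-filesystem locations (db.table names, HBase rows, ...) are not limited
// to path-prefix lookups. Names are illustrative only.
interface SizeReportingLoader {
    // Returns total input bytes, or -1 if unknown.
    long getInputSizeInBytes(String location);
}

class FakeCatalogLoader implements SizeReportingLoader {
    // Stand-in for real catalog statistics (e.g. HCatalog table metadata).
    private final Map<String, Long> tableSizes = new HashMap<>();

    FakeCatalogLoader() {
        tableSizes.put("mydb.mytable", 98406674162L);
    }

    @Override
    public long getInputSizeInBytes(String location) {
        Long size = tableSizes.get(location);
        return size != null ? size : -1L; // -1 means "I don't know"
    }
}

public class LoaderSizeDemo {
    public static void main(String[] args) {
        SizeReportingLoader loader = new FakeCatalogLoader();
        System.out.println(loader.getInputSizeInBytes("mydb.mytable")); // 98406674162
        System.out.println(loader.getInputSizeInBytes("mydb.unknown")); // -1
    }
}
```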

PARTIAL SIZE REPORTING COMMENT:

Having size be all-or-none was intentional. It seemed very confusing for Pig to 
base a decision on one number (and log that input size) and then have the MR job 
read a different amount of data. I think it's best to keep the current behavior 
and only make this optimization when it's based on the actual input size.
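The all-or-none policy above can be sketched as follows (a minimal illustration with assumed names, not Pig's real code): the total is only reported when every input's size is known; one unknown input makes the whole lookup fail, so Pig falls back to its default rather than deciding on partial data.

```java
public class TotalInputSize {
    // sizes[i] is one input's byte count, or -1 if that input's size is unknown.
    public static long totalOrUnknown(long[] sizes) {
        long total = 0;
        for (long s : sizes) {
            if (s < 0) {
                return -1; // one unknown input poisons the whole total
            }
            total += s;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalOrUnknown(new long[]{100, 200})); // 300
        System.out.println(totalOrUnknown(new long[]{100, -1}));  // -1
    }
}
```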

METHOD NAME COMMENT:

How does {{getInputSizeFromLoader}} sound?
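A rough shape for such a shared utility (the method name comes from the comment above; the surrounding types and the stubbed filesystem lookup are illustrative assumptions, not Pig's actual classes): ask the loader first, and only fall back to the existing path-prefix whitelist when the loader does not know.

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch of a utility shared by JobControlCompiler and PigStorage.
// The loader is consulted first; the legacy prefix-based filesystem lookup
// is the fallback, which is why hdfs: paths would keep working while a
// dbname.tablename location would otherwise report size 0.
public class InputSizeUtil {
    interface Loader {
        long getInputSizeInBytes(String location); // -1 if unknown
    }

    // Prefixes currently special-cased by the filesystem-based lookup.
    private static final List<String> FS_PREFIXES =
            Arrays.asList("/", "hdfs:", "file:", "s3n:");

    static long getInputSizeFromLoader(Loader loader, String location) {
        long fromLoader = loader.getInputSizeInBytes(location);
        if (fromLoader >= 0) {
            return fromLoader; // loader knows its own input best
        }
        if (FS_PREFIXES.stream().anyMatch(location::startsWith)) {
            return sizeFromFilesystem(location); // existing whitelist behavior
        }
        return -1; // unknown, e.g. a db.table name with no statistics
    }

    private static long sizeFromFilesystem(String location) {
        return 42L; // stand-in for the real FileSystem length lookup
    }

    public static void main(String[] args) {
        Loader noStats = loc -> -1L;
        System.out.println(getInputSizeFromLoader(noStats, "hdfs://nn/data")); // 42
        System.out.println(getInputSizeFromLoader(noStats, "mydb.mytable"));   // -1
    }
}
```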
                
> Automagically setting parallelism based on input file size does not work with 
> HCatalog
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-2573
>                 URL: https://issues.apache.org/jira/browse/PIG-2573
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Travis Crawford
>            Assignee: Travis Crawford
>         Attachments: PIG-2573_get_size_from_stats_if_possible.diff, 
> PIG-2573_move_getinputbytes_to_loadfunc.diff
>
>
> PIG-2334 was helpful in understanding this issue. The short version is that the 
> input file size is only computed if the path begins with a whitelisted prefix, currently:
> * /
> * hdfs:
> * file:
> * s3n:
> As HCatalog locations use the form {{dbname.tablename}}, the input file size 
> is not computed, and the size-based parallelism optimization breaks.
> DETAILS:
> I discovered this issue by comparing two runs of the same script, one loading 
> regular HDFS paths and one loading HCatalog db.table names. I just happened to 
> notice the difference in the "Setting number of reducers" line.
> {code:title=Loading HDFS files reducers is set to 99}
> 2012-03-08 01:33:56,522 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>  - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=98406674162
> 2012-03-08 01:33:56,522 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>  - Neither PARALLEL nor default parallelism is set for this job. Setting 
> number of reducers to 99
> {code}
> {code:title=Loading with an HCatalog db.table name}
> 2012-03-08 01:06:02,283 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>  - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=0
> 2012-03-08 01:06:02,283 [main] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>  - Neither PARALLEL nor default parallelism is set for this job. Setting 
> number of reducers to 1
> {code}
> Possible fix: Pig should just ask the loader for the size of its inputs 
> rather than special-casing certain location types.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
