[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1518:
--------------------------

    Release Note: 
Feature: combine input splits that are smaller than the value of the property 
"pig.maxCombinedSplitSize" or, if that property is not set, smaller than the 
default block size of the file system at the load's location. This feature 
can be turned off by setting the property "pig.splitCombination" to "false". 
When splits are combined, a message like "Total input paths (combined) to 
process : 7" is logged. 
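
As a rough illustration of the rule above, the sketch below resolves the 
effective settings from a set of properties. Only the two property names and 
the value "false" come from this release note. The class, the method names, 
the 64 MB fallback block size, and the choice to treat any value other than 
"false" as enabled are assumptions made for the example, not Pig's actual 
implementation.

```java
import java.util.Properties;

// Hypothetical helper, not part of Pig: shows how the two properties interact.
class SplitCombinationSettings {

    // Assumption: combination is on unless the property is exactly "false".
    static boolean combinationEnabled(Properties props) {
        return !"false".equals(props.getProperty("pig.splitCombination", "true"));
    }

    // Uses "pig.maxCombinedSplitSize" when set, else the given default block size.
    static long maxCombinedSplitSize(Properties props, long defaultBlockSize) {
        String value = props.getProperty("pig.maxCombinedSplitSize");
        return value == null ? defaultBlockSize : Long.parseLong(value.trim());
    }

    public static void main(String[] args) {
        long assumedBlockSize = 64L * 1024 * 1024; // stand-in for the file system default
        Properties props = new Properties();
        System.out.println(combinationEnabled(props));
        System.out.println(maxCombinedSplitSize(props, assumedBlockSize));

        props.setProperty("pig.maxCombinedSplitSize", "134217728");
        props.setProperty("pig.splitCombination", "false");
        System.out.println(combinationEnabled(props));
        System.out.println(maxCombinedSplitSize(props, assumedBlockSize));
    }
}
```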

This feature applies when a user input, or an intermediate input, consists of 
many small files that would otherwise cause many "under-fed" mappers to be 
launched and potentially slow down the execution.
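
To make the effect on the number of mappers concrete, here is a minimal 
sketch that greedily packs small split sizes into groups no larger than a 
threshold. The greedy strategy, the class name, and the method name are 
assumptions chosen for illustration. The release note does not describe the 
actual combination algorithm, and this is not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Pig code: each returned group stands for one mapper.
class SmallSplitCombiner {

    // Packs split sizes (in any consistent unit) into groups whose total stays
    // at or below maxCombinedSize. Splits at or above the threshold stay alone.
    static List<List<Long>> combine(List<Long> splitSizes, long maxCombinedSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentTotal = 0;
        for (long size : splitSizes) {
            if (size >= maxCombinedSize) {
                List<Long> single = new ArrayList<>();
                single.add(size);
                groups.add(single); // large split: left uncombined
                continue;
            }
            if (!current.isEmpty() && currentTotal + size > maxCombinedSize) {
                groups.add(current); // current group is full, start a new one
                current = new ArrayList<>();
                currentTotal = 0;
            }
            current.add(size);
            currentTotal += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Seven 30 MB files against a 128 MB threshold (sizes given in MB).
        List<Long> sizes = Arrays.asList(30L, 30L, 30L, 30L, 30L, 30L, 30L);
        List<List<Long>> groups = combine(sizes, 128L);
        System.out.println(sizes.size() + " splits combined into " + groups.size() + " groups");
    }
}
```

Without combination each of the seven files would get its own mapper. In this 
sketch they end up in two groups.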

This change does not cause any backward compatibility issue, except when a 
loader implementation makes use of the PigSplit object passed through the 
prepareToRead method. In that case a rebuild of the loader might be necessary, 
because PigSplit's definition has been modified. However, we currently know of 
no external use of the object.

This change also requires the loader to be stateless across invocations of 
the prepareToRead method. That is, the method should reset any internal state 
that is not affected by the RecordReader argument. Otherwise, this feature 
should be disabled.
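
The statelessness requirement can be pictured with the self-contained sketch 
below. It does not use the real Pig or Hadoop classes: LineReader is a 
hypothetical stand-in for RecordReader, and HeaderSkippingLoader is an 
invented loader whose only per-split state is a "header already skipped" 
flag. The point is just that prepareToRead resets that flag each time it is 
called, so the same loader instance behaves correctly when it is handed 
several underlying splits in a row.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration, not Pig code.
class StatelessLoaderSketch {

    // Stand-in for RecordReader: returns the next line, or null at the end.
    interface LineReader {
        String next();
    }

    static class HeaderSkippingLoader {
        private LineReader reader;
        private boolean headerSkipped;

        // Called once per underlying split; resets all per-split state.
        void prepareToRead(LineReader reader) {
            this.reader = reader;
            this.headerSkipped = false; // without this reset, later headers would be read as data
        }

        // Returns the next data line, skipping the first line of each split.
        String getNext() {
            if (!headerSkipped) {
                reader.next();
                headerSkipped = true;
            }
            return reader.next();
        }
    }

    static LineReader readerOver(List<String> lines) {
        Iterator<String> it = lines.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    // Feeds several "files" to one loader instance, as a combined split would.
    static int countRecords(HeaderSkippingLoader loader, List<List<String>> files) {
        int count = 0;
        for (List<String> file : files) {
            loader.prepareToRead(readerOver(file));
            while (loader.getNext() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(
                Arrays.asList("header", "a", "b"),
                Arrays.asList("header", "c"));
        System.out.println(countRecords(new HeaderSkippingLoader(), files));
    }
}
```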

In addition, if a loader implements IndexableLoadFunc, or implements 
OrderedLoadFunc and CollectableLoadFunc, its input splits are not subject to 
combination.

  was:
Feature: combine splits of sizes smaller than the value of property 
"pig.maxCombinedSplitSize" or, if the property of "pig.maxCombinedSplitSize" is 
not set, the file system default block size of the load's location. This 
feature can be turned off through setting the property "pig.noSplitCombination" 
to true. When such a combination is performed, a log message like "Total input 
paths (combined) to process : 7" will be logged. 

This feature will be applicable if a user input, or an intermediate input, has 
many small files to be loaded that would otherwise cause many more "under-fed" 
mappers to be launched and potentially slowdown of the execution.

This change will not cause any backward compatibility issue except if a loader 
implementation makes use of the PigSplit object passed through the 
prepareToRead method where a rebuild of the loader might be necessary as 
PigSplit's definition has been modified. However, currently we know of no 
external use of the object.

In addition, if a loader implements IndexableLoadFunc, or implements 
OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to 
possible combinations.


> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518-0.7.0.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small 
> files in the input. In this case a separate map is created for each file, 
> which could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
