Satish Subhashrao Saley created PIG-5365:

             Summary: Add support for PARALLEL clause in LOAD statement
                 Key: PIG-5365
             Project: Pig
          Issue Type: New Feature
            Reporter: Satish Subhashrao Saley
            Assignee: Satish Subhashrao Saley

It is tiresome to keep telling users who read TBs of data to increase 
pig.maxCombinedSplitSize to 512MB or 1GB so that they avoid launching too many 
map tasks (50-100K) just to load the data. That many tasks add unnecessary 
container-launch overhead and waste a lot of resources. 

It would be good to have a new setting that configures the maximum number of 
tasks, overrides pig.maxCombinedSplitSize, and combines more splits into one 
task. For example, with pig.max.input.splits=30000 and a data size of 2TB, it 
will combine more than 128MB (the default pig.maxCombinedSplitSize) per task to 
have a maximum of 30K tasks. That will go as default into and apply to all 
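
The arithmetic behind such a cap can be sketched as follows. This is only an illustration of the idea, not Pig's implementation: the function names are hypothetical, pig.max.input.splits is the setting proposed in this issue, and the sketch assumes the combined split size is never lowered below the 128MB default.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4

# Default pig.maxCombinedSplitSize quoted in this issue (128MB).
DEFAULT_COMBINED_SPLIT_SIZE = 128 * MIB


def combined_split_size(total_bytes, max_input_splits,
                        default_size=DEFAULT_COMBINED_SPLIT_SIZE):
    """Bytes per task needed to stay within max_input_splits tasks.

    Hypothetical helper: never returns less than default_size.
    """
    if max_input_splits <= 0:
        raise ValueError("max_input_splits must be positive")
    # Integer ceiling division avoids float rounding on large byte counts.
    needed = -(-total_bytes // max_input_splits)
    return max(default_size, needed)


def task_count(total_bytes, split_size):
    """Number of tasks if every task reads at most split_size bytes."""
    return -(-total_bytes // split_size)


if __name__ == "__main__":
    for size in (2 * TIB, 10 * TIB):
        split = combined_split_size(size, 30000)
        print(size // TIB, "TiB:", split, "bytes per task,",
              task_count(size, split), "tasks")
```

Under these assumptions, a 2 TiB input already fits in 16,384 tasks at 128 MiB per task, so the cap only changes the split size for larger inputs, such as the 10 TiB case above (about 350 MiB per task for 30,000 tasks).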


This message was sent by Atlassian JIRA
