Cache split information details for data with large number of small part files
------------------------------------------------------------------------------

                 Key: PIG-1972
                 URL: https://issues.apache.org/jira/browse/PIG-1972
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: 0.8.0
         Environment: Pig 0.8 version with PigMix 
http://wiki.apache.org/pig/PigMix
            Reporter: Rajesh Balamohan


While running scalability benchmarks with Pig 0.8 & PigMix, L14 query listed in 
http://wiki.apache.org/pig/PigMix showed no scalability characteristics (i.e, 
for the same problem size response time should decrease as we increase the 
number of nodes)

Investigating further revealed that L14 query merge-joins small dataset and 
another large dataset. If the small dataset has many part files with very 
little amount of data, it causes a huge pressure on NameNode. This is because 
it is read as a side file in all map slows.

In the environment where I ran the experiment, small dataset was spread across 
1900+ part files in HDFS.

Following codepath has the perf issue.
DefaultIndexableLoader--> seekNear() --> initRightLoader() is causing the huge 
delay. Since
"users_sorted" data is spread across 1900+ small files, 
FileInputFormat.getSplits() hits the namenode too
frequently. 

i.e, (number of machines * number of map slots * 1900+ times). This is the 
reason why L14 is not scaling up.


Suggestion would be to cache the splitInformation of the small dataset instead 
of hitting the namenode too frequently.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to