Cache split information details for data with large number of small part files
------------------------------------------------------------------------------
Key: PIG-1972
URL: https://issues.apache.org/jira/browse/PIG-1972
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.8.0
Environment: Pig 0.8 version with PigMix
http://wiki.apache.org/pig/PigMix
Reporter: Rajesh Balamohan
While running scalability benchmarks with Pig 0.8 and PigMix, the L14 query
listed at http://wiki.apache.org/pig/PigMix did not scale. For a fixed problem
size, response time should decrease as the number of nodes increases, but it
did not.
Further investigation revealed that the L14 query merge-joins a small dataset
with a large one. If the small dataset consists of many part files that each
hold very little data, it puts heavy pressure on the NameNode, because the
small dataset is read as a side file in every map slot.
In the environment where I ran the experiment, the small dataset was spread
across 1900+ part files in HDFS.
The following code path has the performance issue:
DefaultIndexableLoader --> seekNear() --> initRightLoader()
The initRightLoader() step causes the huge delay. Since the "users_sorted"
data is spread across 1900+ small files, FileInputFormat.getSplits() hits the
NameNode too frequently, i.e., (number of machines * number of map slots *
1900+) times. This is the reason why L14 is not scaling up.
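To make the arithmetic concrete, here is a small sketch that evaluates the
product above. Only the 1900 part-file count comes from the report; the
machine counts, the slots-per-machine value, and the class and method names
are hypothetical and chosen purely for illustration.

```java
// Illustrative only: evaluates (machines * map slots * part files).
final class NameNodeLoadEstimate {

    /** Lookups issued when every map slot lists the small dataset once. */
    static long lookups(long machines, long mapSlotsPerMachine, long partFiles) {
        return machines * mapSlotsPerMachine * partFiles;
    }

    public static void main(String[] args) {
        long partFiles = 1900;      // from the report ("1900+")
        long mapSlotsPerMachine = 4; // hypothetical value
        long[] clusterSizes = {10, 20, 40}; // hypothetical values

        for (long machines : clusterSizes) {
            System.out.println(machines + " machines -> "
                    + lookups(machines, mapSlotsPerMachine, partFiles)
                    + " lookups");
        }
    }
}
```

Under these assumed numbers the NameNode load grows linearly with the cluster
size, which is consistent with adding nodes not reducing response time.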
The suggestion is to cache the split information of the small dataset instead
of hitting the NameNode repeatedly.
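A minimal sketch of the caching idea, not a patch against Pig: the class
SplitInfoCache and its loader function are hypothetical names, and the loader
merely stands in for the expensive FileInputFormat.getSplits() call. The
sketch memoizes the split listing per input location inside one JVM. How the
cached listing would be shared across map tasks is not specified here and is
left open.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper: computes the split list for a location at most once. */
final class SplitInfoCache<S> {
    private final ConcurrentMap<String, List<S>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<S>> loader;

    /** @param loader stand-in for the expensive listing (assumed non-null result) */
    SplitInfoCache(Function<String, List<S>> loader) {
        this.loader = loader;
    }

    /** Returns the cached splits, invoking the loader only on the first request. */
    List<S> getSplits(String location) {
        return cache.computeIfAbsent(location, loader);
    }

    public static void main(String[] args) {
        AtomicInteger listings = new AtomicInteger();
        // The lambda fakes a listing; real code would query the file system here.
        SplitInfoCache<String> splits = new SplitInfoCache<>(location -> {
            listings.incrementAndGet();
            return Arrays.asList(location + "/part-00000", location + "/part-00001");
        });

        for (int i = 0; i < 3; i++) {
            splits.getSplits("users_sorted");
        }
        // prints "expensive listings performed: 1"
        System.out.println("expensive listings performed: " + listings.get());
    }
}
```

ConcurrentHashMap.computeIfAbsent is used so that concurrent callers asking
for the same location trigger the listing only once.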