Rohini Palaniswamy created PIG-3669:
---------------------------------------
Summary: Wrong Usage of Scalar can bring down Namenode
Key: PIG-3669
URL: https://issues.apache.org/jira/browse/PIG-3669
Project: Pig
Issue Type: Bug
Reporter: Rohini Palaniswamy
People often get confused and use . instead of :: after a join (PIG-2134) to
reference fields of the join. Pig takes the expression to be a scalar, tries
to read the whole join output assuming it contains only one record, and then
fails with "scalar has more than one row in the output". ReadScalars uses
InterStorage to read the scalar data. InterStorage uses InterInputFormat,
which extends FileInputFormat. getSplits() does a listStatus on all the files
in the dir to construct the splits, and InterRecordReader is then used on the
splits to read the data. When the data is really huge and spread over a lot of
files, this can cause the namenode to run out of memory and crash. The risk is
higher with the recent optimization in hadoop 0.23/2.x, where FileInputFormat
uses listLocatedStatus instead of listStatus + getBlockLocations to reduce the
number of calls to the Namenode.
In the particular case we encountered, the join output had 6.5K files. The
job that did ReadScalars had 8K+ tasks (with speculative execution), and all
of them doing listStatus for the 6.5K files caused the NN queue to fill up with
huge responses (block locations make the responses even bigger). It also
saturated the network, so responses built up in the NN without being sent,
finally leading it to crash with OOM.
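To give a rough sense of the scale involved, here is a back-of-envelope sketch in Python. The task and file counts are the ones quoted above; the per-entry response size is a made-up figure for illustration only (it was not measured), and the function name listing_load is hypothetical. It also assumes every task lists every file exactly once.

```python
# Back-of-envelope estimate of the listing load on the NameNode.
# Task and file counts come from the report; bytes_per_entry is a
# hypothetical figure chosen only for illustration, not a measurement.

def listing_load(tasks, files, bytes_per_entry):
    """Return (file entries returned, total response bytes), assuming
    every task lists every file exactly once."""
    entries = tasks * files
    return entries, entries * bytes_per_entry


if __name__ == "__main__":
    entries, total_bytes = listing_load(tasks=8_000, files=6_500,
                                        bytes_per_entry=300)
    print(f"{entries:,} file entries, ~{total_bytes / 1e9:.1f} GB of responses")
```

Even with a modest assumed entry size, the product of tasks and files is what dominates, so the total grows quickly once both are in the thousands.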