Rohini Palaniswamy created PIG-3669:
---------------------------------------

             Summary: Wrong Usage of Scalar can bring down Namenode
                 Key: PIG-3669
                 URL: https://issues.apache.org/jira/browse/PIG-3669
             Project: Pig
          Issue Type: Bug
            Reporter: Rohini Palaniswamy


People often get confused and use "." after a join instead of "::"
(PIG-2134) to reference fields of the join. Pig treats the expression as a
scalar, tries to read the entire join output on the assumption that it holds
a single record, and then fails with "scalar has more than one row in the
output". ReadScalars uses InterStorage to read the scalar data. InterStorage
uses InterInputFormat, which extends FileInputFormat; getSplits() does a
listStatus on all the files in the directory to construct the splits, and
InterRecordReader then reads the data from those splits. When the data is
very large and spread over many files, this can cause the Namenode to run
out of memory and crash. The risk is higher with the recent optimization in
Hadoop 0.23/2.x, where FileInputFormat uses listLocatedStatus instead of
listStatus + getBlockLocations to reduce the number of calls to the
Namenode.
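
To illustrate the trade-off behind that optimization, here is a rough sketch
in Python (not Pig or Hadoop code). The function name and the assumption that
one listing call returns the whole directory are mine, for illustration only;
real HDFS listings come back in batches. It only counts the Namenode calls
made by a single getSplits() over one directory.

```python
def namenode_calls(num_files: int, located: bool) -> int:
    """Rough count of Namenode calls for one getSplits() over one directory.

    Assumptions (not from the report): a single listing call returns the
    whole directory, and the older path asks for block locations once per
    file after the listing.
    """
    if located:
        # listLocatedStatus: block locations come back inside the listing.
        return 1
    # listStatus, then one getBlockLocations call per file.
    return 1 + num_files


if __name__ == "__main__":
    for located in (False, True):
        print(located, namenode_calls(6_500, located))
```

In this simplified model, 6,500 files cost 6,501 calls on the older path and
1 call with listLocatedStatus, but that one response then carries the block
locations for every file in the directory.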

In the particular case we encountered, the join output had 6.5K files. The
job that did ReadScalars had 8K+ tasks (with speculative execution), and all
of them doing listStatus for the 6.5K files caused the NN queue to fill up
with huge responses (block locations make the responses even bigger). It
also saturated the network, so responses built up in the NN without being
sent, finally leading it to crash with OOM.
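
As a back-of-the-envelope check on the scale involved, the numbers above can
be multiplied out. This is only a sketch: the constant and function names are
made up for illustration, 8,000 is a lower bound for "8K+", and it assumes
every task lists every file exactly once.

```python
FILES_IN_JOIN_OUTPUT = 6_500  # "6.5K files" from the report
TASKS = 8_000                 # "8K+ tasks", so this is a lower bound


def file_entries_returned(tasks: int, files: int) -> int:
    """Total file entries the Namenode returns if each task lists every file once."""
    return tasks * files


total = file_entries_returned(TASKS, FILES_IN_JOIN_OUTPUT)
print(f"{total:,} file entries")  # prints "52,000,000 file entries"
```

That is at least 52 million file entries for one job, before counting the
block locations attached to each entry.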



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
