Thanks for the clarification. We went down the path of using a UDF
inside the FOREACH after the GROUP as, yes, there are >5k unique
groups. We can't reduce the number of unique groups, as there is a
downstream application whose requirements we must meet.
To further the question, our current solution is of this form:
A = load 'data';
B = group A by $0;
C = foreach B generate storeUdf(*);
where storeUdf(*) opens a storage stream for each individual group, so
we get around the number-of-open-streams issue. Do you have any
pointers on opening/closing the stream and binding to PigStorage inside
the storeUdf function? We mimic how MultiStorage opens and closes
streams/PigStorage - is there anything else there I should be looking
out for, or is that pretty standard?
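In case it helps make the question concrete, the UDF is roughly shaped
like the sketch below. This is trimmed and illustrative, not our actual
storeUdf; the class name and output path are placeholders, it assumes
the group key is safe to use as a file name, and it assumes the pre-0.7
bindTo(OutputStream)/putNext/finish interface that PigStorage exposes:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class StorePerGroup extends EvalFunc<Long> {
    private static final String BASE = "/output/groups";  // illustrative output root

    @Override
    public Long exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        String key = String.valueOf(input.get(0));   // the GROUP key (assumed filename-safe)
        DataBag bag = (DataBag) input.get(1);        // all tuples for this group

        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path(BASE, key + ".txt");
        FSDataOutputStream os = fs.create(out, true);  // one stream per group
        PigStorage storage = new PigStorage("\t");
        storage.bindTo(os);                            // bind PigStorage to the stream
        long written = 0;
        try {
            for (Iterator<Tuple> it = bag.iterator(); it.hasNext();) {
                storage.putNext(it.next());            // serialize each tuple
                written++;
            }
        } finally {
            storage.finish();                          // flush PigStorage's buffers
            os.close();                                // close promptly to free the slot
        }
        return written;
    }
}

The main thing we are unsure about is whether finish() followed by
close() on the stream is enough cleanup, or whether MultiStorage does
something more that we should be copying.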
Thanks!
Jennie
On Feb 2, 2010, at 4:33 AM, Ankur C. Goel wrote:
Jennie,
A Hadoop cluster enforces a limit on the number of
concurrent streams that can be kept open at any time.
The limit comes from the number of concurrent I/O threads a datanode
can run, set by the cluster-level configuration
parameter dfs.datanode.max.xcievers.
So the max number of open streams = number of datanodes * threads per
datanode (e.g. 20 datanodes * 256 xcievers each gives roughly 5,000
concurrent streams).
MultiStorage can do what you want, but it is constrained by the above
limit rather than by anything in the UDF itself, because going past the
limit will cause datanodes to drop connections.
It is also not a good idea to use MultiStorage if you expect more than
a few thousand unique groups as output from your reducers.
Try reducing the number of unique groups before storing. You should
be able to do it via a simple UDF.
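For example, a trivial bucketing UDF along these lines (illustrative
only - the class name and bucket count are placeholders) lets you group
on bucketKey($0) instead of the raw key:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class BucketKey extends EvalFunc<String> {
    private static final int NUM_BUCKETS = 512;   // keep well under the open-stream limit

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // Hash the original key into one of NUM_BUCKETS stable buckets.
        String key = String.valueOf(input.get(0));
        int bucket = (key.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
        return "bucket-" + bucket;
    }
}

Grouping on the bucketed key keeps the number of distinct output groups
bounded by NUM_BUCKETS.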
-Ankur
On 2/1/10 11:32 AM, "Rekha Joshi" <[email protected]> wrote:
If it is Pig 0.3 or higher, you can just use the STORE command
multiple times in the Pig script to store results directly into HDFS.
A = LOAD ...
...
B = GROUP A ...
C = GROUP A ...
...
STORE B ...
STORE C ...
Also look into
http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#Multi-Query+Execution
I do not know how big your data set is, but you might be able to
increase the memory parameters to be able to do it in a single script.
Cheers,
/R
On 1/30/10 7:36 AM, "Jennie Cochran-Chinn" <[email protected]>
wrote:
I had a question about storing data to different files. The basic
gist of what we are doing is taking a large set of data, performing a
group by, and then storing each group's DataBag into a distinct file
(on S3). Currently we are using a UDF inside a FOREACH that
writes the DataBag to a local tmp file and then pushes it to S3. This
does not seem to be the ideal way to do this, and we were wondering if
anyone had any suggestions. I know there is the MultiStorage function
in the Piggybank, but given that we have many different groups, it
does not appear that it would scale very well. For instance, in some
experiments the cluster I was using could not open new streams and
thus failed.
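Roughly, the UDF does something like the following (an illustrative
sketch rather than the real code; the bucket name and paths are
placeholders, the group key is assumed to be filename-safe, and the S3
credentials are assumed to be in the Hadoop config):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class StoreGroupViaTmp extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        String key = String.valueOf(input.get(0));   // the GROUP key
        DataBag bag = (DataBag) input.get(1);        // all tuples for this group

        // 1. Dump the bag to a local temp file, one tuple per line.
        File tmp = File.createTempFile("group-" + key + "-", ".txt");
        PrintWriter w = new PrintWriter(new FileWriter(tmp));
        try {
            for (Iterator<Tuple> it = bag.iterator(); it.hasNext();) {
                w.println(it.next().toDelimitedString("\t"));
            }
        } finally {
            w.close();
        }

        // 2. Push the temp file to S3 via the Hadoop s3n filesystem.
        Path dest = new Path("s3n://my-bucket/groups/" + key + ".txt");
        FileSystem fs = dest.getFileSystem(new Configuration());
        fs.copyFromLocalFile(true, true, new Path(tmp.getAbsolutePath()), dest);
        return dest.toString();
    }
}

The round trip through local disk is the part that feels wasteful,
which is why we are looking for a better approach.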
Thanks,
Jennie