Out of curiosity, how many tuples exist in each group's data bag? I'd imagine it's a highly variable number, but what order of magnitude are you dealing with? I think it would make more sense to implement this in Java MapReduce using MultipleOutputs or MultipleOutputFormat, as these classes were designed for exactly this kind of thing. As much as I love Pig, sometimes you do have to resort to Java.
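If you go that route, the reducer side looks roughly like the sketch below, using MultipleOutputs from the newer org.apache.hadoop.mapreduce API. This is untested; the class name and the assumption that the group key is safe to use as a file name are mine:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Writes each group's records into a separate file named after the group key.
public class GroupSplitReducer extends Reducer<Text, Text, NullWritable, Text> {

  private MultipleOutputs<NullWritable, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<NullWritable, Text>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // Third argument is a base output path under the job's output dir;
      // MultipleOutputs opens one writer per distinct path it sees.
      mos.write(NullWritable.get(), value, key.toString());
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close(); // flush and close all per-group writers
  }
}

Each reducer holds one writer open per distinct key it sees, so you'd still want to sanity-check your key count against the stream limit Ankur describes below. Pairing this with LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) also avoids empty default part files.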
On Tue, Feb 2, 2010 at 1:21 PM, Jennie Cochran-Chinn <[email protected]> wrote:

> Thanks for the clarification. We went down the path of using a UDF inside
> the FOREACH after the GROUP as yes, there are >5k unique groups. We can't
> reduce the number of unique groups as there is a downstream application
> whose requirements we must meet.
>
> To further the question, our current solution is of this form:
>
> A = load 'data';
> B = group A by $0;
> C = foreach B generate storeUdf(*);
>
> where storeUdf(*) opens a storage stream for the individual groups and we
> get around the # of open streams issue. Do you have any pointers on
> opening/closing the stream and binding to PigStorage inside the storeUdf
> function? We mimic how MultiStorage opens and closes streams/PigStorage -
> is there anything else there I should be looking out for, or is that
> pretty standard?
>
> Thanks!
> Jennie
>
> On Feb 2, 2010, at 4:33 AM, Ankur C. Goel wrote:
>
>> Jennie,
>> A Hadoop cluster has an enforced limit on the number of concurrent
>> streams that can be kept open at any time. This limit is the number of
>> concurrent threads that a DataNode can run for doing I/O, specified by
>> the cluster-level config parameter dfs.datanode.max.xcievers.
>> So the max number of open streams = number of nodes * threads per
>> DataNode.
>>
>> MultiStorage can do what you want, but it is constrained by the above
>> limit, not by anything in MultiStorage itself, because going past the
>> limit will cause DataNodes to drop connections.
>> It is also not a good idea to use MultiStorage if you expect more than a
>> few thousand unique groups as outputs from your reducers.
>>
>> Try reducing the number of unique groups before storing. You should be
>> able to do it via a simple UDF.
>>
>> -@nkur
>>
>> On 2/1/10 11:32 AM, "Rekha Joshi" <[email protected]> wrote:
>>
>> If it's Pig 0.3 or higher, you can just use the STORE command multiple
>> times in the Pig script to store results directly into HDFS:
>>
>> A = LOAD ...
>> ...
>> B = GROUP A ...
>> C = GROUP A ...
>> ...
>> STORE B ...
>> STORE C ...
>>
>> Also look into
>> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#Multi-Query+Execution
>> I do not know how big your data set is, but you might be able to
>> increase the memory parameters to do it all in a single script.
>>
>> Cheers,
>> /R
>>
>> On 1/30/10 7:36 AM, "Jennie Cochran-Chinn" <[email protected]> wrote:
>>
>> I had a question about storing data to different files. The basic gist
>> of what we are doing is taking a large set of data, performing a group
>> by, and then storing each group's DataBag into a distinct file (on S3).
>> Currently we are using a UDF inside a FOREACH that writes the DataBag to
>> a local tmp file and then pushes it to S3. This does not seem to be the
>> ideal way to do this, and we were wondering if anyone had any
>> suggestions. I know there is the MultiStorage function in the piggybank,
>> but given that we have many different groups, it does not appear that it
>> would scale very well. For instance, in some experiments the cluster I
>> was using could not open new streams and thus failed.
>>
>> Thanks,
>> Jennie

--
Zaki Rahaman
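P.S. Jennie, re: your question about binding to PigStorage inside storeUdf: the pattern is roughly the sketch below. This is untested and written against the pre-0.7 StoreFunc interface (bindTo(OutputStream)/putNext(Tuple)/finish()); the class name, base path, and the way the FileSystem is obtained are all illustrative:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Illustrative UDF: invoked once per group, writes the group's bag to its
// own file by binding a PigStorage instance to a freshly opened stream.
public class StoreGroupUdf extends EvalFunc<Long> {

  private static final Path BASE = new Path("/out/groups"); // illustrative

  @Override
  public Long exec(Tuple input) throws IOException {
    // After "B = group A by $0;" the input tuple is (group, bag-of-A).
    Object groupKey = input.get(0);
    DataBag bag = (DataBag) input.get(1);

    // new Configuration() picks up the cluster config from the task's
    // classpath; adjust if you need S3 credentials etc.
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path(BASE, groupKey.toString()));
    try {
      PigStorage storage = new PigStorage("\t");
      storage.bindTo(out);
      for (Tuple t : bag) {
        storage.putNext(t);
      }
      storage.finish();
    } finally {
      out.close(); // always release the stream so we never leak one
    }
    return bag.size(); // give the FOREACH something to project
  }
}

The one thing I'd watch beyond what MultiStorage does is closing the stream in a finally block as above, so a bad record doesn't leave an open stream counting against the DataNode's xciever limit.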
