Ankur C. Goel
Tue, 02 Feb 2010 04:34:50 -0800
Jennie,
A Hadoop cluster has an enforced limit on the number of concurrent
streams that can be kept open at any time.
This limit is the number of concurrent threads that a datanode can run
for doing I/O, specified by the cluster-level config parameter
dfs.datanode.max.xcievers.
So the max number of open streams = number of nodes * threads per datanode.
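As a rough illustration of that arithmetic, here is a minimal sketch. The node count and thread setting are made-up values, and the function names are hypothetical, not part of Hadoop or Pig. The result is only an upper bound; the usable number is likely lower, since other jobs on the cluster share the same datanode threads.

```python
def max_open_streams(num_datanodes: int, threads_per_datanode: int) -> int:
    """Upper bound on concurrently open streams across the cluster.

    threads_per_datanode stands for the value of dfs.datanode.max.xcievers.
    """
    return num_datanodes * threads_per_datanode


def fits_within_limit(streams_needed: int,
                      num_datanodes: int,
                      threads_per_datanode: int) -> bool:
    """True if the requested number of open streams stays within the bound."""
    return streams_needed <= max_open_streams(num_datanodes, threads_per_datanode)


# Illustrative values only: 10 datanodes, 256 I/O threads each.
print(max_open_streams(10, 256))          # 2560
print(fits_within_limit(2000, 10, 256))   # True
print(fits_within_limit(5000, 10, 256))   # False
```

With these example numbers, a job that needs one open stream per group would already be over the bound at 5000 unique groups.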
MultiStore can do what you want, but it is constrained by the above limit
rather than by anything in MultiStore itself, because going past the limit
will cause datanodes to drop connections. Also, it is not a good idea to
use MultiStore if you expect more than a few thousand unique groups as
outputs from your reducers. Try reducing the number of unique groups
before storing. You should be able to do it via a simple UDF.

-...@nkur

On 2/1/10 11:32 AM, "Rekha Joshi" <rekha...@yahoo-inc.com> wrote:

If it is Pig 0.3 or higher, you would be able to just use the STORE
command multiple times in the Pig script to store results directly into
HDFS.

A = LOAD ...
...
B = GROUP A ...
C = GROUP A ...
...
STORE B ...
STORE C ...

Also look into
http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#Multi-Query+Execution

I do not know how large your data set is, but you might be able to
increase the memory parameters to be able to do it in a single script.

Cheers,
/R

On 1/30/10 7:36 AM, "Jennie Cochran-Chinn" <jcoch...@adconion.com> wrote:

I had a question about storing data to different files. The basic gist
of what we are doing is taking a large set of data, performing a group by,
and then storing each group's dataBag into a distinct file (on S3).
Currently we are using a UDF inside a FOREACH loop that writes the dataBag
to a local tmp file and then pushes it to S3. This does not seem to be the
ideal way to do this, and we were wondering if anyone had any suggestions.
I know there is the MultiStore function in the piggybank, but given that
we have many different groups, it does not appear that it would scale very
well. For instance, in some experiments the cluster I was using could not
open new streams and thus failed.

Thanks,
Jennie