Hey Mike,

The code looks right to me. Let me whip up a test and see if I can replicate it easily -- is there anything funky beyond what's in your snippet that I should be aware of?
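For reference, here's roughly the shape of the test I have in mind: one SequenceFile source fanned out to a parent text output plus a couple of child outputs from the same PTable, all planned in a single pipeline. The paths and the ReproTest jar class are placeholders, and I'm borrowing your ColumnKey / ColumnDataArrayWritable / MyDoFn names:

import org.apache.crunch.impl.mr.MRPipeline
import org.apache.crunch.io.From
import org.apache.crunch.io.To
import org.apache.crunch.types.writable.Writables
import org.apache.hadoop.fs.Path

// one source, several text outputs, all planned before a single done()
def pipeline = new MRPipeline(ReproTest.class)  // ReproTest is just a placeholder jar class

def parentPath = "/tables/message_data"                   // placeholder path
def childPaths = ["/tables/child_a", "/tables/child_b"]   // placeholder paths

def source = From.sequenceFile(new Path(parentPath),
        Writables.writables(ColumnKey.class),
        Writables.writables(ColumnDataArrayWritable.class))

def data = pipeline.read(source)

// parent output
pipeline.write(data.parallelDo(new MyDoFn(parentPath), Writables.strings()),
        To.textFile("$parentPath/csv"))

// child outputs fanned out from the same PTable
childPaths.each { childPath ->
    pipeline.write(data.parallelDo(new MyDoFn(childPath), Writables.strings()),
            To.textFile("$childPath/csv"))
}

pipeline.done()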
J

On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta <[email protected]> wrote:

> I have a number of "tables" in HDFS, represented as folders containing
> SequenceFiles of serialized objects. I'm trying to write a tool that will
> reassemble these objects and output each of the tables into its own CSV
> file.
>
> The wrinkle is that some of the "tables" hold objects with a list of
> related child objects. Those related children should get chopped into
> their own table.
>
> Here is essentially what my loop looks like (in Groovy):
>
> //loop through each top-level table
> paths.each { path ->
>     def source = From.sequenceFile(new Path(path),
>             Writables.writables(ColumnKey.class),
>             Writables.writables(ColumnDataArrayWritable.class)
>     )
>
>     //read it in
>     def data = crunchPipeline.read(source)
>
>     //write it out
>     crunchPipeline.write(
>             data.parallelDo(new MyDoFn(path), Writables.strings()),
>             To.textFile("$path/csv")
>     )
>
>     //handle children using same PTable as parent
>     if (path == TABLE_MESSAGE_DATA) {
>         messageChildPaths.each { childPath ->
>             crunchPipeline.write(
>                     data.parallelDo(new MyDoFn(childPath), Writables.strings()),
>                     To.textFile("$childPath/csv")
>             )
>         }
>     }
> }
>
> The parent and child jobs generally get grouped into a single map job, but
> most of the time only some of the child tables get included, which is to
> say, sometimes a child table does not get output. There doesn't seem to be
> a pattern: sometimes all of them get included, sometimes only 1 or 2.
>
> Am I missing something? Is there a way to specify which jobs should be
> combined?
>
> Thanks,
> Mike

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
