I was using a very early 0.5.0-incubating build with Hadoop 0.20.2, but I just did a fresh git pull, and now with 0.6.0-incubating things look better (MessageData and RelationshipData are my parents with children):
13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [MessageData]
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/MessageData
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/Contexts
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ContextualElements
13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipData
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipStructures
13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ElementData]
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ElementData
13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ConceptData]
13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ConceptData

I'll try a few more times and let you know if anything funky happens.

Thanks, as always, for your prompt responses,
Mike

On Wed, Feb 20, 2013 at 1:06 PM, Josh Wills <[email protected]> wrote:

> Hey Mike,
>
> I can't replicate this problem using the MultipleOutputIT (which I think
> we added as a test for this problem). Which version of Crunch and Hadoop
> are you using? The 0.5.0-incubating release should be up on the Maven
> repos if you want to try that out.
>
> J
>
>
> On Wed, Feb 20, 2013 at 6:43 AM, Josh Wills <[email protected]> wrote:
>
>> Hey Mike,
>>
>> The code looks right to me. Let me whip up a test and see if I can
>> replicate it easily -- is there anything funky beyond what's in your
>> snippet that I should be aware of?
>>
>> J
>>
>>
>> On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta <[email protected]> wrote:
>>
>>> I have a number of "tables" in HDFS, represented as folders containing
>>> SequenceFiles of serialized objects. I'm trying to write a tool that
>>> will reassemble these objects and output each of the tables into its
>>> own CSV file.
>>>
>>> The wrinkle is that some of the "tables" hold objects with a list of
>>> related child objects. Those related children should get split out
>>> into their own tables.
>>>
>>> Here is essentially what my loop looks like (in Groovy):
>>>
>>> //loop through each top-level table
>>> paths.each { path ->
>>>     def source = From.sequenceFile(new Path(path),
>>>         Writables.writables(ColumnKey.class),
>>>         Writables.writables(ColumnDataArrayWritable.class)
>>>     )
>>>
>>>     //read it in
>>>     def data = crunchPipeline.read(source)
>>>
>>>     //write it out
>>>     crunchPipeline.write(
>>>         data.parallelDo(new MyDoFn(path), Writables.strings()),
>>>         To.textFile("$path/csv")
>>>     )
>>>
>>>     //handle children using the same PTable as the parent
>>>     if (path == TABLE_MESSAGE_DATA) {
>>>         messageChildPaths.each { childPath ->
>>>             crunchPipeline.write(
>>>                 data.parallelDo(new MyDoFn(childPath), Writables.strings()),
>>>                 To.textFile("$childPath/csv")
>>>             )
>>>         }
>>>     }
>>> }
>>>
>>> The parent and child jobs generally get grouped into a single map job,
>>> but most of the time only some of the child tables get included, which
>>> is to say, sometimes a child table does not get output. There doesn't
>>> seem to be a pattern -- sometimes all of them get included, sometimes
>>> one or two.
>>>
>>> Am I missing something? Is there a way to specify which jobs should be
>>> combined?
>>>
>>> Thanks,
>>> Mike
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
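For reference, here is a minimal, self-contained Groovy sketch of the fan-out pattern Mike describes above: one SequenceFile source read once, then written to several text targets from the same PCollection. It assumes plain Text keys and values; ColumnKey, ColumnDataArrayWritable, and MyDoFn in Mike's snippet are his own classes and are not shown in the thread, so a hypothetical ToCsvFn stands in for them, and the paths are taken from the log output at the top of the thread.

import org.apache.crunch.DoFn
import org.apache.crunch.Emitter
import org.apache.crunch.Pair
import org.apache.crunch.impl.mr.MRPipeline
import org.apache.crunch.io.From
import org.apache.crunch.io.To
import org.apache.crunch.types.writable.Writables
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text

// Stand-in for MyDoFn: turns each key/value pair into one CSV line.
class ToCsvFn extends DoFn<Pair<Text, Text>, String> {
    void process(Pair<Text, Text> input, Emitter<String> emitter) {
        emitter.emit("${input.first()},${input.second()}".toString())
    }
}

def pipeline = new MRPipeline(ToCsvFn)

// Read the source once; /Synthesys/MessageData is the parent table
// path from the log output above.
def data = pipeline.read(From.sequenceFile(new Path('/Synthesys/MessageData'),
        Writables.writables(Text), Writables.writables(Text)))

// Fan the same PCollection out to several targets, as in Mike's loop.
// The planner is free to pack these writes into a shared map-only job.
['/Synthesys/MessageData', '/Synthesys/Contexts'].each { out ->
    pipeline.write(data.parallelDo(new ToCsvFn(), Writables.strings()),
            To.textFile("$out/csv"))
}

pipeline.done()

Each pipeline.write() call adds another output to the execution plan, and Crunch's planner decides how to group those outputs into MapReduce jobs when done() runs; that grouping is exactly where the missing-child-output bug discussed in this thread showed up before 0.6.0-incubating.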
