Not a problem, glad we got it fixed.

On Feb 20, 2013 2:31 PM, "Mike Barretta" <[email protected]> wrote:
> Was using a very early 0.5.0-incubating build with Hadoop 0.20.2, but I
> just did a fresh git pull, and with 0.6.0-incubating things look better
> (MessageData and RelationshipData are my parents with children):
>
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [MessageData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/MessageData
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/Contexts
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ContextualElements
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [RelationshipData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipData
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/RelationshipStructures
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ElementData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ElementData
> 13/02/20 19:25:04 INFO extract.SynthesysKBExtractor: reading [ConceptData]
> 13/02/20 19:25:04 INFO impl.FileTargetImpl: Will write output files to new path: /Synthesys/ConceptData
>
> I'll try a few more times and let you know if anything funky happens.
>
> Thanks, as always, for your prompt responses,
> Mike
>
>
> On Wed, Feb 20, 2013 at 1:06 PM, Josh Wills <[email protected]> wrote:
>
>> Hey Mike,
>>
>> I can't replicate this problem using the MultipleOutputIT (which I think
>> we added as a test for this problem). Which version of Crunch and Hadoop
>> are you using? The 0.5.0-incubating release should be up on the Maven
>> repos if you want to try that out.
>>
>> J
>>
>>
>> On Wed, Feb 20, 2013 at 6:43 AM, Josh Wills <[email protected]> wrote:
>>
>>> Hey Mike,
>>>
>>> The code looks right to me.
>>> Let me whip up a test and see if I can
>>> replicate it easily. Is there anything funky beyond what's in your
>>> snippet that I should be aware of?
>>>
>>> J
>>>
>>>
>>> On Wed, Feb 20, 2013 at 6:02 AM, Mike Barretta
>>> <[email protected]> wrote:
>>>
>>>> I have a number of "tables" in HDFS, represented as folders containing
>>>> SequenceFiles of serialized objects. I'm trying to write a tool that
>>>> will reassemble these objects and output each of the tables into its
>>>> own CSV file.
>>>>
>>>> The wrinkle is that some of the "tables" hold objects with a list of
>>>> related child objects. Those related child objects should get split
>>>> out into their own tables.
>>>>
>>>> Here is essentially what my loop looks like (in Groovy):
>>>>
>>>> //loop through each top-level table
>>>> paths.each { path ->
>>>>     def source = From.sequenceFile(new Path(path),
>>>>             Writables.writables(ColumnKey.class),
>>>>             Writables.writables(ColumnDataArrayWritable.class)
>>>>     )
>>>>
>>>>     //read it in
>>>>     def data = crunchPipeline.read(source)
>>>>
>>>>     //write it out
>>>>     crunchPipeline.write(
>>>>             data.parallelDo(new MyDoFn(path), Writables.strings()),
>>>>             To.textFile("$path/csv")
>>>>     )
>>>>
>>>>     //handle children using same PTable as parent
>>>>     if (path == TABLE_MESSAGE_DATA) {
>>>>         messageChildPaths.each { childPath ->
>>>>             crunchPipeline.write(
>>>>                 data.parallelDo(new MyDoFn(childPath), Writables.strings()),
>>>>                 To.textFile("$childPath/csv")
>>>>             )
>>>>         }
>>>>     }
>>>> }
>>>>
>>>> The parent and child jobs generally get grouped into a single map job,
>>>> but most of the time only some of the child tables get included, which
>>>> is to say, sometimes a child table does not get output. There doesn't
>>>> seem to be a pattern: sometimes all of them get included, sometimes
>>>> only one or two.
>>>>
>>>> Am I missing something? Is there a way to specify which jobs should be
>>>> combined?
>>>>
>>>> Thanks,
>>>> Mike
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
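
For anyone who hits the same symptom on an affected build and cannot upgrade right away, one possible workaround is to force each output into its own job by calling `Pipeline.run()` after every `write`, so the planner never has to combine multiple outputs into a single map job. This is an untested sketch, not a confirmed fix (the confirmed fix is upgrading to 0.5.0-incubating or later, where MultipleOutputIT covers this case); it reuses the names from Mike's snippet (`paths`, `crunchPipeline`, `MyDoFn`, `TABLE_MESSAGE_DATA`, `messageChildPaths`), which are assumed to be defined as in his code:

```groovy
//Workaround sketch (untested): run the pipeline after each write so every
//output is produced by its own job instead of being combined by the planner.
//All names below come from Mike's snippet and are assumed to exist.
paths.each { path ->
    def source = From.sequenceFile(new Path(path),
            Writables.writables(ColumnKey.class),
            Writables.writables(ColumnDataArrayWritable.class)
    )

    def data = crunchPipeline.read(source)

    crunchPipeline.write(
            data.parallelDo(new MyDoFn(path), Writables.strings()),
            To.textFile("$path/csv")
    )
    crunchPipeline.run() //execute now; this output cannot be dropped later

    if (path == TABLE_MESSAGE_DATA) {
        messageChildPaths.each { childPath ->
            crunchPipeline.write(
                data.parallelDo(new MyDoFn(childPath), Writables.strings()),
                To.textFile("$childPath/csv")
            )
            crunchPipeline.run() //one job per child table
        }
    }
}
```

The trade-off is efficiency: each `run()` re-reads the source data instead of sharing one map phase across outputs, so this only makes sense as a stopgap until the upgrade.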
