Hi Josh. A follow-up, just to check I've got this straight.

I've amended my pipeline and added a pipeline.run() call after the write
to HDFS. Now I do get two MapReduce jobs, but instead of the second
carrying on where the first left off, it re-does all the steps needed to
generate the PCollection that was written. I get the same jobs A and B I
described in my original email, but running sequentially rather than in
parallel. Is that what you'd expect?

So I guess that, following the write, I have to re-read the data from the
output path using pipeline.read(From.avroFile(...)) and build the follow-on
steps on the PCollection that the read returns.
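
To illustrate what I think is going on, here is a toy sketch in plain Java.
It is not Crunch code: LazyPipelineSketch, generateD, followOn and the two
runsWhen... methods are names I made up, and a Supplier stands in for a
lazily planned PCollection. It only models my assumption that a handle to
the written collection is re-planned from the start, whereas a handle
obtained by re-reading the output starts from the stored data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: none of these names are Crunch classes or methods.
class LazyPipelineSketch {
    // Counts how often the expensive upstream steps actually execute.
    static int upstreamRuns = 0;

    // Stand-in for all the steps needed to generate the written collection.
    static List<Integer> generateD() {
        upstreamRuns++;
        List<Integer> d = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            d.add(i * 10);
        }
        return d;
    }

    // Stand-in for a follow-on step that consumes the collection.
    static int followOn(Supplier<List<Integer>> input) {
        return input.get().stream().mapToInt(Integer::intValue).sum();
    }

    // Follow-on step keeps using the original lazy handle.
    static int runsWhenReusingHandle() {
        upstreamRuns = 0;
        Supplier<List<Integer>> lazyD = LazyPipelineSketch::generateD;
        List<Integer> written = lazyD.get(); // first run: compute D and "write" it
        followOn(lazyD);                     // second run: recomputes D from scratch
        return upstreamRuns;
    }

    // Follow-on step starts from a fresh handle over the stored output.
    static int runsWhenRereading() {
        upstreamRuns = 0;
        Supplier<List<Integer>> lazyD = LazyPipelineSketch::generateD;
        List<Integer> written = lazyD.get();            // first run: compute D and "write" it
        Supplier<List<Integer>> reread = () -> written; // "read" the stored output back
        followOn(reread);                               // second run: no upstream work
        return upstreamRuns;
    }

    public static void main(String[] args) {
        System.out.println("upstream runs, reusing handle: " + runsWhenReusingHandle());
        System.out.println("upstream runs, re-reading:     " + runsWhenRereading());
    }
}
```

In this model the upstream steps execute twice when the original handle is
reused and once when the output is re-read, which matches the behaviour I'm
seeing with jobs A and B.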

It'd be good if the pipeline could hold onto information about PCollections
even after they're written, so that they can be used by follow-on steps.
I'll file a JIRA to this effect so we can discuss it there.

Thanks,
Dave


On 15 January 2013 21:00, Dave Beech <[email protected]> wrote:

> Thanks Josh - that's great. I'll file a JIRA about the side-outputs
> feature, but the pipeline.run() call will serve my purpose for now.
>
> Cheers,
> Dave
>
> On 15 January 2013 18:03, Josh Wills <[email protected]> wrote:
>
>> Hey Dave,
>>
>> The way to force a sequential run would be to call pipeline.run() after
>> you write D to HDFS and before you declare the operations in step 6. What
>> we would really want here is a single MapReduce job that wrote side outputs
>> on the map side to create the dataset in step D, but we don't have support
>> for side-outputs in maps yet. Worth filing a JIRA, I think.
>>
>> Thanks!
>> Josh
>>
>
>
>
