Batch on DataflowRunner.

On Wed, Sep 18, 2019 at 4:05 PM Reuven Lax <re...@google.com> wrote:
> Are you using streaming or batch? Also which runner are you using?
>
> On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan <joseph.dun...@liveramp.com>
> wrote:
>
>> So I followed up on why TextIO shuffles and dug into the code some. It is
>> using the shards and getting all the values into a keyed group to write to
>> a single file.
>>
>> However... I wonder if there is a way to just take the records that are on
>> a worker and write them out, thus not needing a shard number. Closer to
>> how Hadoop handles writes.
>>
>> Maybe just a regular ParDo, where bundleSetup creates a writer and
>> processElement reuses that writer to write to the same file for all
>> elements within a bundle?
>>
>> I feel like this goes beyond the scope of the simple user mailing list, so
>> I'm expanding it to dev as well.
>> +dev <dev@beam.apache.org>
>>
>> Finding a solution that prevents quadrupling shuffle costs when simply
>> writing out a file is a necessity for large-scale jobs that work with 100+
>> TB of data. If anyone has any ideas I'd love to hear them.
>>
>> Thanks,
>> Shannon Duncan
>>
>> On Wed, Sep 18, 2019 at 1:06 PM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> We have been using Beam for a bit now. However, we just turned on the
>>> Dataflow shuffle service and were very surprised that the shuffled data
>>> amounts were quadruple what we expected.
>>>
>>> Turns out that the file-writing TextIO is doing shuffles within itself.
>>>
>>> Is there a way to prevent shuffling in the writing phase?
>>>
>>> Thanks,
>>> Shannon Duncan
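The per-bundle writer idea in the quoted message can be sketched in plain Python without the Beam SDK. This is only an illustration of the lifecycle, not Beam's actual API: the start_bundle/process/finish_bundle method names mirror Beam's Python DoFn hooks, and the class name, output directory, and file-naming scheme are all hypothetical.

```python
import os
import tempfile
import uuid


class DirectFileWriteFn:
    """Sketch of a DoFn-like class that writes each bundle's elements
    straight to its own file. Because the file name is random, no shard
    numbering (and hence no shuffle/group-by-key) is needed to decide
    which elements land in which file."""

    def __init__(self, output_dir):
        self.output_dir = output_dir
        self.writer = None
        self.path = None

    def start_bundle(self):
        # One file per bundle, named with a UUID so workers need no
        # coordination with each other.
        self.path = os.path.join(
            self.output_dir, "part-%s.txt" % uuid.uuid4().hex)
        self.writer = open(self.path, "w")

    def process(self, element):
        # Reuse the bundle's writer for every element.
        self.writer.write(element + "\n")

    def finish_bundle(self):
        self.writer.close()
        return self.path


# Simulate a runner invoking the lifecycle for a single bundle.
out_dir = tempfile.mkdtemp()
fn = DirectFileWriteFn(out_dir)
fn.start_bundle()
for record in ["a", "b", "c"]:
    fn.process(record)
written = fn.finish_bundle()
```

The trade-off is that file count and file sizes are then dictated by how the runner bundles elements, which on Dataflow batch can mean many small, unevenly sized files rather than a fixed number of shards.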