As a follow-up: the pricing being based on the number of bytes written to + read from the shuffle is confirmed.
However, we were able to figure out a way to lower shuffle costs, and things are right in the world again. Thanks, y'all!

Shannon

On Wed, Sep 18, 2019 at 4:52 PM Reuven Lax <re...@google.com> wrote:

> I believe that the "Total shuffle data processed" counter counts the number of bytes written to shuffle + the number of bytes read. So if you shuffle 1GB of data, you should expect to see 2GB on the counter.
>
> On Wed, Sep 18, 2019 at 2:39 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>
>> Ok, just ran the job on a small input and did not specify numShards, so it's literally just:
>>
>> .apply("WriteLines", TextIO.write().to(options.getOutput()));
>>
>> Output of map for join:
>> [image: image.png]
>>
>> Details of shuffle:
>> [image: image.png]
>>
>> Reported bytes shuffled:
>> [image: image.png]
>>
>> On Wed, Sep 18, 2019 at 4:24 PM Reuven Lax <re...@google.com> wrote:
>>
>>> On Wed, Sep 18, 2019 at 2:12 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>>
>>>> I will attempt to do it without sharding (though I believe we did do a run without shards and it incurred the extra shuffle costs).
>>>
>>> It shouldn't. There will be a shuffle, but that shuffle should contain a small amount of data (essentially a list of filenames).
>>>
>>>> The pipeline is simple. The only shuffle that is explicitly defined is the one after merging files together into a single PCollection (Flatten transform).
>>>>
>>>> So it's Read > Flatten > Shuffle > Map (Format) > Write. We expected to pay for the middle shuffle, but were surprised to see that the output data from the Flatten was quadrupled in the shuffled GB reported by Dataflow, which led me down this path of finding things.
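Reuven's explanation of the counter can be put in numbers. A minimal sketch (plain Java arithmetic, not Dataflow code; the class and method names are illustrative only) of why the reported figure is roughly double the logical data size:

```java
public class ShuffleCounterEstimate {
    // The "Total shuffle data processed" counter counts bytes written to
    // shuffle plus bytes read back out, so one logical pass over the data
    // shows up doubled on the counter.
    static long totalShuffleDataProcessed(long bytesShuffled) {
        return 2 * bytesShuffled; // written once + read once
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30; // shuffle 1 GiB of data...
        // ...and the counter should report about 2 GiB.
        System.out.println(totalShuffleDataProcessed(oneGiB)); // 2147483648
    }
}
```

This also explains the "quadrupled" observation upthread: an extra shuffle stage inside the write doubles the data shuffled, and the counter doubles it again.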
>>>> [image: image.png]
>>>>
>>>> On Wed, Sep 18, 2019 at 4:08 PM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> In that case you should be able to leave sharding unspecified, and you won't incur the extra shuffle. Specifying explicit sharding is generally necessary only for streaming.
>>>>>
>>>>> On Wed, Sep 18, 2019 at 2:06 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>>>>
>>>>>> Batch, on the DataflowRunner.
>>>>>>
>>>>>> On Wed, Sep 18, 2019 at 4:05 PM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>> Are you using streaming or batch? Also, which runner are you using?
>>>>>>>
>>>>>>> On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>>>>>>
>>>>>>>> So I followed up on why TextIO shuffles and dug into the code some. It is using the shards and gathering all the values into a keyed group in order to write each group to a single file.
>>>>>>>>
>>>>>>>> However... I wonder if there is a way to just take the records that are on a worker and write them out, thus not needing a shard number. That would be closer to how Hadoop handles writes.
>>>>>>>>
>>>>>>>> Maybe just a regular ParDo that creates a writer on bundle setup, and processElement reuses that writer to write all elements within a bundle to the same file?
>>>>>>>>
>>>>>>>> I feel like this goes beyond the scope of the user mailing list, so I'm expanding it to dev as well.
>>>>>>>> +dev <dev@beam.apache.org>
>>>>>>>>
>>>>>>>> Finding a solution that prevents quadrupling shuffle costs when simply writing out a file is a necessity for large-scale jobs that work with 100+ TB of data. If anyone has any ideas, I'd love to hear them.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shannon Duncan
>>>>>>>>
>>>>>>>> On Wed, Sep 18, 2019 at 1:06 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
>>>>>>>>
>>>>>>>>> We have been using Beam for a bit now.
>>>>>>>>> However, we just turned on the Dataflow shuffle service and were very surprised that the shuffled data amounts were quadruple what we expected.
>>>>>>>>>
>>>>>>>>> It turns out that the file-writing TextIO is doing shuffles within itself.
>>>>>>>>>
>>>>>>>>> Is there a way to prevent shuffling in the writing phase?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shannon Duncan
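The bundle-scoped writer idea floated upthread (open one writer per bundle, reuse it for every element, commit it when the bundle finishes) can be sketched in plain Java, independent of the Beam SDK. All names here are hypothetical: in a real DoFn the three methods would correspond to the @StartBundle, @ProcessElement, and @FinishBundle lifecycle, and the in-memory buffer would be an actual file handle. This is a sketch of the pattern, not a working Beam sink.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a bundle-scoped writer: one output "file" per
// bundle, no shard numbers, and therefore no keyed grouping (shuffle)
// before the write.
class BundleWriter {
    private StringBuilder current;                        // stands in for an open file handle
    private final List<String> files = new ArrayList<>(); // committed per-bundle outputs

    void startBundle() {                 // analogous to @StartBundle: open a fresh writer
        current = new StringBuilder();
    }

    void processElement(String line) {   // analogous to @ProcessElement: reuse the writer
        current.append(line).append('\n');
    }

    void finishBundle() {                // analogous to @FinishBundle: close and commit
        files.add(current.toString());
        current = null;
    }

    List<String> files() { return files; }
}

public class BundleWriterDemo {
    public static void main(String[] args) {
        BundleWriter w = new BundleWriter();
        // Two bundles of elements, as a runner might deliver them.
        w.startBundle();
        w.processElement("a");
        w.processElement("b");
        w.finishBundle();
        w.startBundle();
        w.processElement("c");
        w.finishBundle();
        // One output file per bundle, no shuffle needed.
        System.out.println(w.files().size()); // 2
    }
}
```

The trade-off, as the thread notes, is that the number and size of output files then depend on how the runner happens to bundle the data, which is essentially what runner-determined sharding (leaving numShards unset in batch) already gives you.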