Batch on DataflowRunner.

On Wed, Sep 18, 2019 at 4:05 PM Reuven Lax <re...@google.com> wrote:
> Are you using streaming or batch? Also which runner are you using?
>
> On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan <joseph.dun...@liveramp.com>
> wrote:
>
>> So I followed up on why TextIO shuffles and dug into the code some. It is
>> using the shards and getting all the values into a keyed group to write to
>> a single file.
>>
>> However... I wonder if there is a way to just take the records that are on
>> a worker and write them out, thus not needing a shard number. Closer to
>> how Hadoop handles writes.
>>
>> Maybe just a regular ParDo, where bundleSetup creates a writer and
>> processElement reuses that writer to write to the same file for all
>> elements within a bundle?
>>
>> I feel like this goes beyond the scope of the simple user mailing list, so
>> I'm expanding it to dev as well.
>> +dev <dev@beam.apache.org>
>>
>> Finding a solution that prevents quadrupling shuffle costs when simply
>> writing out a file is a necessity for large-scale jobs that work with 100+
>> TB of data. If anyone has any ideas I'd love to hear them.
>>
>> Thanks,
>> Shannon Duncan
>>
>> On Wed, Sep 18, 2019 at 1:06 PM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> We have been using Beam for a bit now. However, we just turned on the
>>> Dataflow shuffle service and were very surprised that the shuffled data
>>> amounts were quadruple what we expected.
>>>
>>> Turns out that the file-writing TextIO is doing shuffles within itself.
>>>
>>> Is there a way to prevent shuffling in the writing phase?
>>>
>>> Thanks,
>>> Shannon Duncan
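The per-bundle writer idea in the quoted message can be sketched in plain Python without the Beam SDK. This is only an illustration of the lifecycle, not Beam's actual API: the start_bundle/process/finish_bundle method names mirror Beam's Python DoFn hooks, and the class name, output directory, and file-naming scheme are all hypothetical.

```python
import os
import tempfile
import uuid


class DirectFileWriteFn:
    """Sketch of a DoFn-like class that writes each bundle's elements
    straight to its own file. Because the file name is random, no shard
    numbering (and hence no shuffle/group-by-key) is needed to decide
    which elements land in which file."""

    def __init__(self, output_dir):
        self.output_dir = output_dir
        self.writer = None
        self.path = None

    def start_bundle(self):
        # One file per bundle, named with a UUID so workers need no
        # coordination with each other.
        self.path = os.path.join(
            self.output_dir, "part-%s.txt" % uuid.uuid4().hex)
        self.writer = open(self.path, "w")

    def process(self, element):
        # Reuse the bundle's writer for every element.
        self.writer.write(element + "\n")

    def finish_bundle(self):
        self.writer.close()
        return self.path


# Simulate a runner invoking the lifecycle for a single bundle.
out_dir = tempfile.mkdtemp()
fn = DirectFileWriteFn(out_dir)
fn.start_bundle()
for record in ["a", "b", "c"]:
    fn.process(record)
written = fn.finish_bundle()
```

The trade-off is that file count and file sizes are then dictated by how the runner bundles elements, which on Dataflow batch can mean many small, unevenly sized files rather than a fixed number of shards.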