Thanks. I responded to comments in the doc. More inline.

On Thu, Jun 27, 2019 at 2:44 PM Chamikara Jayalath <chamik...@google.com>
wrote:

> Thanks, added a few comments.
>
> If I understood correctly, you basically assign keyed elements to
> different buckets, which are written to unique files, and merge files for
> the same key while reading?
>
> Some of my concerns are.
>
> (1) Seems like you rely on an in-memory sorting of buckets. Will this end
> up limiting the size of a PCollection you can process?
>
The sorter transform we're using supports spilling and external sort. We
can break up large key groups further by sharding, similar to fan out in
some GBK transforms.
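To make the bucketing/sharding idea concrete, here's a rough sketch (names and scheme hypothetical, not the actual PR code): elements are bucketed by a stable hash of the join key, and hot key groups can be split across randomly assigned shards so no single sorter sees too much data.

```java
import java.util.Objects;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of SMB bucket/shard assignment, not the actual PR code.
public class BucketShardSketch {
  // Stable hash modulo bucket count; the same key always lands in the
  // same bucket, which is what makes the later merge-read possible.
  static int bucketFor(Object key, int numBuckets) {
    return Math.floorMod(Objects.hashCode(key), numBuckets);
  }

  // Shards within a bucket are assigned randomly; readers simply merge
  // all shard files of a bucket, so the split is transparent to the join.
  static int shardFor(int numShards) {
    return ThreadLocalRandom.current().nextInt(numShards);
  }

  public static void main(String[] args) {
    int b1 = bucketFor("user-42", 8);
    int b2 = bucketFor("user-42", 8);
    System.out.println(b1 == b2); // same key -> same bucket
  }
}
```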

> (2) Seems like you rely on Reshuffle.viaRandomKey() which is actually
> implemented using a shuffle (which you try to replace with this proposal).
>
That's for distributing task metadata, so that each DoFn thread picks up a
random bucket and sort-merges its key-values. It's not shuffling the actual
data.
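In other words, the only thing redistributed is a tiny set of (bucket, shard) descriptors; the record data stays in the bucket files and is read sequentially by whichever worker picks a descriptor up. A toy sketch of that (plain Java, not Beam code; names hypothetical):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch only: mimics Reshuffle.viaRandomKey() over task metadata.
public class MetadataShuffleSketch {
  // Build all (bucket, shard) task descriptors and randomize their order
  // so they spread evenly across reader threads.
  static List<String> shuffledTasks(int numBuckets, int numShards, long seed) {
    List<String> tasks = new ArrayList<>();
    for (int b = 0; b < numBuckets; b++) {
      for (int s = 0; s < numShards; s++) {
        tasks.add("bucket-" + b + "/shard-" + s);
      }
    }
    Collections.shuffle(tasks, new Random(seed));
    return tasks;
  }

  public static void main(String[] args) {
    List<String> tasks = shuffledTasks(4, 2, 42L);
    // 8 tiny descriptors cross the wire; zero record data is moved.
    System.out.println(tasks.size());
  }
}
```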


> (3) I think (at least some of the) shuffle implementations are implemented
> in ways similar to this (writing to files and merging). So I'm wondering if
> the performance benefits you see are for a very specific case and may limit
> the functionality in other ways.
>
This is for the common pattern of a few core data-producer pipelines feeding
many downstream consumer pipelines. It's not intended to replace shuffle/join
within a single pipeline. On the producer side, by pre-grouping/sorting data
and writing it to bucket/shard output files, the consumer can sort-merge
matching files without a CoGBK. Essentially we're paying the shuffle cost
upfront to avoid paying it repeatedly in each consumer pipeline that wants to
join the data.
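The consumer-side read then reduces to a linear two-pointer merge over bucket files that are already sorted by join key. A minimal sketch (for simplicity it assumes unique keys per side; a real implementation would group duplicates):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the consumer-side merge: both inputs are already
// sorted by join key, so the join is a sequential scan -- no shuffle/CoGBK.
public class SortedMergeJoinSketch {
  public static List<String> mergeJoin(List<Map.Entry<String, String>> left,
                                       List<Map.Entry<String, String>> right) {
    List<String> out = new ArrayList<>();
    int i = 0, j = 0;
    while (i < left.size() && j < right.size()) {
      int cmp = left.get(i).getKey().compareTo(right.get(j).getKey());
      if (cmp < 0) {
        i++; // left key has no match yet; advance left
      } else if (cmp > 0) {
        j++; // right key has no match yet; advance right
      } else {
        // Keys match: emit the joined pair and advance both sides.
        out.add(left.get(i).getKey() + ":" + left.get(i).getValue()
            + "," + right.get(j).getValue());
        i++;
        j++;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    var left = List.of(Map.entry("a", "1"), Map.entry("b", "2"), Map.entry("d", "4"));
    var right = List.of(Map.entry("b", "x"), Map.entry("c", "y"), Map.entry("d", "z"));
    System.out.println(mergeJoin(left, right)); // [b:2,x, d:4,z]
  }
}
```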


> Thanks,
> Cham
>
>
> On Thu, Jun 27, 2019 at 8:12 AM Neville Li <neville....@gmail.com> wrote:
>
>> Ping again. Any chance someone takes a look to get this thing going? It's
>> just a design doc and basic metadata/IO impl. We're not talking about
>> actual source/sink code yet (already done but saved for future PRs).
>>
>> On Fri, Jun 21, 2019 at 1:38 PM Ahmet Altay <al...@google.com> wrote:
>>
>>> Thank you Claire, this looks promising. Explicitly adding a few folks
>>> that might have feedback: +Ismaël Mejía <ieme...@gmail.com> +Robert
>>> Bradshaw <rober...@google.com> +Lukasz Cwik <lc...@google.com> +Chamikara
>>> Jayalath <chamik...@google.com>
>>>
>>> On Mon, Jun 17, 2019 at 2:12 PM Claire McGinty <
>>> claire.d.mcgi...@gmail.com> wrote:
>>>
>>>> Hey dev@!
>>>>
>>>> A few other Spotify data engineers and I have put together a design
>>>> doc for SMB Join support in Beam
>>>> <https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit?usp=sharing>,
>>>>  and
>>>> have a working Java implementation we've started to put up for PR ([0
>>>> <https://github.com/apache/beam/pull/8823>], [1
>>>> <https://github.com/apache/beam/pull/8824>], [2
>>>> <https://github.com/apache/beam/pull/8486>]). There's more detailed
>>>> information in the document, but the tl;dr is that SMB is a strategy to
>>>> optimize joins for file-based sources by modifying the initial write
>>>> operation to write records in sorted buckets based on the desired join key.
>>>> This means that subsequent joins of datasets written in this way are only
>>>> sequential file reads, no shuffling involved. We've seen some pretty
>>>> substantial performance speedups with our implementation and would love to
>>>> get it checked in to Beam's Java SDK.
>>>>
>>>> We'd appreciate any suggestions or feedback on our proposal--the design
>>>> doc should be public to comment on.
>>>>
>>>> Thanks!
>>>> Claire / Neville
>>>>
>>>
