Re: Sort Merge Bucket - Action Items

Robert Bradshaw Mon, 22 Jul 2019 07:49:50 -0700

On Mon, Jul 22, 2019 at 4:04 PM Neville Li <[email protected]> wrote:
>
> Thanks Robert. Agree with the FileIO point. I'll look into it and see what 
> needs to be done.
>
> Eugene pointed out that we shouldn't build on FileBased{Source,Sink}. So for 
> writes I'll probably build on top of WriteFiles.


Meaning it could be parameterized by FileIO.Sink, right?

https://github.com/apache/beam/blob/release-2.13.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L779

> Read might be a bigger change w.r.t. collocating ordered elements across 
> files within a bucket and TBH I'm not even sure where to start.

Yeah, here we need an interface that gives us ReadableFile ->
Iterable<T>. There are existing PTransform<PCollection<ReadableFile>,
PCollection<T>> but such an interface is insufficient to extract
ordered records per shard. It seems the only concrete implementations
are based on FileBasedSource, which we'd like to avoid, but there's no
alternative. An SDF, if exposed, would likely be overkill and
cumbersome to call (given the reflection machinery involved in
invoking DoFns).

> I'll file separate PRs for core changes needed for discussion. WDYT?

Sounds good.

> On Mon, Jul 22, 2019 at 4:20 AM Robert Bradshaw <[email protected]> wrote:
>>
>> On Fri, Jul 19, 2019 at 5:16 PM Neville Li <[email protected]> wrote:
>> >
>> > Forking this thread to discuss action items regarding the change. We can 
>> > keep technical discussion in the original thread.
>> >
>> > Background: our SMB POC showed promising performance & cost saving 
>> > improvements and we'd like to adopt it for production soon (by EOY). We 
>> > want to contribute it to Beam so it's better generalized and maintained. 
>> > We also want to avoid divergence between our internal version and the PR 
>> > while it's in progress, specifically any breaking change in the produced 
>> > SMB data.
>>
>> All good goals.
>>
>> > To achieve that I'd like to propose a few action items.
>> >
>> > 1. Reach a consensus about bucket and shard strategy, key handling, bucket 
>> > file and metadata format, etc., anything that affect produced SMB data.
>> > 2. Revise the existing PR according to #1
>> > 3. Reduce duplicate file IO logic by reusing FileIO.Sink, Compression, 
>> > etc., but keep the existing file level abstraction
>> > 4. (Optional) Merge code into extensions::smb but mark clearly as 
>> > @experimental
>> > 5. Incorporate ideas from the discussion, e.g. ShardingFn, 
>> > GroupByKeyAndSortValues, FileIO generalization, key URN, etc.
>> >
>> > #1-4 gives us something usable in the short term, while #1 guarantees that 
>> > production data produced today are usable when #5 lands on master. #4 also 
>> > gives early adopters a chance to give feedback.
>> > Due to the scope of #5, it might take much longer and a couple of big PRs 
>> > to achieve, which we can keep iterating on.
>> >
>> > What are your thoughts on this?
>>
>> I would like to see some resolution on the FileIO abstractions before
>> merging into experimental. (We have a FileBasedSink that would mostly
>> already work, so it's a matter of coming up with an analogous Source
>> interface.) Specifically I would not want to merge a set of per file
>> type smb IOs without a path forward to this or the determination that
>> it's not possible/desirable.

Re: Sort Merge Bucket - Action Items

Reply via email to