How mature is Flink batch? I am wondering if we can build this with the Flink Batch APIs. Hudi itself will provide micro-batching semantics (incremental pull, upserts). For true streaming performance, I am not sure even cloud stores are ready, since none of them support append()s. IMHO a mini/micro-batch model makes sense for the kind of space Hudi solves.. We can keep working towards making this
I have some thoughts on indexing, a new way to index that complements BloomIndex and overcomes some of the Spark caching needs. Will write it down over the weekend and share it.. P.S: I am also wondering if we should generalize HIPs into an RFC process [1]; that way it does not have to be a real technical proposal and we can still have structured conversations on the future roadmap etc. and evolve the shape together.. (Maybe this deserves its own separate DISCUSS thread) [1] - https://en.wikipedia.org/wiki/Request_for_Comments On Wed, Sep 25, 2019 at 9:07 PM Taher Koitawala <[email protected]> wrote: > Hi Vino, > Agree with your suggestion. We all know that though Flink is > streaming, we can control how files get rolled out through checkpointing > configurations. Bad config and small files get rolled out; good config and > files are properly sized. > > Also, I understand the concern of reading files with Flink and the > performance related to it (I have faced it before). So how about we build > our own functions which can read and write efficiently, and which are not a source > function but an operator! What I mean is, let's use the Akka-based > semantics of passing messages between these operators and read what is > required. > > Have 1 custom stream operator that takes requests for reading files (again, > that is not a source function, it is an operator); that operator reads the > file and passes it downstream in a parallel manner (maybe the AsyncIO > extension can be a better call here). Let me know your thoughts on this. > I.e.: if we choose this, everything will be written on the core DataStreams API. > > As far as the Flink batch and stream talk goes, I guess as a community we have > already agreed that for batch the Spark engine is good enough and streams > will be powered by Flink, whereas Beam will do both. By the time Flink's > unification of batch and stream is complete, our code will be batch > compatible with minimal changes. 
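The read-operator idea above (paths arrive as messages, files are read off the main thread, records flow downstream) can be sketched in plain Java. This is a minimal sketch of the pattern only, using `CompletableFuture` rather than Flink's actual AsyncIO extension; the class name and method signatures here are hypothetical, not from Flink or Hudi.

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Sketch of a "read operator": it is not a source function; it reacts to
// file-path messages, reads each file asynchronously on a worker pool, and
// hands every record to a downstream consumer.
public class AsyncFileReadOperator {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Process one "read this file" message; records are forwarded downstream
    // as they are produced, in parallel with other files being read.
    public CompletableFuture<Void> process(Path file,
                                           java.util.function.Consumer<String> downstream) {
        return CompletableFuture.runAsync(() -> {
            try {
                for (String line : Files.readAllLines(file)) {
                    downstream.accept(line);
                }
            } catch (Exception e) {
                throw new CompletionException(e);
            }
        }, pool);
    }

    public void shutdown() { pool.shutdown(); }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("records", ".txt");
        Files.write(tmp, List.of("r1", "r2", "r3"));
        List<String> out = Collections.synchronizedList(new ArrayList<>());
        AsyncFileReadOperator op = new AsyncFileReadOperator();
        op.process(tmp, out::add).join();
        op.shutdown();
        System.out.println(out.size());
    }
}
```

In an actual Flink job this role would be played by a custom stream operator or an `AsyncFunction`, so that backpressure and checkpointing are handled by the framework rather than a raw thread pool.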
> > Further, we need to plan what part of the Flink API we will use: should we stick > to the hard-core DataStreams API, or should we merge it with the Table API (I think > we should) so that we get append sinks, retract sinks, temporal tables, and the > capability to use the new Blink table planner? Also, it would to some extent > lift our work of reading files, since the Table API, I believe, is good at > doing those things and comes with built-in readers and writers. So > basically, if we use that, we could also do an in-stream join of the stream source > and the Hudi files to recompute upserts etc. AFAIK the Flink Table API is also > getting a .cache()-like functionality now. (Saw conversations about it in > the mailing list) > > So I think we really need to start planning such things to move ahead. > Other than that, 100% with Vino that we cannot read files otherwise in Flink; > it would be really bad to do that. > > Regards, > Taher Koitawala > > On Thu, Sep 26, 2019, 8:47 AM vino yang <[email protected]> wrote: > > > Hi > > > > A simple example: in the Hudi project, you can find many code snippets like > > `spark.read().format().load()`. The load method can take any path, > > especially DFS paths. > > > > While if we only want to use Flink streaming, there is no good way to > > read HDFS now. > > > > In addition, we also need to consider other capabilities of Flink versus > > Spark. You should know the Spark API (non-Structured-Streaming mode) can > support > > both streaming (micro-batch) and batch. However, Flink distinguishes them > > with two different APIs, and they have different feature sets. > > > > > > > > On 09/25/2019 13:15, Semantic Beeng <[email protected]> wrote: > > > > Hi Vino, > > > > Would you be kind enough to start a wiki page to discuss this deep understanding > > of the functionality and design of Hudi? > > > > There you can put git links (https://github.com/ben-gibson/GitLink for > > IntelliJ) and design knowledge so we can discuss in context. 
> > > > I am exploring the approach from this retweet > > https://twitter.com/semanticbeeng/status/1176241250967666689?s=20 and > > need this understanding you have. > > > > "difficult to ignore Flink Batch API to match some features provided by > > Hudi now" - can you please post there some gitlinks to this? > > > > Thanks > > > > Nick > > > > > > > > > > On September 24, 2019 at 10:22 PM vino yang <[email protected]> > > wrote: > > > > Hi Taher, > > > > As I mentioned in the previous mail, things may not be too easy by just > > using the Flink state API. > > > > Copied here: "Hudi can connect with many different Sources/Sinks. Some > > file-based reads are not appropriate for Flink Streaming." > > > > Although unifying Batch and Streaming is Flink's goal, it is difficult > > to ignore the Flink Batch API to match some features provided by Hudi now. > > > > The example you provided is in the application layer, about usage. So my > > suggestion is to be patient; it needs time to produce a detailed design. > > > > Best, > > Vino > > > > > > > > On 09/24/2019 22:38, Taher Koitawala <[email protected]> wrote: > > Hi All, > > Sample code to see how record tagging will be handled in > > Flink is posted on [1]. The main class to run the same is MockHudi.java > > with a sample path for checkpointing. > > > > As of now this is just a sample to show what we should be caching in Flink > > state with bare-minimum configs. > > > > > > As per my experience, I have cached around 10s of TBs in Flink RocksDB > > state > > with the right configs. So I'm sure it should work here as well. > > > > 1: > > > > > https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi > > > > Regards, > > Taher Koitawala > > > > > > On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar <[email protected]> wrote: > > > It won't be much different than the HBaseIndex we have today. 
Would like > > to > > > have always have an option like BloomIndex that does not need any > > external > > > dependencies. > > > The moment you bring an external data store in, someone becomes a DBA. > > :) > > > > > > On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng <[email protected] > > > > > wrote: > > > > > > > @vc can you see how ApacheCrail could be used to implement this at > > scale > > > > but also in a way that abstracts over both Spark and Flink? > > > > > > > > "Crail Store implements a hierarchical namespace across a cluster of > > RDMA > > > > interconnected storage resources such as DRAM or flash" > > > > > > > > https://crail.incubator.apache.org/overview/ > > > > > > > > + 2 cents > > > > https://twitter.com/semanticbeeng/status/1175767500790915072?s=20 > > > > > > > > Cheers > > > > > > > > Nick > > > > > > > > On September 22, 2019 at 9:28 AM Vinoth Chandar <[email protected]> > > > wrote: > > > > > > > > > > > > It could be much larger. :) imagine billions of keys each 32 bytes, > > > mapped > > > > to another 32 byte > > > > > > > > The advantage of the current bloom index is that its effectively > > stored > > > > with data itself and this reduces complexity in terms of keeping > index > > > and > > > > data consistent etc > > > > > > > > One orthogonal idea from long time ago that moves indexing out of > data > > > > storage and is generalizable > > > > > > > > https://github.com/apache/incubator-hudi/wiki/HashMap-Index > > > > > > > > If someone here knows flink well and can implement some standalone > > flink > > > > code to mimic tagLocation() functionality and share with the group, > > that > > > > would be great. Lets worry about performance once we have a flink > DAG. > > I > > > > think this is a critical and most tricky piece in supporting flink. 
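The tagLocation() behavior Vinoth asks to mimic above can be shown standalone in plain Java. This is an illustrative mock, not Hudi's actual `HoodieIndex` API: an in-memory `Map` stands in for the index state (Flink keyed state or RocksDB in the real design), and the `Tagged` record type is hypothetical. Tagging decides, per record key, whether the record is an insert (key unseen) or an upsert into a known file.

```java
import java.util.*;

// Mock of tagLocation(): look each record key up in the index; known keys
// are tagged as updates to their existing file, unseen keys as inserts.
public class TagLocationMock {
    record Tagged(String key, String fileId, boolean isUpdate) {}

    // The "index state": record key -> file id holding that key.
    private final Map<String, String> keyToFileId = new HashMap<>();

    Tagged tag(String recordKey) {
        String fileId = keyToFileId.get(recordKey);
        if (fileId != null) {
            return new Tagged(recordKey, fileId, true);   // upsert path
        }
        String newFile = "new-file-" + keyToFileId.size(); // insert path
        keyToFileId.put(recordKey, newFile);
        return new Tagged(recordKey, newFile, false);
    }

    public static void main(String[] args) {
        TagLocationMock idx = new TagLocationMock();
        Tagged first = idx.tag("uuid-1");  // first sighting: insert
        Tagged again = idx.tag("uuid-1");  // same key later: update, same file
        System.out.println(first.isUpdate() + " " + again.isUpdate());
    }
}
```

The tricky part in a real Flink DAG, as the thread notes, is that this map can hold billions of entries, which is why the state-backend choice (heap vs. RocksDB) and keying strategy matter.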
> > > > > > > > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil <[email protected] > > > > > > wrote: > > > > > > > > Hi Taher, > > > > > > > > I agree with this , if the state is becoming too large we should have > > an > > > > option of storing it in external state like File System or RocksDb. > > > > > > > > @Vinoth Chandar <[email protected]> can the state of > HoodieBloomIndex > > go > > > > beyond 10-15 GB > > > > > > > > Regards, > > > > Vinay Patil > > > > > > > > > > > > > > > > > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala <[email protected] > > > > > > wrote: > > > > > > > > >> Hey Guys, Any thoughts on the above idea? To handle > > HoodieBloomIndex > > > > with > > > > >> HeapState, RocksDBState and FsState but on Spark. > > > > >> > > > > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala < > [email protected]> > > > > > > >> wrote: > > > > >> > > > > >> > Hi Vinoth, > > > > >> > Having seen the doc and code. I understand the > > > > >> > HoodieBloomIndex mainly caches key and partition path. Can we > > > address > > > > >> how > > > > >> > Flink does it? Like, have HeapState where the user chooses to > > cache > > > > the > > > > >> > Index on heap, RockDBState where indexes are written to RocksDB > > and > > > > >> finally > > > > >> > FsState where indexes can be written to HDFS, S3, Azure Fs. And > > on > > > > top, > > > > >> we > > > > >> > can do an index Time To Live. > > > > >> > > > > > >> > Regards, > > > > >> > Taher Koitawala > > > > >> > > > > > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar < > > [email protected]> > > > > >> wrote: > > > > >> > > > > > >> >> I still feel the key thing here is reimplementing > > HoodieBloomIndex > > > > >> without > > > > >> >> needing spark caching. > > > > >> >> > > > > >> >> > > > > >> > > > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global > > > > > > ) > > > > >> >> documents the spark DAG in detail. 
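For context on the BloomIndex discussion above, the pruning step can be illustrated standalone: each data file carries a bloom filter over its record keys, and tagging first shortlists candidate files via the filters before (in real Hudi) verifying against actual keys. This is a toy sketch with a deliberately crude two-hash filter, not Hudi's implementation; the key property it shows is that bloom filters may produce false positives but never false negatives, so the true file is never missed.

```java
import java.util.*;

// Toy per-file bloom filter used to prune the set of files that could
// contain an incoming record key.
public class BloomPruneSketch {
    static class Bloom {
        final BitSet bits = new BitSet(1024);
        void add(String k) { bits.set(h(k, 7)); bits.set(h(k, 31)); }
        boolean mightContain(String k) { return bits.get(h(k, 7)) && bits.get(h(k, 31)); }
        static int h(String k, int seed) { return Math.abs((k.hashCode() * seed) % 1024); }
    }

    public static void main(String[] args) {
        Map<String, Bloom> fileFilters = new HashMap<>();
        Bloom f1 = new Bloom(); f1.add("uuid-1"); f1.add("uuid-2");
        Bloom f2 = new Bloom(); f2.add("uuid-9");
        fileFilters.put("file-1", f1);
        fileFilters.put("file-2", f2);

        // Shortlist files that might hold the incoming key; false positives
        // are possible, so a real index then checks the shortlisted files' keys.
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Bloom> e : fileFilters.entrySet()) {
            if (e.getValue().mightContain("uuid-1")) candidates.add(e.getKey());
        }
        System.out.println(candidates.contains("file-1"));
    }
}
```

Because the filters live in the file footers, the index stays consistent with the data for free, which is the advantage Vinoth points out versus an external store.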
> > > > >> >> > > > > >> >> If everyone feels, it's best for me to scope the work out, then > > > happy > > > > >> to > > > > >> >> do > > > > >> >> it! > > > > >> >> > > > > >> >> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala < > > > [email protected] > > > > > > > > > >> >> wrote: > > > > >> >> > > > > >> >> > Guys I think we are slowing down on this again. We need to > > start > > > > >> >> planning > > > > >> >> > small small tasks towards this VC please can you help fast > > track > > > > >> this? > > > > >> >> > > > > > >> >> > Regards, > > > > >> >> > Taher Koitawala > > > > >> >> > > > > > >> >> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar < > > [email protected] > > > > > > > > >> >> wrote: > > > > >> >> > > > > > >> >> > > Look forward to the analysis. A key class to read would be > > > > >> >> > > HoodieBloomIndex, which uses a lot of spark caching and > > > shuffles. > > > > >> >> > > > > > > >> >> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang < > > > [email protected] > > > > > > > > > >> >> wrote: > > > > >> >> > > > > > > >> >> > > > >> Currently Spark Streaming micro batching fits well > with > > > > Hudi, > > > > >> >> since > > > > >> >> > it > > > > >> >> > > > amortizes the cost of indexing, workload profiling etc. 1 > > > spark > > > > >> >> micro > > > > >> >> > > batch > > > > >> >> > > > = 1 hudi commit > > > > >> >> > > > With the per-record model in Flink, I am not sure how > > useful > > > it > > > > >> >> will be > > > > >> >> > > to > > > > >> >> > > > support hudi.. for e.g, 1 input record cannot be 1 hudi > > > commit, > > > > >> it > > > > >> >> will > > > > >> >> > > be > > > > >> >> > > > inefficient.. > > > > >> >> > > > > > > > >> >> > > > Yes, if 1 input record = 1 hudi commit, it would be > > > > inefficient. > > > > >> >> About > > > > >> >> > > > Flink streaming, we can also implement the "batch" and > > > > >> "micro-batch" > > > > >> >> > > model > > > > >> >> > > > when process data. 
For example: > > > > >> >> > > > > > > > >> >> > > > - aggregation: use flexibility window mechanism; > > > > >> >> > > > - non-aggregation: use Flink stateful state API cache a > > batch > > > > >> >> data > > > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > >> On first focussing on decoupling of Spark and Hudi > > alone, > > > > yes > > > > >> a > > > > >> >> full > > > > >> >> > > > summary of how Spark is being used in a wiki page is a > > good > > > > start > > > > >> >> IMO. > > > > >> >> > We > > > > >> >> > > > can then hash out what can be generalized and what cannot > > be > > > > and > > > > >> >> needs > > > > >> >> > to > > > > >> >> > > > be left in hudi-client-spark vs hudi-client-core > > > > >> >> > > > > > > > >> >> > > > agree > > > > >> >> > > > > > > > >> >> > > > Vinoth Chandar <[email protected]> 于2019年8月14日周三 > > 上午8:35写道: > > > > >> >> > > > > > > > >> >> > > > > >> We should only stick to Flink Streaming. Furthermore > > if > > > > >> there > > > > >> >> is a > > > > >> >> > > > > requirement for batch then users > > > > >> >> > > > > >> should use Spark or then we will anyway have a beam > > > > >> integration > > > > >> >> > > coming > > > > >> >> > > > > up. > > > > >> >> > > > > > > > > >> >> > > > > Currently Spark Streaming micro batching fits well with > > > Hudi, > > > > >> >> since > > > > >> >> > it > > > > >> >> > > > > amortizes the cost of indexing, workload profiling etc. > > 1 > > > > spark > > > > >> >> micro > > > > >> >> > > > batch > > > > >> >> > > > > = 1 hudi commit > > > > >> >> > > > > With the per-record model in Flink, I am not sure how > > > useful > > > > it > > > > >> >> will > > > > >> >> > be > > > > >> >> > > > to > > > > >> >> > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi > > > > >> commit, it > > > > >> >> > will > > > > >> >> > > > be > > > > >> >> > > > > inefficient.. 
> > > > >> >> > > > > > > > > >> >> > > > > On first focussing on decoupling of Spark and Hudi > > alone, > > > > yes a > > > > >> >> full > > > > >> >> > > > > summary of how Spark is being used in a wiki page is a > > good > > > > >> start > > > > >> >> > IMO. > > > > >> >> > > We > > > > >> >> > > > > can then hash out what can be generalized and what > > cannot > > > be > > > > >> and > > > > >> >> > needs > > > > >> >> > > to > > > > >> >> > > > > be left in hudi-client-spark vs hudi-client-core > > > > >> >> > > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > > >> >> > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang < > > > > >> [email protected]> > > > > >> >> > > wrote: > > > > >> >> > > > > > > > > >> >> > > > > > Hi Nick and Taher, > > > > >> >> > > > > > > > > > >> >> > > > > > I just want to answer Nishith's question. Reference > > his > > > old > > > > >> >> > > description > > > > >> >> > > > > > here: > > > > >> >> > > > > > > > > > >> >> > > > > > > You can do a parallel investigation while we are > > > deciding > > > > >> on > > > > >> >> the > > > > >> >> > > > module > > > > >> >> > > > > > structure. You could be looking at all the patterns > in > > > > >> Hudi's > > > > >> >> > Spark > > > > >> >> > > > APIs > > > > >> >> > > > > > usage (RDD/DataSource/SparkContext) and see if such > > > support > > > > >> can > > > > >> >> be > > > > >> >> > > > > achieved > > > > >> >> > > > > > in theory with Flink. If not, what is the workaround. > > > > >> >> Documenting > > > > >> >> > > such > > > > >> >> > > > > > patterns would be valuable when multiple engineers > are > > > > >> working > > > > >> >> on > > > > >> >> > it. 
> > > > >> >> > > > For > > > > >> >> > > > > > e:g, Hudi relies on (a) custom partitioning logic for > > > > >> >> upserts, > > > > >> >> > > > > (b) > > > > >> >> > > > > > caching RDDs to avoid reruns of costly stages (c) A > > Spark > > > > >> >> > upsert > > > > >> >> > > > task > > > > >> >> > > > > > knowing its spark partition/task/attempt ids > > > > >> >> > > > > > > > > > >> >> > > > > > And just like the title of this thread, we are going > > to > > > try > > > > >> to > > > > >> >> > > decouple > > > > >> >> > > > > > Hudi and Spark. That means we can run the whole Hudi > > > > without > > > > >> >> > > depending > > > > >> >> > > > > > Spark. So we need to analyze all the usage of Spark > in > > > > Hudi. > > > > >> >> > > > > > > > > > >> >> > > > > > Here we are not discussing the integration of Hudi > and > > > > Flink > > > > >> in > > > > >> >> the > > > > >> >> > > > > > application layer. Instead, I want Hudi to be > > decoupled > > > > from > > > > >> >> Spark > > > > >> >> > > and > > > > >> >> > > > > > allow other engines (such as Flink) to replace Spark. > > > > >> >> > > > > > > > > > >> >> > > > > > It can be divided into long-term goals and short-term > > > > goals. > > > > >> As > > > > >> >> > > Nishith > > > > >> >> > > > > > stated in a recent email. > > > > >> >> > > > > > > > > > >> >> > > > > > I mentioned the Flink Batch API here because Hudi can > > > > connect > > > > >> >> with > > > > >> >> > > many > > > > >> >> > > > > > different Source/Sinks. Some file-based reads are not > > > > >> >> appropriate > > > > >> >> > for > > > > >> >> > > > > Flink > > > > >> >> > > > > > Streaming. > > > > >> >> > > > > > > > > > >> >> > > > > > Therefore, this is a comprehensive survey of the use > > of > > > > >> Spark in > > > > >> >> > > Hudi. 
> > > > >> >> > > > > > > > > > >> >> > > > > > Best, > > > > >> >> > > > > > Vino > > > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > taher koitawala <[email protected]> 于2019年8月13日周二 > > > > 下午5:43写道: > > > > >> >> > > > > > > > > > >> >> > > > > > > Hi Vino, > > > > >> >> > > > > > > According to what I've seen Hudi has a lot of spark > > > > >> >> > component > > > > >> >> > > > > > flowing > > > > >> >> > > > > > > throwing it. Like Taskcontexts, JavaSparkContexts > > etc. > > > > The > > > > >> >> main > > > > >> >> > > > > classes I > > > > >> >> > > > > > > guess we should focus upon is HoodieTable and > Hoodie > > > > write > > > > >> >> > clients. > > > > >> >> > > > > > > > > > > >> >> > > > > > > Also Vino, I don't think we should be providing > > Flink > > > > >> dataset > > > > >> >> > > > > > > implementation. We should only stick to Flink > > > Streaming. > > > > >> >> > > > > > > Furthermore if there is a requirement for > > > > >> batch > > > > >> >> > then > > > > >> >> > > > > users > > > > >> >> > > > > > > should use Spark or then we will anyway have a beam > > > > >> >> integration > > > > >> >> > > > coming > > > > >> >> > > > > > up. > > > > >> >> > > > > > > > > > > >> >> > > > > > > As of cache, How about we write our stateful Flink > > > > function > > > > >> >> and > > > > >> >> > use > > > > >> >> > > > > > > RocksDbStateBackend with some state TTL. 
> > > > >> >> > > > > > > > > > > >> >> > > > > > > On Tue, Aug 13, 2019, 2:28 PM vino yang < > > > > >> >> [email protected]> > > > > >> >> > > > wrote: > > > > >> >> > > > > > > > > > > >> >> > > > > > > > Hi all, > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > After doing some research, let me share my > > > information: > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > - Limitation of computing engine capabilities: > > Hudi > > > > >> uses > > > > >> >> > > Spark's > > > > >> >> > > > > > > > RDD#persist, and Flink currently has no API to > > cache > > > > >> >> > datasets. > > > > >> >> > > > > Maybe > > > > >> >> > > > > > > we > > > > >> >> > > > > > > > can > > > > >> >> > > > > > > > only choose to use external storage or do not use > > > > >> cache? > > > > >> >> For > > > > >> >> > > the > > > > >> >> > > > > use > > > > >> >> > > > > > > of > > > > >> >> > > > > > > > other APIs, the two currently offer almost > > equivalent > > > > >> >> > > > > capabilities. > > > > >> >> > > > > > > > - The abstraction of the computing engine is > > > > >> different: > > > > >> >> > > > > Considering > > > > >> >> > > > > > > the > > > > >> >> > > > > > > > different usage scenarios of the computing engine > > in > > > > >> >> Hudi, > > > > >> >> > > Flink > > > > >> >> > > > > has > > > > >> >> > > > > > > not > > > > >> >> > > > > > > > yet implemented stream batch unification, so we > > may > > > > >> use > > > > >> >> both > > > > >> >> > > > > Flink's > > > > >> >> > > > > > > > DataSet API (batch processing) and DataStream API > > > > >> (stream > > > > >> >> > > > > > processing). 
> > > > >> >> > > > > > > > > > > > >> >> > > > > > > > Best, > > > > >> >> > > > > > > > Vino > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > nishith agarwal <[email protected]> > > 于2019年8月8日周四 > > > > >> >> 上午12:57写道: > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > Nick, > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > You bring up a good point about the non-trivial > > > > >> >> programming > > > > >> >> > > model > > > > >> >> > > > > > > > > differences between these different > > technologies. > > > > From > > > > >> a > > > > >> >> > > > > theoretical > > > > >> >> > > > > > > > > perspective, I'd say considering a higher level > > > > >> >> abstraction > > > > >> >> > > makes > > > > >> >> > > > > > > sense. > > > > >> >> > > > > > > > I > > > > >> >> > > > > > > > > think we have to decouple some objectives and > > > > concerns > > > > >> >> here. > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > a) The immediate desire is to have Hudi be able > > to > > > > run > > > > >> on > > > > >> >> a > > > > >> >> > > Flink > > > > >> >> > > > > (or > > > > >> >> > > > > > > > > non-spark) engine. This naturally begs the > > question > > > > of > > > > >> >> > > decoupling > > > > >> >> > > > > > Hudi > > > > >> >> > > > > > > > > concepts from direct Spark dependencies. > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > b) If we do want to initiate the above effort, > > > would > > > > it > > > > >> >> make > > > > >> >> > > > sense > > > > >> >> > > > > to > > > > >> >> > > > > > > > just > > > > >> >> > > > > > > > > have a higher level abstraction, building on > > other > > > > >> >> > technologies > > > > >> >> > > > > like > > > > >> >> > > > > > > beam > > > > >> >> > > > > > > > > (euphoria etc) and provide single, clean API's > > that > > > > >> may be > > > > >> >> > more > > > > >> >> > > > > > > > > maintainable from a code perspective. 
But at > the > > > same > > > > >> time > > > > >> >> > this > > > > >> >> > > > > will > > > > >> >> > > > > > > > > introduce challenges on how to maintain > > efficiency > > > > and > > > > >> >> > > optimized > > > > >> >> > > > > > > runtime > > > > >> >> > > > > > > > > dags for Hudi (since the code would move away > > from > > > > >> point > > > > >> >> > > > > integrations > > > > >> >> > > > > > > and > > > > >> >> > > > > > > > > whenever this happens, tuning natively for > > specific > > > > >> >> engines > > > > >> >> > > > becomes > > > > >> >> > > > > > > more > > > > >> >> > > > > > > > > and more difficult). > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > My general opinion is that, as the community > > grows > > > > over > > > > >> >> time > > > > >> >> > > with > > > > >> >> > > > > > more > > > > >> >> > > > > > > > > folks having an in-depth understanding of Hudi, > > > going > > > > >> from > > > > >> >> > > > > > > current_state > > > > >> >> > > > > > > > -> > > > > >> >> > > > > > > > > (a) -> (b) might be the most reliable and > > adoptable > > > > >> path > > > > >> >> for > > > > >> >> > > this > > > > >> >> > > > > > > > project. > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > Thanks, > > > > >> >> > > > > > > > > Nishith > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng < > > > > >> >> > > > > > [email protected]> > > > > >> >> > > > > > > > > wrote: > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > There are some not trivial difference between > > > > >> >> programming > > > > >> >> > > model > > > > >> >> > > > > and > > > > >> >> > > > > > > > > > runtime semantics between Beam, Spark and > > Flink. 
> > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > >> >> > > > > > > >> >> > > > > > >> >> > > > > >> > > > > > > > > > > https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Nitish, Vino - thoughts? > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Does it feel to consider a higher level > > > > abstraction / > > > > >> >> DSL > > > > >> >> > > > instead > > > > >> >> > > > > > of > > > > >> >> > > > > > > > > > maintaining different code with same > > > functionality > > > > >> but > > > > >> >> > > > different > > > > >> >> > > > > > > > > > programming models ? > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> https://beam.apache.org/documentation/sdks/java/euphoria/ > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Nick > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > On August 6, 2019 at 4:04 PM nishith agarwal > < > > > > >> >> > > > > [email protected]> > > > > >> >> > > > > > > > > wrote: > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > +1 for Approach 1 Point integration with each > > > > >> framework. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Pros for point integration > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > - Hudi community is already familiar with > > spark > > > > >> and > > > > >> >> > spark > > > > >> >> > > > > based > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > actions/shuffles etc. 
Since both modules can > > be > > > > >> >> decoupled, > > > > >> >> > > this > > > > >> >> > > > > > > enables > > > > >> >> > > > > > > > > us > > > > >> >> > > > > > > > > > to have a steady release for Hudi for 1 > > execution > > > > >> engine > > > > >> >> > > > (spark) > > > > >> >> > > > > > > while > > > > >> >> > > > > > > > we > > > > >> >> > > > > > > > > > hone our skills and iterate on making flink > > dag > > > > >> >> optimized, > > > > >> >> > > > > > performant > > > > >> >> > > > > > > > > with > > > > >> >> > > > > > > > > > the right configuration. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > - This might be a stepping stone towards > > > rewriting > > > > >> >> the > > > > >> >> > > > entire > > > > >> >> > > > > > code > > > > >> >> > > > > > > > > base > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > being agnostic of spark/flink. This approach > > will > > > > >> help > > > > >> >> us > > > > >> >> > fix > > > > >> >> > > > > > tests, > > > > >> >> > > > > > > > > > intricacies and help make the code base ready > > > for a > > > > >> >> larger > > > > >> >> > > > > rework. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > - Seems like the easiest way to add flink > > support > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Cons > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > - More code paths to maintain and reason > since > > > the > > > > >> >> spark > > > > >> >> > > and > > > > >> >> > > > > > flink > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > integrations will naturally diverge over > time. 
> > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Theoretically, I do like the idea of being > > able > > > to > > > > >> run > > > > >> >> the > > > > >> >> > > hudi > > > > >> >> > > > > dag > > > > >> >> > > > > > > on > > > > >> >> > > > > > > > > beam > > > > >> >> > > > > > > > > > more than point integrations, where there is > > one > > > > >> >> API/logic > > > > >> >> > to > > > > >> >> > > > > > reason > > > > >> >> > > > > > > > > about. > > > > >> >> > > > > > > > > > But practically, that may not be the right > > > > direction. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Pros > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > - Lesser cognitive burden in maintaining, > > > evolving > > > > >> >> and > > > > >> >> > > > > releasing > > > > >> >> > > > > > > the > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > project with one API to reason with. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > - Theoretically, going forward assuming beam > > is > > > > >> >> adopted > > > > >> >> > > as a > > > > >> >> > > > > > > > standard > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > programming paradigm for stream/batch, this > > would > > > > >> enable > > > > >> >> > > > > consumers > > > > >> >> > > > > > > > > leverage > > > > >> >> > > > > > > > > > the power of hudi more easily. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Cons > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > - Massive rewrite of the code base. 
> > Additionally, > > > > >> >> since > > > > >> >> > we > > > > >> >> > > > > would > > > > >> >> > > > > > > > have > > > > >> >> > > > > > > > > > moved > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > away from directly using spark APIs, there is > > a > > > > >> bigger > > > > >> >> risk > > > > >> >> > > of > > > > >> >> > > > > > > > > regression. > > > > >> >> > > > > > > > > > We would have to be very thorough with all > the > > > > >> >> intricacies > > > > >> >> > > and > > > > >> >> > > > > > ensure > > > > >> >> > > > > > > > the > > > > >> >> > > > > > > > > > same stability of new releases. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > - Managing future features (which may be very > > > > >> spark > > > > >> >> > > driven) > > > > >> >> > > > > will > > > > >> >> > > > > > > > > either > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > clash or pause or will need to be reworked. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > - Tuning jobs for Spark/Flink type execution > > > > >> >> frameworks > > > > >> >> > > > > > > individually > > > > >> >> > > > > > > > > > might > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > be difficult and will get difficult over time > > as > > > > the > > > > >> >> > project > > > > >> >> > > > > > evolves, > > > > >> >> > > > > > > > > where > > > > >> >> > > > > > > > > > some beam integrations with spark/flink may > > not > > > > work > > > > >> as > > > > >> >> > > > expected. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > - Also, as pointed above, need to probably > > > support > > > > >> >> the > > > > >> >> > > > > > > hoodie-spark > > > > >> >> > > > > > > > > > module > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > as a first-class. 
> > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Thank, > > > > >> >> > > > > > > > > > Nishith > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > On Tue, Aug 6, 2019 at 9:48 AM taher > koitawala > > < > > > > >> >> > > > > [email protected] > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > wrote: > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Hi Vinoth, > > > > >> >> > > > > > > > > > Are there some tasks I can take up to ramp up > > the > > > > >> code? > > > > >> >> > Want > > > > >> >> > > to > > > > >> >> > > > > get > > > > >> >> > > > > > > > > > more used to the code and understand the > > existing > > > > >> >> > > > implementation > > > > >> >> > > > > > > > better. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Thanks, > > > > >> >> > > > > > > > > > Taher Koitawala > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar > < > > > > >> >> > > > [email protected]> > > > > >> >> > > > > > > > wrote: > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Let's see if others have any thoughts as > well. > > We > > > > can > > > > >> >> plan > > > > >> >> > to > > > > >> >> > > > fix > > > > >> >> > > > > > the > > > > >> >> > > > > > > > > > approach by EOW. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > On Mon, Aug 5, 2019 at 7:06 PM vino yang < > > > > >> >> > > > [email protected]> > > > > >> >> > > > > > > > wrote: > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Hi guys, > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > Also, +1 for Approach 1 like Taher. > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > If we can do a comprehensive analysis of this > > > model > > > > >> and > > > > >> >> > come > > > > >> >> > > up > > > > >> >> > > > > > with. 
means to refactor this cleanly, this would be promising.

Yes, when we reach a conclusion, we can start this work.

Best,
Vino

taher koitawala <[email protected]> wrote on Tue, Aug 6, 2019, 12:28 AM:

+1 for Approach 1, point integration with each framework.

Approach 2 has a problem, as you said: "Developers need to think about what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this may not be the panacea that it seems to be."

We have seen various pipelines in the Beam DAG being expressed differently than we had them in our original use case. Also, switching between the Spark and Flink runners in Beam has various impacts on the pipelines; for example, some features available in Flink are not available on the Spark runner. Refer to this capability matrix -> https://beam.apache.org/documentation/runners/capability-matrix/

Hence my vote is for Approach 1: let's decouple and build the abstraction for each framework. That is a much better option, and we will also have more control over each framework's implementation.
On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <[email protected]> wrote:

Would like to highlight that there are two distinct approaches here, with different tradeoffs. Think of this as my braindump, as I have been thinking about this quite a bit in the past.

*Approach 1: Point integration with each framework*

> We may need a pure client module named, for example, hoodie-client-core (common).
> Then we could have: hoodie-client-spark, hoodie-client-flink and hoodie-client-beam.

(+) This is the safest to do IMO, since we can isolate the current Spark execution (hoodie-spark, hoodie-client-spark) from the changes for Flink, while it stabilizes over a few releases.
(-) Downside is that the utilities need to be redone: hoodie-utilities-spark and hoodie-utilities-flink and hoodie-utilities-core? hoodie-cli?

If we can do a comprehensive analysis of this model and come up with means to refactor this cleanly, this would be promising.
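To make the decoupling idea concrete, here is a minimal sketch of what a hoodie-client-core abstraction could look like. All names here (EngineContext, LocalEngineContext, EngineContextSketch) are hypothetical illustrations, not an actual or proposed module layout: the engine-specific modules (hoodie-client-spark, hoodie-client-flink, hoodie-client-beam) would each supply an implementation, while the core write/index logic is written only against the interface.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical engine abstraction: core client code calls this instead of
// Spark APIs directly. A Spark-backed implementation would delegate to
// JavaSparkContext, a Flink one to the DataStream API, and so on.
interface EngineContext {
    // Parallel map over a collection of inputs, e.g. file groups to write.
    <I, O> List<O> map(List<I> input, Function<I, O> fn, int parallelism);
}

// Trivial single-JVM implementation, useful for unit-testing the core logic
// without bringing up any engine.
class LocalEngineContext implements EngineContext {
    @Override
    public <I, O> List<O> map(List<I> input, Function<I, O> fn, int parallelism) {
        return input.parallelStream().map(fn).collect(Collectors.toList());
    }
}

public class EngineContextSketch {
    public static void main(String[] args) {
        EngineContext ctx = new LocalEngineContext();
        // Example: compute a value per "file group" in parallel.
        List<Integer> sizes = ctx.map(List.of("file1", "file22", "file333"), String::length, 2);
        System.out.println(sizes);
    }
}
```

The point of the sketch is only that the engine choice becomes an implementation detail behind one interface, which is what would let the utilities and CLI stay engine-agnostic.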
*Approach 2: Beam as the compute abstraction*

Another, more drastic, approach is to remove Spark as the compute abstraction for writing data and replace it with Beam.

(+) All of the code remains more or less similar, and there is one compute API to reason about.
(-) The (very big) assumption here is that we are able to tune the runtime the same way using Beam as we do with Spark: custom partitioners, support for all the RDD operations we invoke, caching, etc.
(-) It will be a massive rewrite, and testing such a large rewrite would also be really challenging, since we need to pay attention to all the intricate details to ensure the Spark users of today experience no regressions/side-effects.
(-) Note that we would still probably need to support the hoodie-spark module, and maybe a first-class such integration with Flink, for native Flink/Spark pipeline authoring. Users of, say, DeltaStreamer need to pass in Spark or Flink configs anyway. Developers need to think about what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this may not be the panacea that it seems to be.

One goal for the HIP is to get us all to agree as a community on which one to pick, with sufficient investigation, testing, and benchmarking..
On Sat, Aug 3, 2019 at 7:56 PM vino yang <[email protected]> wrote:

+1 for both Beam and Flink.

> First step here is to probably draw out the current hierarchy and
> figure out what the abstraction points are.. In my opinion, the runtime
> (spark, flink) should be done at the hoodie-client level and just used
> by hoodie-utilities seamlessly..

+1 for Vinoth's opinion; it should be the first step. No matter which computing framework we hope Hudi to integrate with, we need to decouple the Hudi client from Spark.

We may need a pure client module named, for example, hoodie-client-core (common). Then we could have: hoodie-client-spark, hoodie-client-flink and hoodie-client-beam.

Suneel Marthi <[email protected]> wrote on Sun, Aug 4, 2019, 10:45 AM:

+1 for Beam -- agree with Semantic Beeng's analysis.

On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <[email protected]> wrote:

So the way to go around this is to file a HIP, chalk out all the classes, and start moving towards a pure client.

Secondly, should we want to try Beam? I think there is too much going on here and I'm not able to follow.
If we want to try out Beam all along, I don't think it makes sense to do anything on Flink then.

On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <[email protected]> wrote:

+1. My money is on this approach.

The existing abstractions from Beam seem enough for the use cases as I imagine them.

Flink also has "dynamic table", "table source" and "table sink", which seem like very useful abstractions where Hudi might fit nicely:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html

Attached a screenshot. This seems to fit with the original premise of Hudi as well.

Am exploring this avenue with a use case that involves "temporal joins on streams", which I need for feature extraction. If anyone is interested in this or has concrete enough needs and use cases, please let me know. Best to go from an agreed-upon set of 2-3 use cases.
Cheers

Nick

> Also, we do have some Beam experts on the mailing list.. Can you
> please weigh in on the viability of using Beam as the intermediate
> abstraction here between Spark/Flink? Hudi uses RDD APIs like groupBy,
> mapToPair, sortAndRepartition, reduceByKey, countByKey and also does
> custom partitioning a lot.
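To make concrete what any replacement engine abstraction would have to cover, here is a sketch that mimics the semantics of two of the pair-RDD operations mentioned above, reduceByKey and countByKey, with plain java.util.stream. The class and helper names are hypothetical; the local implementations are only illustrative stand-ins for the distributed Spark versions:

```java
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

// Local, single-JVM stand-ins for pair-RDD operations the Hudi client
// invokes on Spark. Per-framework modules (or Beam) would need equivalents
// with the same key-based semantics.
public class PairOpsSketch {

    // reduceByKey: merge all values sharing a key with a reduce function.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> reducer) {
        return pairs.stream().collect(
            Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, reducer));
    }

    // countByKey: number of records per key.
    static <K, V> Map<K, Long> countByKey(List<Map.Entry<K, V>> pairs) {
        return pairs.stream().collect(
            Collectors.groupingBy(Map.Entry::getKey, Collectors.counting()));
    }

    public static void main(String[] args) {
        // e.g. (partitionPath, recordsWritten) pairs emitted by writers
        List<Map.Entry<String, Integer>> pairs = List.of(
            Map.entry("2019/08/05", 10),
            Map.entry("2019/08/05", 5),
            Map.entry("2019/08/06", 7));

        System.out.println(reduceByKey(pairs, Integer::sum)); // totals per partition
        System.out.println(countByKey(pairs));                // record count per partition
    }
}
```

The custom-partitioning and caching behaviors called out in the quote are the harder part: they have no direct analogue in Beam's model, which is exactly the tuning risk raised under Approach 2.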
