Hey guys, any thoughts on the above idea? To handle HoodieBloomIndex with HeapState, RocksDBState and FsState, but on Spark.
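To make the idea above concrete, here is a minimal sketch of what a pluggable index-state abstraction could look like. Every name in it (IndexState, HeapIndexState) is hypothetical rather than an existing Hudi API; RocksDB- and DFS-backed variants would implement the same interface.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    // Hypothetical abstraction over where the bloom-index cache of
    // record key -> partition path lives: heap, RocksDB, or a DFS.
    interface IndexState {
      void put(String recordKey, String partitionPath);
      Optional<String> get(String recordKey);
    }

    // Simplest backend: an on-heap map, for indexes that fit in memory.
    class HeapIndexState implements IndexState {
      private final Map<String, String> cache = new HashMap<>();

      @Override
      public void put(String recordKey, String partitionPath) {
        cache.put(recordKey, partitionPath);
      }

      @Override
      public Optional<String> get(String recordKey) {
        return Optional.ofNullable(cache.get(recordKey));
      }
    }

An index TTL would then be a policy layered on top of any backend, evicting entries that have not been read or written recently.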
On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala <taher...@gmail.com> wrote:

Hi Vinoth,

Having seen the doc and the code, I understand that HoodieBloomIndex mainly caches the record key and partition path. Can we address how Flink does it? That is, have a HeapState where the user chooses to cache the index on heap, a RocksDBState where indexes are written to RocksDB, and finally an FsState where indexes can be written to HDFS, S3 or Azure FS. On top of that, we can add an index time-to-live (TTL).

Regards,
Taher Koitawala

On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar <vin...@apache.org> wrote:

I still feel the key thing here is reimplementing HoodieBloomIndex without needing Spark caching.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global) documents the Spark DAG in detail.

If everyone feels it's best for me to scope the work out, then I'm happy to do it!

On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <taher...@gmail.com> wrote:

Guys, I think we are slowing down on this again. We need to start planning small tasks towards this. VC, can you please help fast-track this?

Regards,
Taher Koitawala

On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar <vin...@apache.org> wrote:

Look forward to the analysis. A key class to read would be HoodieBloomIndex, which uses a lot of Spark caching and shuffles.

On Tue, Aug 13, 2019 at 7:52 PM vino yang <yanghua1...@gmail.com> wrote:

> Currently Spark Streaming micro-batching fits well with Hudi, since it amortizes the cost of indexing, workload profiling etc.: 1 Spark micro-batch = 1 Hudi commit. With the per-record model in Flink, I am not sure how useful it will be to support Hudi. For e.g., 1 input record cannot be 1 Hudi commit; it would be inefficient.

Yes, if 1 input record = 1 Hudi commit, it would be inefficient. Regarding Flink streaming, we can also implement a "batch" or "micro-batch" model when processing data. For example:

- aggregation: use Flink's flexible window mechanism;
- non-aggregation: use Flink's stateful state API to cache a batch of data.

> On first focussing on decoupling of Spark and Hudi alone, yes, a full summary of how Spark is being used in a wiki page is a good start IMO. We can then hash out what can be generalized and what cannot be, and what needs to be left in hudi-client-spark vs hudi-client-core.

Agree.
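A minimal sketch of that second, non-aggregation idea on Flink's DataStream API: buffer the unbounded stream into fixed-size batches so that each emitted list could back one Hudi commit, mirroring the one-micro-batch-equals-one-commit model. The element type and the 500-record trigger are arbitrary assumptions.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
    import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
    import org.apache.flink.util.Collector;

    public class MicroBatchSketch {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> records = env.socketTextStream("localhost", 9999);

        // Collect the unbounded stream into fixed-size batches; each batch
        // would then be handed to a single Hudi upsert, i.e. one commit.
        DataStream<List<String>> batches = records
            .countWindowAll(500) // arbitrary batch size; a time window works too
            .apply(new AllWindowFunction<String, List<String>, GlobalWindow>() {
              @Override
              public void apply(GlobalWindow window, Iterable<String> values,
                                Collector<List<String>> out) {
                List<String> batch = new ArrayList<>();
                values.forEach(batch::add);
                out.collect(batch);
              }
            });

        batches.print();
        env.execute("micro-batch sketch");
      }
    }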
On Wed, Aug 14, 2019 at 8:35 AM, Vinoth Chandar <vin...@apache.org> wrote:

> We should only stick to Flink Streaming. Furthermore, if there is a requirement for batch, then users should use Spark, or we will anyway have a Beam integration coming up.

Currently Spark Streaming micro-batching fits well with Hudi, since it amortizes the cost of indexing, workload profiling etc.: 1 Spark micro-batch = 1 Hudi commit. With the per-record model in Flink, I am not sure how useful it will be to support Hudi. For e.g., 1 input record cannot be 1 Hudi commit; it would be inefficient.

On first focussing on decoupling of Spark and Hudi alone, yes, a full summary of how Spark is being used in a wiki page is a good start IMO. We can then hash out what can be generalized and what cannot be, and what needs to be left in hudi-client-spark vs hudi-client-core.

On Tue, Aug 13, 2019 at 3:57 AM vino yang <yanghua1...@gmail.com> wrote:

Hi Nick and Taher,

I just want to answer Nishith's question. Referencing his old description here:

> You can do a parallel investigation while we are deciding on the module structure. You could be looking at all the patterns in Hudi's Spark API usage (RDD/DataSource/SparkContext) and see if such support can be achieved in theory with Flink. If not, what is the workaround? Documenting such patterns would be valuable when multiple engineers are working on it. For e.g., Hudi relies on (a) custom partitioning logic for upserts, (b) caching RDDs to avoid reruns of costly stages, (c) a Spark upsert task knowing its Spark partition/task/attempt ids.

And just like the title of this thread, we are going to try to decouple Hudi and Spark. That means we could run the whole of Hudi without depending on Spark. So we need to analyze all the usages of Spark in Hudi.

Here we are not discussing the integration of Hudi and Flink at the application layer. Instead, I want Hudi to be decoupled from Spark, allowing other engines (such as Flink) to replace Spark.

It can be divided into long-term goals and short-term goals, as Nishith stated in a recent email.

I mentioned the Flink Batch API here because Hudi can connect with many different sources/sinks, and some file-based reads are not appropriate for Flink Streaming.

Therefore, this is a comprehensive survey of the use of Spark in Hudi.

Best,
Vino

On Tue, Aug 13, 2019 at 5:43 PM, taher koitawala <taher...@gmail.com> wrote:

Hi Vino,

From what I've seen, Hudi has a lot of Spark components flowing through it, like TaskContexts, JavaSparkContexts, etc. The main classes I guess we should focus on are HoodieTable and the Hoodie write clients.

Also, Vino, I don't think we should be providing a Flink DataSet implementation. We should only stick to Flink Streaming. Furthermore, if there is a requirement for batch, then users should use Spark, or we will anyway have a Beam integration coming up.

As for the cache, how about we write our own stateful Flink function and use the RocksDBStateBackend with some state TTL?
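That last suggestion maps directly onto Flink's state TTL support. A minimal sketch, assuming the function runs on a stream keyed by record key; the 12-hour TTL and the lookup placeholder are arbitrary:

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.configuration.Configuration;

    // Hypothetical keyed function: caches the partition path per record key
    // in Flink state. With the RocksDBStateBackend configured on the job,
    // this state lives in RocksDB instead of on the heap.
    public class IndexLookupFunction extends RichMapFunction<String, String> {
      private transient ValueState<String> partitionPathState;

      @Override
      public void open(Configuration parameters) {
        StateTtlConfig ttlConfig = StateTtlConfig
            .newBuilder(Time.hours(12)) // arbitrary TTL for cached index entries
            .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
            .build();

        ValueStateDescriptor<String> descriptor =
            new ValueStateDescriptor<>("partitionPath", String.class);
        descriptor.enableTimeToLive(ttlConfig);
        partitionPathState = getRuntimeContext().getState(descriptor);
      }

      @Override
      public String map(String recordKey) throws Exception {
        String cached = partitionPathState.value();
        if (cached == null) {
          cached = "<looked-up-from-files>"; // placeholder for a real file-based lookup
          partitionPathState.update(cached);
        }
        return cached;
      }
    }

Applied as keyedStream.map(new IndexLookupFunction()), the same code runs against the heap or RocksDB depending on the state backend the job sets via env.setStateBackend(...).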
On Tue, Aug 13, 2019, 2:28 PM vino yang <yanghua1...@gmail.com> wrote:

Hi all,

After doing some research, let me share my findings:

- Limitation of computing-engine capabilities: Hudi uses Spark's RDD#persist, and Flink currently has no API to cache datasets. Maybe we can only choose to use external storage, or do without a cache? For the use of other APIs, the two engines currently offer almost equivalent capabilities.
- The abstractions of the computing engines are different: considering the different usage scenarios of the computing engine in Hudi, and that Flink has not yet unified stream and batch processing, we may have to use both Flink's DataSet API (batch processing) and DataStream API (stream processing).

Best,
Vino
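The RDD#persist point refers to patterns like the following generic sketch (stand-in code, not actual HoodieBloomIndex logic): an expensive intermediate result feeds several actions and would be recomputed by each of them without the cache.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class PersistPatternSketch {
      public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local[2]", "persist-sketch");

        // Stand-in for an expensive stage, e.g. checking keys against bloom filters.
        JavaRDD<String> tagged = jsc
            .parallelize(Arrays.asList("k1", "k2", "k3"))
            .map(key -> key + ":partition");

        // Pin the result so the two actions below do not re-run the stage.
        tagged.persist(StorageLevel.MEMORY_AND_DISK_SER());

        long total = tagged.count();          // first action computes the stage
        List<String> sample = tagged.take(2); // second action is served from cache

        System.out.println(total + " " + sample);
        tagged.unpersist();
        jsc.stop();
      }
    }

On Flink, the closest substitutes would be keeping such results in keyed state or writing them out to external storage, as noted above.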
On Thu, Aug 8, 2019 at 12:57 AM, nishith agarwal <n3.nas...@gmail.com> wrote:

Nick,

You bring up a good point about the non-trivial programming-model differences between these technologies. From a theoretical perspective, I'd say considering a higher-level abstraction makes sense. I think we have to decouple some objectives and concerns here.

a) The immediate desire is to have Hudi be able to run on a Flink (or non-Spark) engine. This naturally begs the question of decoupling Hudi concepts from direct Spark dependencies.

b) If we do want to initiate the above effort, would it make sense to just have a higher-level abstraction, building on other technologies like Beam (Euphoria etc.), and provide a single, clean API that may be more maintainable from a code perspective? At the same time, this will introduce challenges in maintaining efficiency and optimized runtime DAGs for Hudi, since the code would move away from point integrations, and whenever this happens, tuning natively for specific engines becomes more and more difficult.

My general opinion is that, as the community grows over time with more folks having an in-depth understanding of Hudi, going from current_state -> (a) -> (b) might be the most reliable and adoptable path for this project.

Thanks,
Nishith

On Tue, Aug 6, 2019 at 1:30 PM Semantic Beeng <n...@semanticbeeng.com> wrote:

There are some non-trivial differences between the programming models and runtime semantics of Beam, Spark and Flink:

https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-how

Nishith, Vino - thoughts?

Does it make sense to consider a higher-level abstraction / DSL instead of maintaining different code with the same functionality but different programming models?

https://beam.apache.org/documentation/sdks/java/euphoria/

Nick

On August 6, 2019 at 4:04 PM nishith agarwal <n3.nas...@gmail.com> wrote:

+1 for Approach 1, point integration with each framework.

Pros for point integration:

- The Hudi community is already familiar with Spark and Spark-based actions/shuffles etc. Since the modules can be decoupled, this enables us to keep a steady Hudi release for one execution engine (Spark) while we hone our skills and iterate on making the Flink DAG optimized and performant with the right configuration.
- This might be a stepping stone towards rewriting the entire code base to be agnostic of Spark/Flink. This approach will help us fix tests and intricacies, and help make the code base ready for a larger rework.
- Seems like the easiest way to add Flink support.

Cons:

- More code paths to maintain and reason about, since the Spark and Flink integrations will naturally diverge over time.

Theoretically, I do like the idea of being able to run the Hudi DAG on Beam more than point integrations, since there is one API/logic to reason about. But practically, that may not be the right direction.
Pros:

- Less cognitive burden in maintaining, evolving and releasing the project, with one API to reason with.
- Theoretically, going forward, assuming Beam is adopted as a standard programming paradigm for stream/batch, this would enable consumers to leverage the power of Hudi more easily.

Cons:

- A massive rewrite of the code base. Additionally, since we would have moved away from directly using Spark APIs, there is a bigger risk of regression. We would have to be very thorough with all the intricacies and ensure the same stability of new releases.
- Managing future features (which may be very Spark-driven) will either clash, or pause, or need to be reworked.
- Tuning jobs individually for Spark/Flink-style execution frameworks might be difficult, and will get harder over time as the project evolves, where some Beam integrations with Spark/Flink may not work as expected.
- Also, as pointed out above, we probably still need to support the hoodie-spark module as first-class.

Thanks,
Nishith

On Tue, Aug 6, 2019 at 9:48 AM taher koitawala <taher...@gmail.com> wrote:

Hi Vinoth,

Are there some tasks I can take up to ramp up on the code? I want to get more used to the code and understand the existing implementation better.

Thanks,
Taher Koitawala

On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar <vin...@apache.org> wrote:

Let's see if others have any thoughts as well. We can plan to fix the approach by EOW.

On Mon, Aug 5, 2019 at 7:06 PM vino yang <yanghua1...@gmail.com> wrote:

Hi guys,

Also, +1 for Approach 1, like Taher.
> If we can do a comprehensive analysis of this model and come up with means to refactor this cleanly, this would be promising.

Yes, once we reach that conclusion, we can start this work.

Best,
Vino

On Tue, Aug 6, 2019 at 12:28 AM, taher koitawala <taher...@gmail.com> wrote:

+1 for Approach 1, point integration with each framework.

Approach 2 has a problem, as you said: "Developers need to think about what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this may not be the panacea that it seems to be."

We have seen various pipelines in the Beam DAG being expressed differently than we had them in our original use case. Also, switching between the Spark and Flink runners in Beam has various impacts on the pipelines; for instance, some features available in Flink are not available on the Spark runner. Refer to this compatibility matrix:

https://beam.apache.org/documentation/runners/capability-matrix/

Hence my vote is on Approach 1: let's decouple and build the abstraction for each framework. That is a much better option, and we will also have more control over each framework's implementation.

On Mon, Aug 5, 2019, 9:28 PM Vinoth Chandar <vin...@apache.org> wrote:

Would like to highlight that there are two distinct approaches here with different tradeoffs. Think of this as my braindump, as I have been thinking about this quite a bit in the past.
*Approach 1: Point integration with each framework*

> We may need a pure client module named, for example, hoodie-client-core (common). Then we could have: hoodie-client-spark, hoodie-client-flink and hoodie-client-beam.

(+) This is the safest to do IMO, since we can isolate the current Spark execution (hoodie-spark, hoodie-client-spark) from the changes for Flink, while it stabilizes over a few releases.
(-) Downside is that the utilities need to be redone: hoodie-utilities-spark, hoodie-utilities-flink and hoodie-utilities-core? hoodie-cli?

If we can do a comprehensive analysis of this model and come up with means to refactor this cleanly, this would be promising.

*Approach 2: Beam as the compute abstraction*

Another, more drastic, approach is to remove Spark as the compute abstraction for writing data and replace it with Beam.

(+) All of the code remains more or less similar, and there is one compute API to reason about.
(-) The (very big) assumption here is that we are able to tune the Spark runtime the same way using Beam: custom partitioners, support for all the RDD operations we invoke, caching, etc.
(-) It will be a massive rewrite, and testing such a large rewrite would also be really challenging, since we need to pay attention to all the intricate details to ensure that today's Spark users experience no regressions/side-effects.
(-) Note that we still probably need to support the hoodie-spark module, and maybe a first-class such integration with Flink, for native Flink/Spark pipeline authoring. Users of, say, DeltaStreamer need to pass in Spark or Flink configs anyway.. Developers need to think about what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this may not be the panacea that it seems to be.

One goal for the HIP is to get us all to agree as a community on which one to pick, with sufficient investigation, testing and benchmarking..
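For a flavor of what the Approach 1 split could mean in code, a hedged sketch follows: hoodie-client-core might own an engine-neutral seam such as the hypothetical HoodieEngineContext below, which hoodie-client-spark and hoodie-client-flink would each implement with their engine's parallelism primitives. None of these names are agreed APIs.

    import java.util.List;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    // Hypothetical seam living in hoodie-client-core: an engine-agnostic
    // parallel transformation, implemented once per engine module.
    interface HoodieEngineContext {
      <I, O> List<O> map(List<I> input, Function<I, O> func, int parallelism);
    }

    // Trivial local implementation; a hoodie-client-spark version would instead
    // delegate to jsc.parallelize(input, parallelism).map(func::apply).collect().
    class LocalEngineContext implements HoodieEngineContext {
      @Override
      public <I, O> List<O> map(List<I> input, Function<I, O> func, int parallelism) {
        return input.stream().map(func).collect(Collectors.toList());
      }
    }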
On Sat, Aug 3, 2019 at 7:56 PM vino yang <yanghua1...@gmail.com> wrote:

+1 for both Beam and Flink.

> First step here is to probably draw out the current hierarchy and figure out what the abstraction points are.. In my opinion, the runtime (spark, flink) should be done at the hoodie-client level and just used by hoodie-utilities seamlessly..

+1 for Vinoth's opinion; it should be the first step. No matter which computing framework we hope Hudi will integrate with, we need to decouple the Hudi client and Spark.

We may need a pure client module named, for example, hoodie-client-core (common). Then we could have: hoodie-client-spark, hoodie-client-flink and hoodie-client-beam.

On Sun, Aug 4, 2019 at 10:45 AM, Suneel Marthi <smar...@apache.org> wrote:

+1 for Beam -- agree with Semantic Beeng's analysis.

On Sat, Aug 3, 2019 at 10:30 PM taher koitawala <taher...@gmail.com> wrote:

So the way to go around this is to file a HIP, chalk all the classes out, and start moving towards a pure client.

Secondly, should we want to try Beam?
I think there is too much going on here and I'm not able to follow. If we want to try out Beam all along, I don't think it makes sense to do anything on Flink then.

On Sun, Aug 4, 2019, 2:30 AM Semantic Beeng <n...@semanticbeeng.com> wrote:

+1. My money is on this approach.

The existing abstractions from Beam seem enough for the use cases as I imagine them.

Flink also has "dynamic table", "table source" and "table sink", which seem like very useful abstractions where Hudi might fit nicely:

https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/dynamic_tables.html

Attached a screenshot.

This seems to fit with the original premise of Hudi as well.

Am exploring this avenue with a use case that involves "temporal joins on streams", which I need for feature extraction. Anyone who is interested in this or has concrete enough needs and use cases, please let me know. Best to go from an agreed-upon set of 2-3 use cases.

Cheers,

Nick
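To show how that dynamic-table fit might eventually surface, here is a purely speculative sketch of Hudi as a Flink table sink. The DDL, connector name and options are assumptions; no such connector existed at the time of this thread.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class DynamicTableSketch {
      public static void main(String[] args) {
        TableEnvironment tableEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical Hudi-backed dynamic table that absorbs upserts by key.
        tableEnv.executeSql(
            "CREATE TABLE hudi_sink ("
                + " uuid STRING PRIMARY KEY NOT ENFORCED,"
                + " name STRING,"
                + " ts TIMESTAMP(3)"
                + ") WITH ("
                + " 'connector' = 'hudi'," // assumed connector identifier
                + " 'path' = 'hdfs:///tables/t1',"
                + " 'table.type' = 'MERGE_ON_READ'" // assumed option
                + ")");

        // Each upsert flowing into the dynamic table would land in a Hudi commit.
        tableEnv.executeSql(
            "INSERT INTO hudi_sink VALUES ('id1', 'a', TIMESTAMP '2019-08-04 02:30:00')");
      }
    }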
> Also, we do have some Beam experts on the mailing list.. Can you please weigh in on the viability of using Beam as the intermediate abstraction here between Spark/Flink? Hudi uses RDD APIs like groupBy, mapToPair, sortAndRepartition, reduceByKey, countByKey, and also does custom partitioning a lot.
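Those RDD call sites are the crux of the porting question. A compact, generic sketch of the usage patterns listed in that ask (stand-in code, not HoodieBloomIndex itself):

    import java.util.Arrays;
    import java.util.Map;

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class RddPatternsSketch {
      public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local[2]", "rdd-patterns");

        // Stand-in for (recordKey, partitionPath) pairs produced while indexing.
        JavaPairRDD<String, String> keyToPartition = jsc
            .parallelize(Arrays.asList("k1:p1", "k2:p1", "k1:p2"))
            .mapToPair(s -> new Tuple2<>(s.split(":")[0], s.split(":")[1]));

        // reduceByKey / countByKey: per-key shuffles and aggregate actions.
        JavaPairRDD<String, String> latest = keyToPartition.reduceByKey((a, b) -> b);
        Map<String, Long> counts = keyToPartition.countByKey();

        // Custom partitioning: Hudi routes records with its own Partitioner
        // implementations (a HashPartitioner stands in here) to pack records
        // against the right file groups.
        JavaPairRDD<String, String> routed = latest.partitionBy(new HashPartitioner(4));

        System.out.println(counts + " " + routed.collect());
        jsc.stop();
      }
    }

Each of these maps to a shuffle or an action; the Flink/Beam question is whether equivalents exist with comparable control over partitioning and caching.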