Hi Taher,

As I mentioned in the previous mail, things may not be easy using just the Flink state API. To copy from there: "Hudi can connect with many different Source/Sinks. Some file-based reads are not appropriate for Flink Streaming." Although unifying Batch and Streaming is Flink's goal, it is difficult to ignore the Flink Batch API and still match some features Hudi provides now. The example you provided is at the application layer, about usage. So my suggestion is to be patient; it needs time to produce a detailed design.

Best,
Vino

On 09/24/2019 22:38, Taher Koitawala wrote:

Hi All,

Sample code showing how record tagging would be handled in Flink is posted at [1]. The main class to run is MockHudi.java, with a sample path for checkpointing. As of now this is just a sample to show that we should be caching in Flink state with bare-minimum configs. In my experience I have cached around 10s of TBs in Flink RocksDB state with the right configs, so I'm sure it should work here as well.

1: https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
Regards,
Taher Koitawala

On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar <vin...@apache.org> wrote:

It won't be much different from the HBaseIndex we have today. Would like to always have an option like BloomIndex that does not need any external dependencies. The moment you bring an external data store in, someone becomes a DBA. :)

On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng <n...@semanticbeeng.com> wrote:

@vc can you see how Apache Crail could be used to implement this at scale, but also in a way that abstracts over both Spark and Flink?

"Crail Store implements a hierarchical namespace across a cluster of RDMA interconnected storage resources such as DRAM or flash"

https://crail.incubator.apache.org/overview/

+ 2 cents: https://twitter.com/semanticbeeng/status/1175767500790915072?s=20
Cheers,
Nick

On September 22, 2019 at 9:28 AM Vinoth Chandar <vin...@apache.org> wrote:

It could be much larger. :) Imagine billions of keys, each 32 bytes, mapped to another 32 bytes.

The advantage of the current bloom index is that it's effectively stored with the data itself, and this reduces complexity in terms of keeping index and data consistent, etc.

One orthogonal idea from a long time ago that moves indexing out of data storage and is generalizable: https://github.com/apache/incubator-hudi/wiki/HashMap-Index

If someone here knows Flink well and can implement some standalone Flink code to mimic the tagLocation() functionality and share it with the group, that would be great.
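Such a standalone mimic could start as small as the following sketch. This is plain Java, not code from any Hudi module: the HashMap stands in for what would be a keyed MapState in a real Flink job (backed by heap or RocksDB), and all class, method, and encoding names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: a plain HashMap stands in for Flink's keyed MapState.
// In an actual Flink job this map would be a MapState<String, String> held in
// the configured state backend (e.g. RocksDB), keyed by record key.
class TagLocationSketch {
    // record key -> "partitionPath,fileId" (an illustrative encoding)
    private final Map<String, String> index = new HashMap<>();

    // Mimics tagLocation(): return the known location for a key,
    // or empty if the key has never been written (i.e. it is an insert).
    Optional<String> tag(String recordKey) {
        return Optional.ofNullable(index.get(recordKey));
    }

    // After a successful commit, remember where each key landed.
    void update(String recordKey, String location) {
        index.put(recordKey, location);
    }
}
```

Feeding a keyed stream of record keys through such state, emitting (record, location) pairs for known keys and untagged records for new ones, is essentially the lookup step that tagLocation() performs in the Spark DAG.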
Let's worry about performance once we have a Flink DAG. I think this is the critical and most tricky piece in supporting Flink.

On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil <vinay18.pa...@gmail.com> wrote:

Hi Taher,

I agree with this: if the state becomes too large, we should have an option of storing it in external state like a file system or RocksDB.

@Vinoth Chandar <vin...@apache.org> can the state of HoodieBloomIndex go beyond 10-15 GB?

Regards,
Vinay Patil

On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala <taher...@gmail.com> wrote:

Hey guys, any thoughts on the above idea? To handle HoodieBloomIndex with HeapState, RocksDBState, and FsState, but on Spark.
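The pluggable-backend idea being discussed could be captured behind one small interface. The sketch below is hypothetical (the interface, class names, and TTL handling are illustrative, not Hudi or Flink API): a heap-backed implementation is shown with a simple time-to-live, and RocksDB- or filesystem-backed versions would implement the same two methods.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch of the pluggable index-state idea. Only a heap version
// is shown; RocksDB- or FS-backed implementations would share the interface.
interface IndexState {
    void put(String recordKey, String location);
    String get(String recordKey); // null if unknown or expired
}

class HeapIndexState implements IndexState {
    private static class Entry {
        final String location;
        final long writtenAtMillis;
        Entry(String location, long writtenAtMillis) {
            this.location = location;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable, so tests can fake time

    HeapIndexState(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    @Override
    public void put(String recordKey, String location) {
        cache.put(recordKey, new Entry(location, clock.getAsLong()));
    }

    @Override
    public String get(String recordKey) {
        Entry e = cache.get(recordKey);
        if (e == null || clock.getAsLong() - e.writtenAtMillis > ttlMillis) {
            return null; // expired entries read as absent
        }
        return e.location;
    }
}
```

The TTL keeps the cached index from growing without bound, along the lines of the "index Time To Live" suggestion in this thread.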
On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala <taher...@gmail.com> wrote:

Hi Vinoth,

Having seen the doc and code, I understand the HoodieBloomIndex mainly caches the key and partition path. Can we address how Flink does it? Like, have HeapState where the user chooses to cache the index on heap, RocksDBState where indexes are written to RocksDB, and finally FsState where indexes can be written to HDFS, S3, or Azure FS. And on top, we can do an index Time To Live.

Regards,
Taher Koitawala

On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar
<vin...@apache.org> wrote:

I still feel the key thing here is reimplementing HoodieBloomIndex without needing Spark caching.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global) documents the Spark DAG in detail.

If everyone feels it's best for me to scope the work out, then happy to do it!

On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <taher...@gmail.com> wrote:
Guys, I think we are slowing down on this again. We need to start planning small tasks towards this. VC, please can you help fast track this?

Regards,
Taher Koitawala

On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar <vin...@apache.org> wrote:

Look forward to the analysis. A key class to read would be HoodieBloomIndex, which uses a lot of Spark caching and shuffles.

On Tue, Aug 13, 2019 at 7:52 PM vino yang <yanghua1...@gmail.com>
wrote:

> Currently Spark Streaming micro batching fits well with Hudi, since it amortizes the cost of indexing, workload profiling, etc. 1 Spark micro batch = 1 Hudi commit. With the per-record model in Flink, I am not sure how useful it will be to support Hudi. E.g., 1 input record cannot be 1 Hudi commit; it will be inefficient.

Yes, if 1 input record = 1 Hudi commit, it would be inefficient. About Flink streaming, we can also implement the "batch"
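The batching idea under discussion, buffering the per-record stream and making one commit per batch, might be sketched as follows. This is plain illustrative Java, not Hudi or Flink API: the count-based trigger is an assumption, and a real Flink job would more likely flush on timers or checkpoint boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: buffer per-record input and "commit" once per batch,
// so many input records amortize into a single commit, rather than one
// commit per record. All names and the count trigger are illustrative.
class MicroBatchSketch {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int commits = 0;

    MicroBatchSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    void onRecord(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            commit();
        }
    }

    private void commit() {
        // Stand-in for one Hudi commit covering the whole buffered batch.
        commits++;
        buffer.clear();
    }

    int commitCount() {
        return commits;
    }
}
```

With a batch size of 4, ten records produce two commits with two records still buffered, which is the "1 batch = 1 commit" behavior rather than "1 record = 1 commit".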