Hi Taher,

As I mentioned in the previous mail, things may not be easy using just the Flink state API. To copy from there: "Hudi can connect with many different Source/Sinks. Some file-based reads are not appropriate for Flink Streaming." Although unifying Batch and Streaming is Flink's goal, it is difficult to ignore the Flink Batch API and still match some features Hudi provides now. The example you provided is at the application layer, about usage. So my suggestion is to be patient; it needs time to produce a detailed design.

Best,
Vino

On 09/24/2019 22:38, Taher Koitawala wrote:

Hi All,

Sample code showing how record tagging would be handled in Flink is posted at [1]. The main class to run is MockHudi.java, with a sample path for checkpointing. As of now this is just a sample to show that we should be caching in Flink state with bare-minimum configs. In my experience I have cached around 10s of TBs in Flink RocksDB state with the right configs, so I'm sure it should work here as well.

1: https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
Regards,
Taher Koitawala

On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar <vin...@apache.org> wrote:

It won't be much different from the HBaseIndex we have today. Would like to always have an option like BloomIndex that does not need any external dependencies. The moment you bring an external data store in, someone becomes a DBA. :)

On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng <n...@semanticbeeng.com> wrote:

@vc can you see how Apache Crail could be used to implement this at scale, but also in a way that abstracts over both Spark and Flink?

"Crail Store implements a hierarchical namespace across a cluster of RDMA interconnected storage resources such as DRAM or flash"

https://crail.incubator.apache.org/overview/

+ 2 cents: https://twitter.com/semanticbeeng/status/1175767500790915072?s=20
Cheers,
Nick

On September 22, 2019 at 9:28 AM Vinoth Chandar <vin...@apache.org> wrote:

It could be much larger. :) Imagine billions of keys, each 32 bytes, mapped to another 32 bytes.

The advantage of the current bloom index is that it's effectively stored with the data itself, and this reduces complexity in terms of keeping index and data consistent, etc.

One orthogonal idea from a long time ago that moves indexing out of data storage and is generalizable: https://github.com/apache/incubator-hudi/wiki/HashMap-Index

If someone here knows Flink well and can implement some standalone Flink code to mimic the tagLocation() functionality and share it with the group, that would be great.
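Such a standalone mimic could start as small as the following sketch. This is plain Java, not code from any Hudi module: the HashMap stands in for what would be a keyed MapState in a real Flink job (backed by heap or RocksDB), and all class, method, and encoding names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: a plain HashMap stands in for Flink's keyed MapState.
// In an actual Flink job this map would be a MapState<String, String> held in
// the configured state backend (e.g. RocksDB), keyed by record key.
class TagLocationSketch {
    // record key -> "partitionPath,fileId" (an illustrative encoding)
    private final Map<String, String> index = new HashMap<>();

    // Mimics tagLocation(): return the known location for a key,
    // or empty if the key has never been written (i.e. it is an insert).
    Optional<String> tag(String recordKey) {
        return Optional.ofNullable(index.get(recordKey));
    }

    // After a successful commit, remember where each key landed.
    void update(String recordKey, String location) {
        index.put(recordKey, location);
    }
}
```

Feeding a keyed stream of record keys through such state, emitting (record, location) pairs for known keys and untagged records for new ones, is essentially the lookup step that tagLocation() performs in the Spark DAG.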
Let's worry about performance once we have a Flink DAG. I think this is the critical and most tricky piece in supporting Flink.

On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil <vinay18.pa...@gmail.com> wrote:

Hi Taher,

I agree with this: if the state becomes too large, we should have an option of storing it in external state like a file system or RocksDB.

@Vinoth Chandar <vin...@apache.org> can the state of HoodieBloomIndex go beyond 10-15 GB?

Regards,
Vinay Patil

On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala <taher...@gmail.com> wrote:

Hey guys, any thoughts on the above idea? To handle HoodieBloomIndex with HeapState, RocksDBState, and FsState, but on Spark.
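The pluggable-backend idea being discussed could be captured behind one small interface. The sketch below is hypothetical (the interface, class names, and TTL handling are illustrative, not Hudi or Flink API): a heap-backed implementation is shown with a simple time-to-live, and RocksDB- or filesystem-backed versions would implement the same two methods.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch of the pluggable index-state idea. Only a heap version
// is shown; RocksDB- or FS-backed implementations would share the interface.
interface IndexState {
    void put(String recordKey, String location);
    String get(String recordKey); // null if unknown or expired
}

class HeapIndexState implements IndexState {
    private static class Entry {
        final String location;
        final long writtenAtMillis;
        Entry(String location, long writtenAtMillis) {
            this.location = location;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable, so tests can fake time

    HeapIndexState(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    @Override
    public void put(String recordKey, String location) {
        cache.put(recordKey, new Entry(location, clock.getAsLong()));
    }

    @Override
    public String get(String recordKey) {
        Entry e = cache.get(recordKey);
        if (e == null || clock.getAsLong() - e.writtenAtMillis > ttlMillis) {
            return null; // expired entries read as absent
        }
        return e.location;
    }
}
```

The TTL keeps the cached index from growing without bound, along the lines of the "index Time To Live" suggestion in this thread.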
On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala <taher...@gmail.com> wrote:

Hi Vinoth,

Having seen the doc and code, I understand the HoodieBloomIndex mainly caches the key and partition path. Can we address how Flink does it? Like, have HeapState where the user chooses to cache the index on heap, RocksDBState where indexes are written to RocksDB, and finally FsState where indexes can be written to HDFS, S3, or Azure FS. And on top, we can do an index Time To Live.

Regards,
Taher Koitawala

On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar
<vin...@apache.org> wrote:

I still feel the key thing here is reimplementing HoodieBloomIndex without needing Spark caching.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global) documents the Spark DAG in detail.

If everyone feels it's best for me to scope the work out, then happy to do it!

On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <taher...@gmail.com> wrote:
Guys, I think we are slowing down on this again. We need to start planning small tasks towards this. VC, please can you help fast track this?

Regards,
Taher Koitawala

On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar <vin...@apache.org> wrote:

Look forward to the analysis. A key class to read would be HoodieBloomIndex, which uses a lot of Spark caching and shuffles.

On Tue, Aug 13, 2019 at 7:52 PM vino yang <yanghua1...@gmail.com>
wrote:

> Currently Spark Streaming micro batching fits well with Hudi, since it amortizes the cost of indexing, workload profiling, etc. 1 Spark micro batch = 1 Hudi commit. With the per-record model in Flink, I am not sure how useful it will be to support Hudi. E.g., 1 input record cannot be 1 Hudi commit; it will be inefficient.

Yes, if 1 input record = 1 Hudi commit, it would be inefficient. About Flink streaming, we can also implement the "batch"
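The batching idea under discussion, buffering the per-record stream and making one commit per batch, might be sketched as follows. This is plain illustrative Java, not Hudi or Flink API: the count-based trigger is an assumption, and a real Flink job would more likely flush on timers or checkpoint boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: buffer per-record input and "commit" once per batch,
// so many input records amortize into a single commit, rather than one
// commit per record. All names and the count trigger are illustrative.
class MicroBatchSketch {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int commits = 0;

    MicroBatchSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    void onRecord(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            commit();
        }
    }

    private void commit() {
        // Stand-in for one Hudi commit covering the whole buffered batch.
        commits++;
        buffer.clear();
    }

    int commitCount() {
        return commits;
    }
}
```

With a batch size of 4, ten records produce two commits with two records still buffered, which is the "1 batch = 1 commit" behavior rather than "1 record = 1 commit".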