Hi,

A simple example: in the Hudi project you can find many code snippets like `spark.read().format().load()`. The load method can take any path, including DFS paths. But if we only want to use Flink streaming, there is currently no good way to read from HDFS. In addition, we also need to consider other capability gaps between Flink and Spark. Note that the Spark API (non-Structured-Streaming mode) supports both streaming (micro-batch) and batch, whereas Flink splits them across two different APIs with different feature sets.

On 09/25/2019 13:15, Semantic Beeng wrote:

Hi Vino,

Would you be kind enough to start a
wiki page to discuss this deep understanding of the functionality and design of Hudi? There you can put git links (https://github.com/ben-gibson/GitLink for IntelliJ) and design knowledge so we can discuss in context.

I am exploring the approach from this retweet https://twitter.com/semanticbeeng/status/1176241250967666689?s=20 and need this understanding you have.

"difficult to ignore the Flink Batch API to match some features provided by Hudi now" - can you please post some gitlinks about this there?

Thanks

Nick

On September 24, 2019 at 10:22 PM vino yang
<[email protected]> wrote:

Hi Taher,

As I mentioned in the previous mail, things may not be that easy with just the Flink state API. Copied here: "Hudi can connect with many different sources/sinks. Some file-based reads are not appropriate for Flink Streaming."

Although unifying batch and streaming is Flink's goal, it is difficult to ignore the Flink Batch API to match some features provided by Hudi now. The example you provided is at the application/usage layer. So my suggestion is to be patient; it will take time to produce a detailed design.

Best,
Vino

On 09/24/2019 22:38, Taher Koitawala wrote:

Hi All,
Sample code to see how record tagging will be handled in Flink is posted at [1]. The main class to run it is MockHudi.java, with a sample path for checkpointing. As of now this is just a sample to show how we could be caching in Flink state with bare-minimum configs. In my experience I have cached around tens of TBs in Flink RocksDB state with the right configs, so I'm sure it should work here as well.

1: https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
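As a rough illustration of what such tagging state holds, here is a minimal, stdlib-only sketch. The class and method names are made up for illustration and are not taken from the linked sample; a real Flink job would use a keyed MapState on the RocksDB state backend rather than a plain HashMap:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Flink keyed state: maps a Hudi record key to the
// file location it was last written to. In an actual Flink job this
// map would be a MapState backed by the RocksDB state backend.
class MockRecordIndex {
    private final Map<String, String> keyToLocation = new HashMap<>();

    // Tag an incoming record: if the key was seen before, return its
    // existing location (the record is an update); otherwise remember
    // the proposed location (the record is an insert).
    String tag(String recordKey, String proposedLocation) {
        String existing = keyToLocation.get(recordKey);
        if (existing != null) {
            return existing; // update routed to the old file
        }
        keyToLocation.put(recordKey, proposedLocation);
        return proposedLocation; // insert routed to the new file
    }
}
```

The point of the sketch is only the lookup semantics: a hit in the state means "update, keep the old location", a miss means "insert, record the new one".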
Regards,
Taher Koitawala

On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar
<[email protected]> wrote:

It won't be much different from the HBaseIndex we have today. I would like to always have an option like BloomIndex that does not need any external dependencies. The moment you bring an external data store in, someone becomes a DBA. :)

On Sun, Sep 22, 2019 at 6:46 AM
Semantic Beeng <[email protected]> wrote:

@vc can you see how Apache Crail could be used to implement this at scale, but also in a way that abstracts over both Spark and Flink?

"Crail Store implements a hierarchical namespace across a cluster of RDMA interconnected storage resources such as DRAM or flash"

https://crail.incubator.apache.org/overview/

+ 2 cents https://twitter.com/semanticbeeng/status/1175767500790915072?s=20

Cheers

Nick

On September 22, 2019 at 9:28 AM Vinoth Chandar
<[email protected]> wrote:

It could be much larger. :) Imagine billions of keys, each 32 bytes, mapped to another 32 bytes.

The advantage of the current bloom index is that it's effectively stored with the data itself, and this reduces complexity in terms of keeping index and data consistent, etc.

One orthogonal idea from a long time ago that moves indexing out of data storage and is generalizable: https://github.com/apache/incubator-hudi/wiki/HashMap-Index

If someone here knows Flink well and can implement some standalone Flink code to mimic the tagLocation() functionality and share it with the group, that would be great. Let's worry about performance once we have a Flink DAG. I think this is a critical and the most tricky piece in supporting Flink.

On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil <[email protected]> wrote:

Hi
Taher,

I agree with this; if the state is becoming too large we should have an option of storing it in external state like a file system or RocksDB.

@Vinoth Chandar <[email protected]> can the state of HoodieBloomIndex go beyond 10-15 GB?

Regards,
Vinay Patil

On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala <[email protected]> wrote:
> > > >> Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex >
> with > > >> HeapState, RocksDBState and FsState but on Spark. > > >> > > >>
On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala <[email protected]> > > >>
wrote:

Hi Vinoth,

Having seen the doc and code, I understand that HoodieBloomIndex mainly caches the key and partition path. Can we address how Flink does it? Like, have HeapState where the user chooses to cache the index on heap, RocksDBState where indexes are written to RocksDB, and finally FsState where indexes can be written to HDFS, S3, or Azure FS. And on top, we can do an index time-to-live.

Regards,
Taher Koitawala

On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar
<[email protected]> wrote:

I still feel the key thing here is reimplementing HoodieBloomIndex without needing Spark caching.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global) documents the Spark DAG in detail.

If everyone feels it's best for me to scope the work out, then happy to do it!

On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <[email protected]> wrote:
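The thread is truncated here, but to make the BloomIndex/tagLocation() discussion above concrete: the per-file check is a Bloom-filter membership test, where a rejection means the key is definitely not in that file and the file can be skipped without reading it. Below is a toy, stdlib-only sketch of that test; the sizes, hash scheme, and names are arbitrary illustrations and are not Hudi's actual implementation:

```java
import java.util.BitSet;

// Toy Bloom filter illustrating the check tagLocation() performs per file:
// a key the filter rejects is definitely absent, so the file can be skipped;
// a key that passes still needs a real lookup (false positives are possible).
class ToyBloomFilter {
    private static final int SIZE = 1 << 16; // 65536 bits, arbitrary
    private final BitSet bits = new BitSet(SIZE);

    // Two cheap, decorrelated hash positions derived from String.hashCode().
    private int[] positions(String key) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 15) ^ 0x9E3779B9;
        return new int[] { Math.floorMod(h1, SIZE), Math.floorMod(h2, SIZE) };
    }

    void add(String key) {
        for (int p : positions(key)) bits.set(p);
    }

    // false => key is definitely not in the file; true => key may be present
    boolean mightContain(String key) {
        for (int p : positions(key)) if (!bits.get(p)) return false;
        return true;
    }
}
```

In the tagging DAG, only the keys that pass a file's filter need the expensive key lookup against that file, which is what makes the index cheap enough to store alongside the data.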