Hi,

A simple example: in the Hudi project you can find many code snippets like `spark.read().format().load()`. The load method can take any path, including DFS paths. But if we only want to use Flink streaming, there is currently no good way to read from HDFS. In addition, we also need to consider other capability gaps between Flink and Spark. Note that the Spark API (non-Structured-Streaming mode) supports both streaming (micro-batch) and batch, whereas Flink splits them across two different APIs with different feature sets.

On 09/25/2019 13:15, Semantic Beeng wrote:

Hi Vino,

Would you be kind enough to start a
wiki page to discuss this deep understanding of the functionality and design of Hudi? There you can put git links (https://github.com/ben-gibson/GitLink for IntelliJ) and design knowledge so we can discuss in context.

I am exploring the approach from this retweet https://twitter.com/semanticbeeng/status/1176241250967666689?s=20 and need this understanding you have.

"difficult to ignore the Flink Batch API to match some features provided by Hudi now" - can you please post some gitlinks about this there?

Thanks

Nick

On September 24, 2019 at 10:22 PM vino yang
<[email protected]> wrote:

Hi Taher,

As I mentioned in the previous mail, things may not be that easy with just the Flink state API. Copied here: "Hudi can connect with many different sources/sinks. Some file-based reads are not appropriate for Flink Streaming."

Although unifying batch and streaming is Flink's goal, it is difficult to ignore the Flink Batch API to match some features provided by Hudi now. The example you provided is at the application/usage layer. So my suggestion is to be patient; it will take time to produce a detailed design.

Best,
Vino

On 09/24/2019 22:38, Taher Koitawala wrote:

Hi All,
Sample code to see how record tagging will be handled in Flink is posted at [1]. The main class to run it is MockHudi.java, with a sample path for checkpointing. As of now this is just a sample to show how we could be caching in Flink state with bare-minimum configs. In my experience I have cached around tens of TBs in Flink RocksDB state with the right configs, so I'm sure it should work here as well.

1: https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi
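As a rough illustration of what such tagging state holds, here is a minimal, stdlib-only sketch. The class and method names are made up for illustration and are not taken from the linked sample; a real Flink job would use a keyed MapState on the RocksDB state backend rather than a plain HashMap:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Flink keyed state: maps a Hudi record key to the
// file location it was last written to. In an actual Flink job this
// map would be a MapState backed by the RocksDB state backend.
class MockRecordIndex {
    private final Map<String, String> keyToLocation = new HashMap<>();

    // Tag an incoming record: if the key was seen before, return its
    // existing location (the record is an update); otherwise remember
    // the proposed location (the record is an insert).
    String tag(String recordKey, String proposedLocation) {
        String existing = keyToLocation.get(recordKey);
        if (existing != null) {
            return existing; // update routed to the old file
        }
        keyToLocation.put(recordKey, proposedLocation);
        return proposedLocation; // insert routed to the new file
    }
}
```

The point of the sketch is only the lookup semantics: a hit in the state means "update, keep the old location", a miss means "insert, record the new one".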
Regards,
Taher Koitawala

On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar
<[email protected]> wrote:

It won't be much different from the HBaseIndex we have today. I would like to always have an option like BloomIndex that does not need any external dependencies. The moment you bring an external data store in, someone becomes a DBA. :)

On Sun, Sep 22, 2019 at 6:46 AM
Semantic Beeng <[email protected]> wrote:

@vc can you see how Apache Crail could be used to implement this at scale, but also in a way that abstracts over both Spark and Flink?

"Crail Store implements a hierarchical namespace across a cluster of RDMA interconnected storage resources such as DRAM or flash"

https://crail.incubator.apache.org/overview/

+ 2 cents https://twitter.com/semanticbeeng/status/1175767500790915072?s=20

Cheers

Nick

On September 22, 2019 at 9:28 AM Vinoth Chandar
<[email protected]> wrote:

It could be much larger. :) Imagine billions of keys, each 32 bytes, mapped to another 32 bytes.

The advantage of the current bloom index is that it's effectively stored with the data itself, and this reduces complexity in terms of keeping index and data consistent, etc.

One orthogonal idea from a long time ago that moves indexing out of data storage and is generalizable: https://github.com/apache/incubator-hudi/wiki/HashMap-Index

If someone here knows Flink well and can implement some standalone Flink code to mimic the tagLocation() functionality and share it with the group, that would be great. Let's worry about performance once we have a Flink DAG. I think this is a critical and the most tricky piece in supporting Flink.

On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil <[email protected]> wrote:

Hi
Taher,

I agree with this; if the state is becoming too large we should have an option of storing it in external state like a file system or RocksDB.

@Vinoth Chandar <[email protected]> can the state of HoodieBloomIndex go beyond 10-15 GB?

Regards,
Vinay Patil

On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala <[email protected]> wrote:
> > > >> Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex >
> with > > >> HeapState, RocksDBState and FsState but on Spark. > > >> > > >>
On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala <[email protected]> > > >>
wrote:

Hi Vinoth,

Having seen the doc and code, I understand that HoodieBloomIndex mainly caches the key and partition path. Can we address how Flink does it? Like, have HeapState where the user chooses to cache the index on heap, RocksDBState where indexes are written to RocksDB, and finally FsState where indexes can be written to HDFS, S3, or Azure FS. And on top, we can do an index time-to-live.

Regards,
Taher Koitawala

On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar
<[email protected]> wrote:

I still feel the key thing here is reimplementing HoodieBloomIndex without needing Spark caching.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global) documents the Spark DAG in detail.

If everyone feels it's best for me to scope the work out, then happy to do it!

On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala <[email protected]> wrote:
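The thread is truncated here, but to make the BloomIndex/tagLocation() discussion above concrete: the per-file check is a Bloom-filter membership test, where a rejection means the key is definitely not in that file and the file can be skipped without reading it. Below is a toy, stdlib-only sketch of that test; the sizes, hash scheme, and names are arbitrary illustrations and are not Hudi's actual implementation:

```java
import java.util.BitSet;

// Toy Bloom filter illustrating the check tagLocation() performs per file:
// a key the filter rejects is definitely absent, so the file can be skipped;
// a key that passes still needs a real lookup (false positives are possible).
class ToyBloomFilter {
    private static final int SIZE = 1 << 16; // 65536 bits, arbitrary
    private final BitSet bits = new BitSet(SIZE);

    // Two cheap, decorrelated hash positions derived from String.hashCode().
    private int[] positions(String key) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 15) ^ 0x9E3779B9;
        return new int[] { Math.floorMod(h1, SIZE), Math.floorMod(h2, SIZE) };
    }

    void add(String key) {
        for (int p : positions(key)) bits.set(p);
    }

    // false => key is definitely not in the file; true => key may be present
    boolean mightContain(String key) {
        for (int p : positions(key)) if (!bits.get(p)) return false;
        return true;
    }
}
```

In the tagging DAG, only the keys that pass a file's filter need the expensive key lookup against that file, which is what makes the index cheap enough to store alongside the data.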