Re: [DISCUSS] Decouple Hudi and Spark
Hi Vino, This is not a design for Hudi on Flink. This was simply a mock-up of the tagLocations() Spark cache mapped to Flink state, as Vinoth wanted to see. As for Flink batch and streaming, I am well aware of Flink's batch/stream unification efforts. However, I think that is still in progress, so for now let's stick to Spark for batch on Hudi and use Flink only for streaming. Regards, Taher Koitawala On Wed, Sep 25, 2019, 7:52 AM vino yang wrote: > Hi Taher, > > As I mentioned in the previous mail, things may not be as easy as just > using the Flink state API. > > Copied here: "Hudi can connect with many different Sources/Sinks. Some > file-based reads are not appropriate for Flink Streaming." > > Although unifying batch and streaming is Flink's goal, it is difficult > to ignore the Flink Batch API to match some features provided by Hudi now. > > The example you provided is at the application layer, about usage. So my > suggestion is to be patient; it needs time to produce a detailed design. > > Best, > Vino > > > > On 09/24/2019 22:38, Taher Koitawala wrote: > Hi All, > Sample code to see how record tagging will be handled in > Flink is posted at [1]. The main class to run it is MockHudi.java > with a sample path for checkpointing. > > As of now this is just a sample to show how we should be caching in Flink > state with bare-minimum configs. > > > In my experience I have cached around tens of TBs in Flink RocksDB > state > with the right configs. So I'm sure it should work here as well. > > 1: > > https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi > > Regards, > Taher Koitawala > > > On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar wrote: > > > It won't be much different from the HBaseIndex we have today. Would like > to > > always have an option like BloomIndex that does not need any > external > > dependencies. > > The moment you bring an external data store in, someone becomes a DBA. 
> :) > > > > On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng > > wrote: > > > > > @vc can you see how ApacheCrail could be used to implement this at > scale > > > but also in a way that abstracts over both Spark and Flink? > > > > > > "Crail Store implements a hierarchical namespace across a cluster of > RDMA > > > interconnected storage resources such as DRAM or flash" > > > > > > https://crail.incubator.apache.org/overview/ > > > > > > + 2 cents > > > https://twitter.com/semanticbeeng/status/1175767500790915072?s=20 > > > > > > Cheers > > > > > > Nick > > > > > > On September 22, 2019 at 9:28 AM Vinoth Chandar > > wrote: > > > > > > > > > It could be much larger. :) imagine billions of keys each 32 bytes, > > mapped > > > to another 32 byte > > > > > > The advantage of the current bloom index is that its effectively > stored > > > with data itself and this reduces complexity in terms of keeping index > > and > > > data consistent etc > > > > > > One orthogonal idea from long time ago that moves indexing out of data > > > storage and is generalizable > > > > > > https://github.com/apache/incubator-hudi/wiki/HashMap-Index > > > > > > If someone here knows flink well and can implement some standalone > flink > > > code to mimic tagLocation() functionality and share with the group, > that > > > would be great. Lets worry about performance once we have a flink DAG. > I > > > think this is a critical and most tricky piece in supporting flink. > > > > > > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil > > > wrote: > > > > > > Hi Taher, > > > > > > I agree with this , if the state is becoming too large we should have > an > > > option of storing it in external state like File System or RocksDb. > > > > > > @Vinoth Chandar can the state of HoodieBloomIndex > go > > > beyond 10-15 GB > > > > > > Regards, > > > Vinay Patil > > > > > > > > > > > > > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala > > > wrote: > > > > > > >> Hey Guys, Any thoughts on the above idea? 
To handle > HoodieBloomIndex > > > with > > > >> HeapState, RocksDBState and FsState but on Spark. > > > >> > > > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala > > > > >> wrote: > > > >> > > > >> > Hi Vinoth, > > > >> > Having seen the doc and code. I understand the > > > >> > HoodieBloomIndex mainly caches key and partition path. Can we > > address > > > >> how > > > >> > Flink does it? Like, have HeapState where the user chooses to > cache > > > the > > > >> > Index on heap, RockDBState where indexes are written to RocksDB > and > > > >> finally > > > >> > FsState where indexes can be written to HDFS, S3, Azure Fs. And > on > > > top, > > > >> we > > > >> > can do an index Time To Live. > > > >> > > > > >> > Regards, > > > >> > Taher Koitawala > > > >> > > > > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar < > vin...@apache.org> > > > >> wrote: > > > >> > > > > >> >> I still feel the key thing here is reimplementing > HoodieBloomIndex > > > >> without > > > >> >> needing
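The tagLocation() idea discussed above — cache record key to location mappings in Flink state instead of a Spark cache, then tag incoming records as updates or inserts — can be sketched in a language-neutral way. This is only an illustration under assumptions: a plain dict stands in for Flink's keyed MapState, and the names `tag_location`, `index_state`, and the record/location shapes are hypothetical, not Hudi's actual API.

```python
# Sketch of tagLocation() against a key -> location cache, as the
# Flink-state mock-up in the thread does. A dict stands in for Flink's
# per-key MapState; in a real job this would live in a keyed operator
# backed by a heap or RocksDB state backend.

# Hypothetical index state: record key -> (partition_path, file_id).
index_state = {
    "key-1": ("2019/09/25", "file-a"),
    "key-2": ("2019/09/25", "file-b"),
}

def tag_location(record):
    """Tag a record as an update (known location) or an insert."""
    loc = index_state.get(record["key"])
    if loc is not None:
        return {**record, "location": loc, "is_update": True}
    return {**record, "location": None, "is_update": False}

incoming = [{"key": "key-1", "value": 10}, {"key": "key-3", "value": 30}]
tagged = [tag_location(r) for r in incoming]
```

In an actual Flink job the lookup would happen inside a KeyedProcessFunction keyed by record key, so the state is partitioned and checkpointed for free.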
Re: [DISCUSS] Decouple Hudi and Spark
Hi Taher, As I mentioned in the previous mail, things may not be as easy as just using the Flink state API. Copied here: "Hudi can connect with many different Sources/Sinks. Some file-based reads are not appropriate for Flink Streaming." Although unifying batch and streaming is Flink's goal, it is difficult to ignore the Flink Batch API to match some features provided by Hudi now. The example you provided is at the application layer, about usage. So my suggestion is to be patient; it needs time to produce a detailed design. Best, Vino On 09/24/2019 22:38, Taher Koitawala wrote: Hi All, Sample code to see how record tagging will be handled in Flink is posted at [1]. The main class to run it is MockHudi.java with a sample path for checkpointing. As of now this is just a sample to show how we should be caching in Flink state with bare-minimum configs. In my experience I have cached around tens of TBs in Flink RocksDB state with the right configs. So I'm sure it should work here as well. 1: https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi Regards, Taher Koitawala On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar wrote: > It won't be much different from the HBaseIndex we have today. Would like to > always have an option like BloomIndex that does not need any external > dependencies. > The moment you bring an external data store in, someone becomes a DBA. :) > > On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng > wrote: > > > @vc can you see how Apache Crail could be used to implement this at scale > > but also in a way that abstracts over both Spark and Flink? 
> > > > "Crail Store implements a hierarchical namespace across a cluster of RDMA > > interconnected storage resources such as DRAM or flash" > > > > https://crail.incubator.apache.org/overview/ > > > > + 2 cents > > https://twitter.com/semanticbeeng/status/1175767500790915072?s=20 > > > > Cheers > > > > Nick > > > > On September 22, 2019 at 9:28 AM Vinoth Chandar > wrote: > > > > > > It could be much larger. :) imagine billions of keys each 32 bytes, > mapped > > to another 32 byte > > > > The advantage of the current bloom index is that its effectively stored > > with data itself and this reduces complexity in terms of keeping index > and > > data consistent etc > > > > One orthogonal idea from long time ago that moves indexing out of data > > storage and is generalizable > > > > https://github.com/apache/incubator-hudi/wiki/HashMap-Index > > > > If someone here knows flink well and can implement some standalone flink > > code to mimic tagLocation() functionality and share with the group, that > > would be great. Lets worry about performance once we have a flink DAG. I > > think this is a critical and most tricky piece in supporting flink. > > > > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil > > wrote: > > > > Hi Taher, > > > > I agree with this , if the state is becoming too large we should have an > > option of storing it in external state like File System or RocksDb. > > > > @Vinoth Chandar can the state of HoodieBloomIndex go > > beyond 10-15 GB > > > > Regards, > > Vinay Patil > > > > > > > > > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala > > wrote: > > > > >> Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex > > with > > >> HeapState, RocksDBState and FsState but on Spark. > > >> > > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala > > >> wrote: > > >> > > >> > Hi Vinoth, > > >> > Having seen the doc and code. I understand the > > >> > HoodieBloomIndex mainly caches key and partition path. 
Can we > address > > >> how > > >> > Flink does it? Like, have HeapState where the user chooses to cache > > the > > >> > Index on heap, RockDBState where indexes are written to RocksDB and > > >> finally > > >> > FsState where indexes can be written to HDFS, S3, Azure Fs. And on > > top, > > >> we > > >> > can do an index Time To Live. > > >> > > > >> > Regards, > > >> > Taher Koitawala > > >> > > > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar > > >> wrote: > > >> > > > >> >> I still feel the key thing here is reimplementing HoodieBloomIndex > > >> without > > >> >> needing spark caching. > > >> >> > > >> >> > > >> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global > > ) > > >> >> documents the spark DAG in detail. > > >> >> > > >> >> If everyone feels, it's best for me to scope the work out, then > happy > > >> to > > >> >> do > > >> >> it! > > >> >> > > >> >> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala < > taher...@gmail.com > > > > > >> >> wrote: > > >> >> > > >> >> > Guys I think we are slowing down on this again. We need to start > > >> >> planning > > >> >> > small small tasks towards this VC please can you help fast track > > >> this? > > >> >> > > > >> >> > Regards, > > >> >> > Taher Koitawala > > >> >> > > > >> >> > On Thu, Aug 15, 2019, 10:07 AM
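The "index Time To Live" idea in the quoted thread — evict index entries after a configurable TTL, regardless of whether the backing state is heap, RocksDB, or a filesystem — can be sketched as a toy cache. All names here are hypothetical illustrations; in a real Flink job, Flink's own StateTtlConfig would provide this natively on any state backend.

```python
import time

class TtlIndexCache:
    """Toy key -> location cache with per-entry TTL eviction,
    mimicking an index TTL layered over any state backend."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock            # injectable clock, for testing
        self._entries = {}            # key -> (location, insert_time)

    def put(self, key, location):
        self._entries[key] = (location, self.clock())

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        location, ts = entry
        if self.clock() - ts > self.ttl:
            del self._entries[key]    # expired: evict lazily on read
            return None
        return location

# Usage with a fake clock so the expiry is deterministic:
fake_now = [0.0]
cache = TtlIndexCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.put("key-1", ("2019/09/25", "file-a"))
assert cache.get("key-1") == ("2019/09/25", "file-a")
fake_now[0] = 120.0                   # advance past the TTL
assert cache.get("key-1") is None     # entry has expired
```

Lazy eviction on read keeps the sketch simple; a production index would also need periodic cleanup so expired entries do not accumulate in state.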
Re: FAQ page
I have read the FAQ page. It looks good to me. There are many valuable and high-frequency questions. I have a suggestion: besides Hudi, there are two other data lake projects, Iceberg and Delta Lake. It would be good if we could give some comparison between Hudi and them. It is a high-frequency question which has often been asked of me. What do you think? On 09/25/2019 01:16, Bhavani Sudha Saktheeswaran wrote: This is really cool. Thanks for putting this page together, Vinoth! On Tue, Sep 24, 2019 at 7:39 AM Nishith wrote: > The FAQ looks awesome Vinoth! Answers most of the questions that folks are > confused about. > Hoping folks can contribute more as we uncover more frequently asked > questions. > > - Nishith > > Sent from my iPhone > > > On Sep 23, 2019, at 5:51 PM, vino yang wrote: > > > > Thanks for your great work, Vinoth and Nishith. Will have a look soon. > On 09/24/2019 07:51, vbal...@apache.org wrote: +1 Awesome job Vinoth and > Nishith for compiling the initial version of the FAQ. Agree on the idea of > replying using the FAQ. Balaji.V On Monday, September 23, 2019, 04:41:03 PM > PDT, Vinoth Chandar wrote: First version of the > page is now fully completed. > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185 > Please try to use the FAQs when answering questions on ML and GH. It will > only get better if we manage this effectively and keep improving it. On > Sun, Sep 15, 2019 at 9:41 PM Vinoth Chandar wrote: > > Thanks! Will work this week to fill out most answers! > Your help reviewing > would also be much appreciated. > Will keep this thread posted.. > > On > Tue, Sep 10, 2019 at 6:10 PM vino yang wrote: > > >> Hi Vinoth, >> >> Great job! Thanks for your efforts! 
>> I think this page is good for users and developers to let them know Hudi >> well. >> >> Best, >> Vino >> >> >> >> Vinoth Chandar wrote on Wed, Sep 11, 2019 at 2:27 AM: >> >> > Hi all, >> > >> > I wrote a list of questions based on mailing list conversations and >> > issues. >> > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185 >> > >> > While I am still working through answers, I thought this can be a good >> > community-driven process. >> > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-ContributingtoFAQ >> > >> > Please help by contributing answers or new questions if you can! >> > >> > thanks >> > vinoth
Re: FAQ page
This is really cool. Thanks for putting this page together, Vinoth! On Tue, Sep 24, 2019 at 7:39 AM Nishith wrote: > The FAQ looks awesome Vinoth! Answers most of the questions that folks are > confused about. > Hoping folks can contribute more as we uncover more frequently asked > questions. > > - Nishith > > Sent from my iPhone > > > On Sep 23, 2019, at 5:51 PM, vino yang wrote: > > > > Thanks for your great work, Vinoth and Nishith. Will have a look soon. > On 09/24/2019 07:51, vbal...@apache.org wrote: +1 Awesome job Vinoth and > Nishith for compiling the initial version of the FAQ. Agree on the idea of > replying using the FAQ. Balaji.V On Monday, September 23, 2019, 04:41:03 PM > PDT, Vinoth Chandar wrote: First version of the > page is now fully completed. > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185 > Please try to use the FAQs when answering questions on ML and GH. It will > only get better if we manage this effectively and keep improving it. On > Sun, Sep 15, 2019 at 9:41 PM Vinoth Chandar wrote: > > Thanks! Will work this week to fill out most answers! > Your help reviewing > would also be much appreciated. > Will keep this thread posted.. > > On > Tue, Sep 10, 2019 at 6:10 PM vino yang wrote: > > >> Hi Vinoth, >> >> Great job! Thanks for your efforts! >> I think this page is good for users and developers to let them know Hudi >> well. >> >> Best, >> Vino >> >> >> >> Vinoth Chandar wrote on Wed, Sep 11, 2019 at 2:27 AM: >> >> > Hi all, >> > >> > I wrote a list of questions based on mailing list conversations and >> > issues. 
>> > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185 >> > >> > While I am still working through answers, I thought this can be a good >> > community-driven process. >> > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-ContributingtoFAQ >> > >> > Please help by contributing answers or new questions if you can! >> > >> > thanks >> > vinoth
Re: [DISCUSS] Decouple Hudi and Spark
Hi All, Sample code to see how record tagging will be handled in Flink is posted at [1]. The main class to run it is MockHudi.java with a sample path for checkpointing. As of now this is just a sample to show how we should be caching in Flink state with bare-minimum configs. In my experience I have cached around tens of TBs in Flink RocksDB state with the right configs. So I'm sure it should work here as well. 1: https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/org/apache/hudi Regards, Taher Koitawala On Sun, Sep 22, 2019, 7:34 PM Vinoth Chandar wrote: > It won't be much different from the HBaseIndex we have today. Would like to > always have an option like BloomIndex that does not need any external > dependencies. > The moment you bring an external data store in, someone becomes a DBA. :) > > On Sun, Sep 22, 2019 at 6:46 AM Semantic Beeng > wrote: > > > @vc can you see how Apache Crail could be used to implement this at scale > > but also in a way that abstracts over both Spark and Flink? > > > > "Crail Store implements a hierarchical namespace across a cluster of RDMA > > interconnected storage resources such as DRAM or flash" > > > > https://crail.incubator.apache.org/overview/ > > > > + 2 cents > > https://twitter.com/semanticbeeng/status/1175767500790915072?s=20 > > > > Cheers > > > > Nick > > > > On September 22, 2019 at 9:28 AM Vinoth Chandar > wrote: > > > > > > It could be much larger. 
:) imagine billions of keys each 32 bytes, > mapped > > to another 32 byte > > > > The advantage of the current bloom index is that its effectively stored > > with data itself and this reduces complexity in terms of keeping index > and > > data consistent etc > > > > One orthogonal idea from long time ago that moves indexing out of data > > storage and is generalizable > > > > https://github.com/apache/incubator-hudi/wiki/HashMap-Index > > > > If someone here knows flink well and can implement some standalone flink > > code to mimic tagLocation() functionality and share with the group, that > > would be great. Lets worry about performance once we have a flink DAG. I > > think this is a critical and most tricky piece in supporting flink. > > > > On Sat, Sep 21, 2019 at 4:17 AM Vinay Patil > > wrote: > > > > Hi Taher, > > > > I agree with this , if the state is becoming too large we should have an > > option of storing it in external state like File System or RocksDb. > > > > @Vinoth Chandar can the state of HoodieBloomIndex go > > beyond 10-15 GB > > > > Regards, > > Vinay Patil > > > > > > > > > On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala > > wrote: > > > > >> Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex > > with > > >> HeapState, RocksDBState and FsState but on Spark. > > >> > > >> On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala > > >> wrote: > > >> > > >> > Hi Vinoth, > > >> > Having seen the doc and code. I understand the > > >> > HoodieBloomIndex mainly caches key and partition path. Can we > address > > >> how > > >> > Flink does it? Like, have HeapState where the user chooses to cache > > the > > >> > Index on heap, RockDBState where indexes are written to RocksDB and > > >> finally > > >> > FsState where indexes can be written to HDFS, S3, Azure Fs. And on > > top, > > >> we > > >> > can do an index Time To Live. 
> > >> > > > >> > Regards, > > >> > Taher Koitawala > > >> > > > >> > On Mon, Sep 16, 2019 at 11:43 PM Vinoth Chandar > > >> wrote: > > >> > > > >> >> I still feel the key thing here is reimplementing HoodieBloomIndex > > >> without > > >> >> needing spark caching. > > >> >> > > >> >> > > >> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global > > ) > > >> >> documents the spark DAG in detail. > > >> >> > > >> >> If everyone feels, it's best for me to scope the work out, then > happy > > >> to > > >> >> do > > >> >> it! > > >> >> > > >> >> On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala < > taher...@gmail.com > > > > > >> >> wrote: > > >> >> > > >> >> > Guys I think we are slowing down on this again. We need to start > > >> >> planning > > >> >> > small small tasks towards this VC please can you help fast track > > >> this? > > >> >> > > > >> >> > Regards, > > >> >> > Taher Koitawala > > >> >> > > > >> >> > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar > > > >> >> wrote: > > >> >> > > > >> >> > > Look forward to the analysis. A key class to read would be > > >> >> > > HoodieBloomIndex, which uses a lot of spark caching and > shuffles. > > >> >> > > > > >> >> > > On Tue, Aug 13, 2019 at 7:52 PM vino yang < > yanghua1...@gmail.com > > > > > >> >> wrote: > > >> >> > > > > >> >> > > > >> Currently Spark Streaming micro batching fits well with > > Hudi, > > >> >> since > > >> >> > it > > >> >> > > > amortizes the cost of indexing, workload profiling etc. 1 > spark > > >> >> micro > > >> >> > > batch > > >> >> > > > = 1 hudi commit > > >> >> > > > With the
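Vinoth's sizing point in the quoted thread ("billions of keys each 32 bytes, mapped to another 32 byte") is easy to put numbers on, and answers Vinay's question about whether such an index fits in 10-15 GB of state. A back-of-the-envelope sketch (payload bytes only; real hash-map or RocksDB overhead only makes it larger):

```python
# Back-of-the-envelope index sizing from the figures in the thread:
# N record keys of 32 bytes each, mapped to a 32-byte location value.
def index_size_gb(num_keys, key_bytes=32, value_bytes=32):
    """Raw payload size of a key -> value index, in GiB."""
    return num_keys * (key_bytes + value_bytes) / 1024**3

one_billion = 1_000_000_000
size = index_size_gb(one_billion)   # roughly 60 GiB of raw payload
```

So at one billion keys the raw payload alone is around 60 GiB, well past the 10-15 GB range mentioned, which supports the suggestion to keep an option like BloomIndex that stores index information alongside the data.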
Re: [PROPOSAL] Hudi Web UI
Thanks for doing it! Will review sometime this week. On Mon, Sep 23, 2019 at 5:43 PM vino yang wrote: > Thanks Taher, great job! Will have another look soon. Best, Vino On > 09/24/2019 02:15, Taher Koitawala wrote: Hi All, The HIP has been > migrated to Confluence. Please take a look. > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233 > Regards, Taher Koitawala On Mon, Sep 23, 2019 at 10:17 PM Taher Koitawala < > taher...@gmail.com> wrote: > Yup, got it. Thanks Vinoth > > On Mon, Sep > 23, 2019, 10:02 PM Vinoth Chandar wrote: > >> Now? >> > >> On Mon, Sep 23, 2019 at 9:05 AM Taher Koitawala > >> wrote: >> >> > Hey Vinoth, still the same. No perms. >> > >> > On Mon, Sep > 23, 2019 at 9:24 PM Vinoth Chandar >> wrote: >> > >> > > That's correct. You should have perms now. >> > > >> > > On Sun, Sep 22, 2019 at 11:06 PM Taher Koitawala >> > > wrote: >> > > > >> > > > Hi Vinoth, >> > > > I do not have access on Confluence to create a page in the >> > > > HIP section or even copy the HIP template. Please can you give access? My >> > > > id is taherk77. >> > > > >> > > > Regards, >> > > > Taher Koitawala >> > > > >> > > > On Sun, Sep 22, 2019 at 6:34 PM Vinoth Chandar >> > > wrote: >> > > > >> > > > > Taher, can we please move the HIP to the cWiki space as documented >> > > > > here: >> > > > > https://cwiki.apache.org/confluence/display/HUDI/Hudi+Improvement+Plan+Details+and+Process >> > > > > >> > > > > Would love to take a pass at it. This will definitely improve >> > > > > usability.. >> > > > > >> > > > > @leesf I think we can use the standalone timeline server bundle you >> > > > > worked on, and have a central timeline server for all Hudi jobs within >> > > > > your org.. Then, if you host a UI endpoint on that server, we can visualize >> > > > > much of what Taher put up in the doc. Thoughts? 
>> > > > > On Sun, Sep 22, 2019 at 4:07 AM Taher Koitawala <taher...@gmail.com> >> > > > > wrote: >> > > > > > Hi Leesf, Thank you for your interest. The HIP has already been implemented >> > > > > > in terms of design and the components we need to see. The link is given >> > > > > > below. Please feel free to chime in, to add and implement. >> > > > > > https://docs.google.com/document/d/1oEjukuaK2ltqiD0sjVs5IUzvDzilWF0viwMASXTdEd4/edit?usp=sharing >> > > > > > Regards, >> > > > > > Taher Koitawala >> > > > > > On Sun, Sep 22, 2019, 3:26 PM leesf wrote: >> > > > > > > +1 to this valuable HIP. >> > > > > > > And I am also very interested; perhaps we can implement this HIP together. >> > > > > > > Best, >> > > > > > > Leesf >> > > > > > > Taher Koitawala wrote on Sun, Sep 22, 2019 at 2:09 PM: >> > > > > > > > Guys, it not only includes table views and admin kind of views >> > > > > > > > (compactions), but it also includes 'metadata lineage', which can help >> > > > > > > > users know how this Hudi dataset merged; another strong feature is >> > > > > > > > creating DeltaStreamer jobs through the web UI and having DeltaStreamer >> > > > > > > > templates (makes sharing jobs easy). I think those are really >> > > > > > > > strong features. >> > > > > > > > Regards, >> > > > > > > > Taher Koitawala >> > > > > > > > On Sun, Sep 22, 2019, 9:52 AM Bhavani Sudha Saktheeswaran >> > > > > > > > wrote: >> > > > > > > > > +1 for adding a web UI. The web UI viz for table configs would be >> > > > > > > > > pretty useful for easy debugging. >> > > > > > > > > On Sat, Sep 21, 2019 at 7:35 PM Vinoth Chandar <vin...@apache.org> >> > > > > > > > > wrote: >> > > > > > > > > > +1, will take a look at the doc for specifics in a few days. >> > > > > > > > > > On Sat, Sep 21, 2019 at 7:18 PM vino yang <yanghua1...@gmail.com> >> > > > > > > > > > wrote: >> > > > > > > > > > > +1 to introduce a Hudi web UI. Great suggestion! On 09/21/2019 >> > > > > > > > > > > 12:24, Minh Pham wrote: +1. I think an admin UI will help with >> > > > > > > > > > > reusability a lot. On Fri, Sep 20, 2019 at 8:32 PM Vinay Patil < >> > > > > > > > > > > vinay18.pa...@gmail.com> wrote: >> > > > > > > > > > > > Hi Taher, I really liked this idea; these details will be
Re: [DISCUSS] Hudi with Nifi
Sounds good, let's capture these discussions in the JIRA (a link to the discussion thread should suffice) and we can revisit them one by one.. On Mon, Sep 23, 2019 at 8:31 PM Taher Koitawala wrote: > Sure Vinoth, I think we need to try this out and check how it fits together > and how deployable it is. > > On Sun, Sep 22, 2019, 7:01 PM Vinoth Chandar wrote: > > > See a lot of Spark Streaming receiver-based approach code there, which > > makes me a bit worried about scalability. > > > > Nonetheless, API-wise can't we just do dstream.rdd.foreach and issue > these > > writes using the WriteClient API? > > > > On Sat, Sep 21, 2019 at 4:16 AM Taher Koitawala > > wrote: > > > > > Hi Vinoth, > > > Nifi has the capability to pass data to a custom Spark > > job. > > > However that is done through a StreamingContext; not sure if we can > build > > > something on this. I'm trying to wrap my head around how to fit the > > > StreamingContext into our existing code. > > > > > > Here is an example: > > > https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark > > > > > > Regards, > > > Taher Koitawala > > > > > > On Wed, Sep 18, 2019, 8:27 PM Vinoth Chandar > wrote: > > > > > > > Not too familiar with Nifi myself. Would this still target a use-case > > > like > > > > what Pratyaksh mentioned? > > > > For DeltaStreamer specifically, we are moving more and more towards > > > > continuous mode, where > > > > Hudi writing and compaction are managed by a single long-running Spark > > > > application. > > > > > > > > Would Nifi also help us manage compactions when working with the Hudi > > > > datasource, or just writing plain Spark Hudi pipelines? > > > > > > > > On 2019/09/18 08:18:44, Taher Koitawala wrote: > > > > > That's another way of doing things. I want to know if someone wrote > > > > > something like PutParquet, which can directly write data to Hudi. > > > AFAIK I > > > > > don't think anyone has. > > > > > > > > > > That will really be powerful. 
> > > > > > > > > > On Wed, Sep 18, 2019, 1:37 PM Pratyaksh Sharma < > > pratyaks...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hi Taher, > > > > > > > > > > > > In the initial phase of our CDC pipeline, we were using Hudi with > > > Nifi. > > > > > > Nifi was being used to read Binlog file of mysql and to push that > > > data > > > > to > > > > > > some Kafka topic. This topic was then getting consumed by > > > > DeltaStreamer. So > > > > > > Nifi was indirectly involved in that flow. > > > > > > > > > > > > On Wed, Sep 18, 2019 at 10:29 AM Taher Koitawala < > > taher...@gmail.com > > > > > > > > > > wrote: > > > > > > > > > > > > > Hi All, > > > > > > > Just wanted to know has anyone tried to write data to > > > Hudi > > > > > > with a > > > > > > > Nifi flow? > > > > > > > > > > > > > > Perhaps may be just a csv file on local to Hudi dataset? If not > > > then > > > > lets > > > > > > > try that! > > > > > > > > > > > > > > Regards, > > > > > > > Taher Koitawala > > > > > > > > > > > > > > > > > > > > > > > > > > > >
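Vinoth's suggestion in the thread above — iterate the stream's micro-batches and issue writes through the write-client API, rather than a receiver-based integration — can be sketched abstractly. Everything below is a hypothetical stand-in for illustration: `HoodieWriteClientStub`, the batch shapes, and `process_micro_batches` are not Hudi's, Spark's, or Nifi's real APIs.

```python
# Sketch of the "dstream.foreachRDD -> write client" idea: each micro-batch
# from the upstream flow (e.g. records Nifi pushed to Kafka) is handed to a
# write client as one commit. The client here is a stub for illustration.

class HoodieWriteClientStub:
    def __init__(self):
        self.commits = []  # list of (commit_time, records) pairs

    def upsert(self, records, commit_time):
        # A real write client would tag, index, and write files here;
        # the stub just records one commit per batch.
        self.commits.append((commit_time, list(records)))
        return len(records)

def process_micro_batches(batches, client):
    """Mimics foreachRDD: one commit per micro-batch."""
    written = 0
    for i, batch in enumerate(batches):
        written += client.upsert(batch, commit_time=f"commit-{i}")
    return written

client = HoodieWriteClientStub()
total = process_micro_batches([[{"k": 1}], [{"k": 2}, {"k": 3}]], client)
```

The one-commit-per-micro-batch shape matches the "1 spark micro batch = 1 hudi commit" amortization point made elsewhere in these threads.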
Re: Field not found in record HoodieException
Taher, Sorry I got a bit delayed. I have now put everything you may need in a gist at: https://gist.github.com/smdahmed/3af0e3110e07cf76772bb73d5e9b65e2 Note that I am still on 0.4.6, so you may need to swap com.uber.hoodie with the right org.apache.hudi etc. And I am still on the RDD-based implementation. But I can assure you that if you swap the code with a DataFrame-based implementation, it will still work the same. If you are looking for a DataFrame-based implementation, look at the code sample at: https://github.com/apache/incubator-hudi/issues/859#issuecomment-527316262 You will see the following in the gist: a code sample to generate parquet; Hive table creation and addition of partitions; and Spark shell based code that is in line with what you had needed. If you want any changes to be made, please do not hesitate; I can modify the code and am able to spin tests for you. But I can assure you that this will work and, to the best of my belief, this is what you had aimed to achieve. Thanks, Kabeer. On Sep 18 2019, at 5:13 pm, Taher Koitawala wrote: > Hi Kabeer, > Really appreciate the help. Take your time, nothing urgent. > > Regards, > Taher Koitawala > > On Wed, Sep 18, 2019, 9:38 PM Kabeer Ahmed wrote: > > Taher, > > I have half-baked code for the test. 
I shall complete it and test it and > > revert back to you - latest by the weekend. Please bear with me. If it is super > > urgent or you are really stuck, then let me know. > > Thanks, > > On Sep 18 2019, at 7:27 am, Gary Li wrote: > > > I think we can also try to find if there is any illegal character that > > > could mess up the Avro schema in the column, like a standalone "/" or "." > > > > > > On Tue, Sep 17, 2019 at 8:35 PM Vinoth Chandar wrote: > > > > [Orthogonal comment] It's so awesome to see us troubleshooting together.. > > > > Thanks everyone on this thread! > > > > On Tue, Sep 17, 2019 at 8:04 PM Taher Koitawala wrote: > > > > > No, there are no nulls in the data and I am getting the same error. > > > > > On Wed, Sep 18, 2019, 3:33 AM Kabeer Ahmed wrote: > > > > > > Taher - did you find any NULLs in the data? If you are still not able to > > > > > > make progress, let us know. > > > > > > On Sep 17 2019, at 8:30 am, Taher Koitawala wrote: > > > > > > > Sure Gary, let me check if I can find any nulls in there. > > > > > > > On Tue, Sep 17, 2019 at 1:28 AM Gary Li <yanjia.gary...@gmail.com> wrote: > > > > > > > > Hello, I have seen this exception before. In my case, if the precombine key > > > > > > > > of one entry is null, then I will have this error. I'd recommend checking > > > > > > > > if any row has a null in *last_update.* > > > > > > > > Best, > > > > > > > > Gary > > > > > > > > On Mon, Sep 16, 2019 at 12:32 PM Kabeer Ahmed <kab...@linuxmail.org> wrote: > > > > > > > > > Taher, let me spin a test for you to test a similar scenario and let me > > > > > > > > > revert back to you. > > > > > > > > > On Sep 16 2019, at 2:09 pm, Taher Koitawala <taher...@gmail.com> wrote: > > > > > > > > > > Hi Kabeer, the hive table has everything as a string. However when > > > > > > > > > > fetching data, the spark query is > > > > > > > > > > .sql(String.format("select contact_id,country,cast(last_update as