Re: [Discussion] Roadmap for Apache CarbonData 2
Hi,

Glad to see CarbonData entering the 2.x stage. I have the following suggestions for your consideration:

1. Evolution of the CarbonData file format. I have always thought that one of CarbonData's key highlights is its file format. Is any evolution planned for it? As CarbonData moves into broader application scopes, will the current file format still suit them well?

2. Performance commitment of CarbonData. It seems that CarbonData cares more about expanding its scope of application than about performance enhancement. What is the performance commitment of CarbonData 2 for data loading? Many enterprises do have big data, but it is not BIG enough to use cloud/data lake etc. For these scenarios, is CarbonData's performance clearly better than other file format + execution engine combinations? Do we have any plan for enhancement?

3. Smarter CarbonData. As we suggested earlier, is a CarbonData advisor on the roadmap? CarbonData has many features, but I notice that some of them are never used by users. While CarbonData will serve the AI scope, can it become smarter itself as well? The CarbonData advisor would be a DBA for CarbonData that monitors the workload, usage, and current performance and gives proper suggestions, or even performs the proper operations itself.
Re: [Discussion] Roadmap for Apache CarbonData 2
Hi Team,

It's great to see how CarbonData has grown and become popular over time. It was important to take a fresh look and come up with a roadmap for future needs. The CarbonData 2.0 proposal looks good, as we are trying to align it with the cloud, which will more or less be the prominent runtime environment in the near future. A lot of code refactoring will be required as per the roadmap. I would like to add a couple of points.

1. Complex type support: Although we do have complex type support, there is scope for improvement. Use cases for nested columns are growing extensively. We should work on improving the storage of nested columns and should also support creating compound/multi-column indexes on nested columns.

2. Feature code segregation and pluggability: The current code is tightly coupled. The ideal case would be to have a base and make all the features pluggable into it, but that will be hard to achieve. We can try segregation at the package level for major features, but for any new feature we should think in terms of pluggability.

[Clarification] Carbon UI: I did not understand the usage of the Carbon segment management UI. For the cloud scenario we would have to expose REST endpoints, which would make Carbon more like a microservice, and that does not fit the CarbonData use case. A UI/tool makes more sense for internal testing, but I am not sure how it would benefit the end user. Maybe a tool showing the data stored in each table would be more useful to the end user.

Regards,
Manish Gupta

On Tue, Aug 13, 2019 at 4:51 PM Kumar Vishal wrote:

> Hi Ravi,
>
> We can add the below requirements in 2.0:
>
> 1. Data loading performance improvement (need to analyze and improve).
> 2. Unified reading of the carbon data file. Currently data is read in
>    two parts, dimension and measure, which increases the number of IOs.
> 3. Carbon store size optimization (a PR has already been raised and needs
>    to be revisited); we can also explore further optimizations (like RLE
>    hybrid bit packing).
> 4. Presto enhancements (like write support, Presto SQL adaptation,
>    complex type read support).
> 5. Spark Data Source V2 integration.
> 6. Spatial index support.
>
> Regards,
> Kumar Vishal
Re: [Discussion] Roadmap for Apache CarbonData 2
Hi Ravi,

We can add the below requirements in 2.0:

1. Data loading performance improvement (need to analyze and improve).
2. Unified reading of the carbon data file. Currently data is read in two parts, dimension and measure, which increases the number of IOs.
3. Carbon store size optimization (a PR has already been raised and needs to be revisited); we can also explore further optimizations (like RLE hybrid bit packing).
4. Presto enhancements (like write support, Presto SQL adaptation, complex type read support).
5. Spark Data Source V2 integration.
6. Spatial index support.

Regards,
Kumar Vishal

On Thu, Jul 18, 2019 at 8:20 PM Ravindra Pesala wrote:

> Hi Kevin,
>
> Yes, we can improve it. The implementation is closely related to
> supporting pre-aggregate datamaps on streaming tables, which we
> implemented some time ago. The same will be reimplemented for the MV
> datamap soon as well.
>
> That implementation uses the pre-aggregate datamap for non-streaming
> segments and the main table for streaming segments. We update the query
> plan to do a union of both tables and query only the streaming segments
> of the main table.
>
> So in our case we can use the same approach: run a union query over the
> MV table and the main table (only the segments not yet loaded into the
> datamap) and execute it. We can definitely consider this after we
> support streaming tables for the MV datamap.
>
> Regards,
> Ravindra.
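To make the "RLE hybrid bit packing" idea in item 3 above more concrete, here is a minimal sketch of the general technique: long runs of a repeated value are stored as (value, count) pairs, and everything else is bit-packed at the smallest width that fits. This is not CarbonData's encoder; the names `encode`, `decode`, `pack`, `unpack` and the threshold `MIN_RUN` are hypothetical and chosen only for illustration, and the sketch assumes non-negative integers.

```python
from itertools import groupby

MIN_RUN = 8  # assumed threshold: shorter runs are bit-packed instead of RLE-encoded


def bit_width(values):
    """Smallest number of bits that can hold every value (at least 1)."""
    return max(1, max(values).bit_length())


def pack(values, width):
    """Pack non-negative ints into bytes, `width` bits each, little-endian."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * width)
    return acc.to_bytes((len(values) * width + 7) // 8, "little")


def unpack(data, width, count):
    """Inverse of pack(): read `count` values of `width` bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


def encode(values):
    """Return blocks: ('rle', value, count) or ('packed', width, count, bytes)."""
    out, literals = [], []

    def flush():
        # Emit any pending non-run values as one bit-packed block.
        if literals:
            w = bit_width(literals)
            out.append(("packed", w, len(literals), pack(literals, w)))
            literals.clear()

    for value, group in groupby(values):
        n = len(list(group))
        if n >= MIN_RUN:
            flush()
            out.append(("rle", value, n))
        else:
            literals.extend([value] * n)
    flush()
    return out


def decode(blocks):
    """Rebuild the original list of values from encode()'s blocks."""
    result = []
    for block in blocks:
        if block[0] == "rle":
            _, value, count = block
            result.extend([value] * count)
        else:
            _, width, count, data = block
            result.extend(unpack(data, width, count))
    return result
```

For example, `encode([7] * 20 + [1, 2, 3, 0, 1] + [5] * 10)` yields one RLE block, one 2-bit packed block of five values, and another RLE block, and `decode` restores the input. Whether such a scheme actually reduces store size depends on the data distribution, which is what the analysis mentioned in item 3 would need to establish.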
Re: [Discussion] Roadmap for Apache CarbonData 2
Hi Kevin,

Yes, we can improve it. The implementation is closely related to supporting pre-aggregate datamaps on streaming tables, which we implemented some time ago. The same will be reimplemented for the MV datamap soon as well.

That implementation uses the pre-aggregate datamap for non-streaming segments and the main table for streaming segments. We update the query plan to do a union of both tables and query only the streaming segments of the main table.

So in our case we can use the same approach: run a union query over the MV table and the main table (only the segments not yet loaded into the datamap) and execute it. We can definitely consider this after we support streaming tables for the MV datamap.

Regards,
Ravindra.

On Wed, 17 Jul 2019 at 07:55, kevinjmh wrote:

> Currently, a datamap in Carbon applies to all segments.
> The roadmap refers to commands like add/drop segment, and maybe also
> something about incremental loading for MV. For these scenarios, it
> would be better to make datamaps usable at the segment level, instead of
> disabling the datamap whenever datamap data is not ready for any one
> segment. This would also make datamaps fail-safe and enhance Carbon's
> stability.
> Maybe we can consider this as well.
>
> Regards
> Manhua

--
Thanks & Regards,
Ravi
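The union approach described in this message can be sketched as follows. This is a toy in-memory model, not CarbonData code: the dictionaries `main_table` and `datamap`, the function `sum_by_city`, and the sample rows are all hypothetical. It only illustrates answering an aggregate query from pre-aggregated data for the segments the datamap covers, and from the main table for the segments it does not yet cover.

```python
from collections import defaultdict

# Hypothetical stand-in for the main table: segment id -> rows of (city, amount).
main_table = {
    0: [("Pune", 10), ("Delhi", 5)],
    1: [("Pune", 7)],
    2: [("Delhi", 3), ("Pune", 1)],  # newly loaded segment, not in the datamap yet
}

# Hypothetical pre-aggregated sums, built only for segments 0 and 1 so far.
datamap = {
    0: {"Pune": 10, "Delhi": 5},
    1: {"Pune": 7},
}


def sum_by_city(main, agg):
    """SUM(amount) GROUP BY city, computed as a union of two branches."""
    totals = defaultdict(int)
    for seg_id, rows in main.items():
        if seg_id in agg:
            # Branch 1: the segment is covered, so read pre-aggregated values.
            for city, partial in agg[seg_id].items():
                totals[city] += partial
        else:
            # Branch 2: the segment is not covered, so scan main-table rows.
            for city, amount in rows:
                totals[city] += amount
    return dict(totals)


print(sum_by_city(main_table, datamap))
```

With the sample data, the result is the same whether the datamap covers all, some, or none of the segments, which is the property that lets the datamap stay enabled while some segments are still catching up.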
Re: [Discussion] Roadmap for Apache CarbonData 2
Currently, a datamap in Carbon applies to all segments. The roadmap refers to commands like add/drop segment, and maybe also something about incremental loading for MV. For these scenarios, it would be better to make datamaps usable at the segment level, instead of disabling the datamap whenever datamap data is not ready for any one segment. This would also make datamaps fail-safe and enhance Carbon's stability. Maybe we can consider this as well.

Regards
Manhua

---Original---
From: "Ravindra Pesala"
Date: Tue, Jul 16, 2019 22:31 PM
To: "dev";
Subject: [Discussion] Roadmap for Apache CarbonData 2

Hi Community,

Three years have passed since the launch of the Apache CarbonData project, and CarbonData has become a popular data management solution for various scenarios. As new workloads like AI and new runtime environments like the cloud are emerging quickly, I think we have reached a point where we need to discuss the future of CarbonData.

To bring CarbonData to a new level and satisfy those new requirements, Jacky and I drafted a roadmap for CarbonData 2 on the cwiki website.

- English Version: https://cwiki.apache.org/confluence/display/CARBONDATA/Apache+CarbonData+2+Roadmap+Proposal
- Chinese Version: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120737492

Please feel free to discuss the roadmap in this thread; we welcome all feedback to make CarbonData better.

Thanks and Regards,
Ravindra.