1. Some users want to use filters and plan to do so in the next few months. So besides avoiding loading the DataMap, we should also avoid deleting the DataMap and checking the schema folder in the SDK (a file-listing sketch follows the quoted mail below).
2. That's nice. When I call hasNext in the SDK, it sometimes takes a long time to read, and the default batch size is only 100 rows. What batch sizes will the columnar batch support? Can a columnar batch cover one whole block? Sometimes we want to read all the data at once (see the batch-size sketch below the quoted mail).
3. This will improve performance on multi-core machines. Can we keep the row sequence when we use multiple threads to read on different machines? (A concurrent-read sketch also follows below.)

------------------ Original ------------------
From: "Kunal Kapoor" <kunalkapoor...@gmail.com>
Send time: Monday, Oct 29, 2018 3:03 AM
To: "dev" <dev@carbondata.apache.org>
Subject: [Discussion] CarbonReader performance improvement

Hi All,

I would like to propose some improvements to the CarbonReader implementation to increase performance.

1. When a filter expression is not provided by the user, then instead of calling the getSplits method we can list the carbondata files and treat one file as one split. This would improve performance because the time spent loading the block/blocklet DataMap would be avoided.

2. Implement a vectorized reader and expose an API for the user to switch between CarbonReader and the vectorized reader. Additionally, an API would be provided for the user to extract the columnar batch instead of rows. This would allow the user to have a deeper integration with carbon. The reduction in method calls for the vector reader would also improve the read time.

3. Add concurrent reading functionality to CarbonReader. This can be enabled by passing the number of splits required by the user. If the user passes 2 as the split count for the reader, the user would be returned 2 CarbonReaders with an equal number of RecordReaders in each. The user can then run each CarbonReader instance in a separate thread to read the data concurrently.

The performance report will be shared soon.

Any suggestions from the community are greatly appreciated.

Thanks
Kunal Kapoor
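
For point 1 of the proposal, a rough sketch of the no-filter path as I understand it: list the .carbondata files directly and make one split per file, so the block/blocklet DataMap is never loaded. The class and method names below are illustrative assumptions, not existing CarbonData API:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class FileAsSplitSketch {
  // Illustrative only: one .carbondata file becomes one split, so no
  // getSplits() call and no block/blocklet DataMap load is needed.
  static List<String> listFileSplits(String tablePath) {
    List<String> splits = new ArrayList<>();
    File[] files = new File(tablePath)
        .listFiles((dir, name) -> name.endsWith(".carbondata"));
    if (files != null) {
      for (File f : files) {
        splits.add(f.getAbsolutePath()); // one file == one split
      }
    }
    return splits;
  }
}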
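On my question about the 100-row default in point 2: if the batch size stays configurable, I assume it would be raised through the existing property mechanism, something like the sketch below. The "carbon.detail.batch.size" constant is taken from the current core constants; treat it as an assumption if the vectorized reader ends up with its own setting:

import org.apache.carbondata.core.constants.CarbonCommonConstants;
import org.apache.carbondata.core.util.CarbonProperties;

public class BatchSizeSketch {
  public static void main(String[] args) {
    // Assumption: raise the per-call result batch from the 100-row default
    // so hasNext()/readNextRow() go to storage less often.
    CarbonProperties.getInstance()
        .addProperty(CarbonCommonConstants.DETAIL_QUERY_BATCH_SIZE, "32000");
    // ... then build and use CarbonReader as usual.
  }
}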
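And for point 3, a sketch of how I imagine the proposed API would be used from user code. The builder/hasNext/readNextRow calls are the existing SDK surface; the split(2) method is the proposal itself, so its name and signature are assumptions:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.carbondata.sdk.file.CarbonReader;

public class ConcurrentReadSketch {
  public static void main(String[] args) throws Exception {
    CarbonReader reader =
        CarbonReader.builder("/path/to/table", "_temp").build();
    // Proposed API (assumption): split one reader into 2 readers holding
    // an equal share of the underlying RecordReaders.
    List<CarbonReader> readers = reader.split(2);

    ExecutorService pool = Executors.newFixedThreadPool(readers.size());
    List<Future<Long>> counts = new ArrayList<>();
    for (CarbonReader r : readers) {
      counts.add(pool.submit(() -> {
        long rows = 0;
        while (r.hasNext()) {
          Object[] row = (Object[]) r.readNextRow();
          rows++;
        }
        r.close();
        return rows;
      }));
    }
    long total = 0;
    for (Future<Long> c : counts) {
      total += c.get();
    }
    pool.shutdown();
    System.out.println("rows read: " + total);
  }
}

Since each split reads a different set of files, rows come back in per-reader order only; keeping a global sequence across readers (or machines) would need an explicit sort key or some coordination, which is exactly my question above.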