Hi,

Problem:
The first query on a carbon table is very slow, because the driver has to read many small carbonindex files and cache them the first time.

Many carbonindex files are created in two cases.

Case 1: Loading data in a large cluster
For example, if the cluster has 100 nodes, each load creates 100 index files in its segment. After 100 loads, the number of carbonindex files reaches 10000.

Case 2: Frequent loads
For example, if a load happens every 5 minutes, there will be more than 10000 index files after 10 days, even in a 4-node cluster.
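The numbers in the two cases can be checked with a quick calculation. This is only a sketch in Python: it assumes one carbonindex file per node per load, as in the cases described, and the function names are made up for illustration.

```python
def loads_in_period(days: int, interval_minutes: int) -> int:
    """Number of loads when one load runs every `interval_minutes`."""
    return days * 24 * 60 // interval_minutes


def index_file_count(nodes: int, loads: int) -> int:
    """Assumption: every load writes one carbonindex file per node."""
    return nodes * loads


# Case 1: 100-node cluster, 100 loads.
case1 = index_file_count(nodes=100, loads=100)

# Case 2: 4-node cluster, one load every 5 minutes for 10 days.
case2_loads = loads_in_period(days=10, interval_minutes=5)
case2 = index_file_count(nodes=4, loads=case2_loads)

print(case1)        # 10000
print(case2_loads)  # 2880
print(case2)        # 11520
```

So the small cluster with frequent loads ends up with even more index files than the large cluster after 100 loads.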
Reading all these files from the driver is slow because it involves a lot of namenode calls and IO operations.

Solution:
Merge the carbonindex files at two levels, so that we reduce the IO calls to the namenode and improve read performance.

Level 1: Merge within a segment.
Merge the carbonindex files of a segment into a single file immediately after the load completes. It would be named as a .carbonindexmerge file. It is not a true data merge but a simple file merge, so the current structure of the carbonindex files does not change. While reading, we read just one file instead of many carbonindex files within the segment.

Level 2: Merge across segments.
The already merged carbonindex files of each segment are merged again once a configurable number of segments is reached. These files are placed under the metadata folder of the table, and the information about them is updated in the table status file. While reading the carbonindex files, we first check the tablestatus for the availability of a merged file and read using the information available in it.
For example, if the configured number of segments to merge index files across is 100, then for every 100 segments one new merged index file is created under the metadata folder, and the tablestatus entries of these 100 segments are updated with the information of this file. This file is not updatable, and it is removed only when all the segments of this merged index file are removed. This file is also a simple file merge, not an actual data merge. By default this is disabled, and the user can enable it from the carbon properties.

There is also an issue in the driver cache for old segments. It is not necessary to cache the old segments if the queries are not interested in them. I will start another discussion for this cache issue.

--
Thanks & Regards
Ravindra
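P.S. To make the "simple file merge" idea concrete, here is a rough sketch in Python. It is not the actual implementation and makes several assumptions: the container layout (name length, name, content length, content), the merged file name `segment.carbonindexmerge`, the sample index file names, and all function names are hypothetical. Only the .carbonindexmerge extension and the threshold of 100 come from the proposal. The point is that each carbonindex file is stored byte for byte, so its structure does not change, and that level 2 only acts on full groups of segments.

```python
import struct
import tempfile
from pathlib import Path
from typing import Dict, List


def merge_index_files(segment_dir: Path,
                      merged_name: str = "segment.carbonindexmerge") -> Path:
    """Level 1: copy every *.carbonindex file of a segment into one file.

    Hypothetical entry layout: 4-byte name length, name,
    8-byte content length, unmodified content.
    """
    merged_path = segment_dir / merged_name
    index_files = sorted(segment_dir.glob("*.carbonindex"))
    with merged_path.open("wb") as out:
        for index_file in index_files:
            name = index_file.name.encode("utf-8")
            data = index_file.read_bytes()
            out.write(struct.pack(">I", len(name)))
            out.write(name)
            out.write(struct.pack(">Q", len(data)))
            out.write(data)
    return merged_path


def read_merged_index(merged_path: Path) -> Dict[str, bytes]:
    """Read one merged file and return the original index files by name."""
    entries: Dict[str, bytes] = {}
    with merged_path.open("rb") as src:
        while True:
            header = src.read(4)
            if not header:  # end of file
                break
            (name_len,) = struct.unpack(">I", header)
            name = src.read(name_len).decode("utf-8")
            (data_len,) = struct.unpack(">Q", src.read(8))
            entries[name] = src.read(data_len)
    return entries


def group_segments(segment_ids: List[str], threshold: int) -> List[List[str]]:
    """Level 2: form one group per `threshold` segments; leftovers wait."""
    full = len(segment_ids) - len(segment_ids) % threshold
    return [segment_ids[i:i + threshold] for i in range(0, full, threshold)]


# Small demonstration with two made-up index files in a temporary segment.
with tempfile.TemporaryDirectory() as tmp:
    segment = Path(tmp)
    (segment / "0_batchno0.carbonindex").write_bytes(b"index-a")
    (segment / "1_batchno0.carbonindex").write_bytes(b"index-b")
    merged = merge_index_files(segment)
    restored = read_merged_index(merged)

# 250 segments with a threshold of 100 give two full groups; 50 remain.
groups = group_segments([str(i) for i in range(250)], threshold=100)
```

In this sketch a reader opens one file per segment and still gets each original carbonindex file back unchanged, which is the property level 1 relies on.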