Hi Carbon Dev,

We faced slow SQL queries when applying CarbonData in our business systems.

In our customer info management system, there are only 2 million rows in carbondata.

We need to fetch data sorted by the customer's phone number field and paginate 
the data on the web page.

For example, when loading our system's index web page, it executes a SQL 
similar to the one below:

Select phone_number, other_field1, other_field2, other_field3, other_field4 
from custominfo order by phone_number limit 10;

In our prod env, it takes about 4 seconds to execute this SQL, which is slow 
for a system with only 2 million rows.


In another car info management system, there are 0.1 billion rows in carbondata. 
We need to fetch data sorted by the car id field and business date, and paginate 
the data on the web page.

A similar SQL is shown below:

Select car_id, business_date, other_field1, other_field2, other_field3, 
other_field4 from carinfo order by car_id, business_date limit 10;

I ran a test locally; it takes about 30 seconds to execute this SQL.

In short, the more data there is, the worse the performance, even though we 
only need the top 10 rows, because the order by operation uses a full scan.


Actually, a few months ago I came up with an optimization plan for 
order by + limit.
Please refer to

http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Optimize-Order-By-Limit-Query-td9764.html#a9860
 
Mainly, it pushes down order by + limit to the carbon scan and leverages the 
fact that sort columns are stored in sorted order to fetch the top N sorted 
rows from each block, reducing scan IO. From the test results, it can improve 
performance by about 80%.
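The per-block top-N plus merge idea above can be sketched roughly as below (a minimal Python sketch under simplifying assumptions, not CarbonData code; the blocks and the helper name are illustrative):

```python
import heapq

def topn_from_sorted_blocks(blocks, n):
    """Illustrative sketch: each block already stores rows sorted on the
    sort column, so we only take the first n rows of each block and do a
    k-way merge, instead of a full scan followed by a global sort."""
    # At most n candidate rows per block (early termination inside a block)
    per_block_topn = [block[:n] for block in blocks]
    # k-way merge of the sorted candidates, keep the global top n
    return list(heapq.merge(*per_block_topn))[:n]

# Toy data standing in for three sorted blocks of (phone_number, name) rows
blocks = [
    [("13000000001", "a"), ("13500000002", "b"), ("18900000003", "c")],
    [("13100000004", "d"), ("13200000005", "e")],
    [("15000000006", "f"), ("19900000007", "g")],
]
print(topn_from_sorted_blocks(blocks, 3))
```

With limit = 10 and three blocks, only 30 rows per node reach the merge step instead of every row in the table, which is where the IO and CPU saving comes from.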

In the previous discussion, we reached the below conclusions.

The optimization solution has the below limitations:

1.      It cannot work for dictionary columns, as there is no guarantee that 
dictionary allocation is in sorted order.

2.      It can only work when ordering by one dimension + limit, or by prefix 
columns of the MDK.

3.      It can't optimize order by measures + limit.

 
This time I think we can optimize the below cases; please give your suggestions:

1.      Order by prefix columns of sort columns with no dictionary encoding + limit

2.      Order by a dimension with no dictionary encoding + limit

for example, a date or string type column

3.      Order by a dimension with dictionary encoding + limit, if the table is 
created with pre-defined dictionaries

4.      Order by a number field which is added to sort columns + limit

for example, an int or bigint type column

 
5.      Order by a dimension with dictionary encoding + limit
       In this case, there is no guarantee that 
dictionary allocation is in sorted order.

So maybe we can compute the dictionary ids' order according to the original 
values' sort order in memory, and get the top N according to the original 
values' order.
For example, table t3 has a country field with dictionary encoding:

Country      Dictionary value

Australia    6

Canada       2

China        3

UK           4

USA          5

Blocklet 1 contains dict values from 2 to 3 and has 32000 rows.

Blocklet 2 contains dict values from 3 to 5 and has 32000 rows.

Blocklet 3 contains dict values from 6 to 6 and has 32000 rows.

For the below SQL, apparently we only need to process blocklet 3: Australia 
sorts first among the original values, and its dictionary value 6 appears only 
in blocklet 3, so we avoid the processing time for blocklet 1 and blocklet 2.

Select country, name from t3 order by country limit 100;
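The re-ranking and pruning idea in case 5 could look roughly like the sketch below (a toy Python sketch using the t3 example above; the `prune` helper and the simplified stopping rule are my assumptions, not an actual design):

```python
# Hypothetical sketch of case 5: re-rank dictionary ids by the original
# values' sort order, then visit blocklets whose dict-value ranges can
# contain the smallest-ranked values first.

dictionary = {"Australia": 6, "Canada": 2, "China": 3, "UK": 4, "USA": 5}

# rank[dict_id] = position of the original value in sorted order,
# computed in memory; Australia gets rank 0 even though its dict id is 6
rank = {d: i for i, (_, d) in enumerate(sorted(dictionary.items()))}

# (min_dict_value, max_dict_value, row_count) per blocklet, as would come
# from blocklet metadata; these match the example above
blocklets = [(2, 3, 32000), (3, 5, 32000), (6, 6, 32000)]

def prune(blocklets, rank, limit):
    """Visit blocklets ordered by the smallest original-value rank their
    dict-value range can contain, and stop once enough rows are collected.
    Simplification: assumes every dict id in [min, max] exists, and a real
    implementation would also compare the next blocklet's best rank
    against the current N-th candidate before stopping."""
    order = sorted(
        range(len(blocklets)),
        key=lambda i: min(rank[d] for d in range(blocklets[i][0], blocklets[i][1] + 1)),
    )
    chosen, rows = [], 0
    for i in order:
        chosen.append(i)
        rows += blocklets[i][2]
        if rows >= limit:
            break
    return chosen

print(prune(blocklets, rank, 100))  # → [2], i.e. only Blocklet 3
```

For the example query, blocklet 3's dict range {6} maps to rank 0 (Australia) and already holds 32000 rows, far more than the limit of 100, so blocklets 1 and 2 are never scanned.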

 
 
Thanks

 Jarck
