Hi Carbon Dev,

We faced slow SQL queries when applying CarbonData in our business systems.

In our customer info management system, there are only 2 million rows in carbondata.

We need to fetch data sorted by the customer's phone number field and paginate 
the data on the web page.

For example, when loading our system's index web page, it executes a SQL 
similar to the one below:

Select phone_number, other_field1, other_field2, other_field3, other_field4 
from custominfo order by phone_number limit 10;

In our prod env, it takes about 4 seconds to execute this SQL, which is slow 
for a system with only 2 million rows.


In another car info management system, there are 0.1 billion rows in carbondata. 
We need to fetch data sorted by the car id field and business date, and paginate 
the data on the web page.

A similar SQL is shown below:

Select car_id, business_date, other_field1, other_field2, other_field3, 
other_field4 from carinfo order by car_id, business_date limit 10;

I ran a test locally; it takes about 30 seconds to execute this SQL.

In short, the more data there is, the worse the performance, even though we 
only need the top 10 rows, because the order by operation uses a full scan.


Actually, a few months ago I came up with an optimization plan for 
order by + limit.
Please refer to

http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Optimize-Order-By-Limit-Query-td9764.html#a9860
 
Mainly, it pushes down order by + limit to the carbon scan and leverages the 
fact that sort columns are stored in sorted order to fetch the top N sorted 
rows from each block, reducing scan IO. From the test results, it can improve 
performance by about 80%.
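The per-block top-N plus merge idea above can be sketched roughly as below (a minimal Python sketch under simplifying assumptions, not CarbonData code; the blocks and the helper name are illustrative):

```python
import heapq

def topn_from_sorted_blocks(blocks, n):
    """Illustrative sketch: each block already stores rows sorted on the
    sort column, so we only take the first n rows of each block and do a
    k-way merge, instead of a full scan followed by a global sort."""
    # At most n candidate rows per block (early termination inside a block)
    per_block_topn = [block[:n] for block in blocks]
    # k-way merge of the sorted candidates, keep the global top n
    return list(heapq.merge(*per_block_topn))[:n]

# Toy data standing in for three sorted blocks of (phone_number, name) rows
blocks = [
    [("13000000001", "a"), ("13500000002", "b"), ("18900000003", "c")],
    [("13100000004", "d"), ("13200000005", "e")],
    [("15000000006", "f"), ("19900000007", "g")],
]
print(topn_from_sorted_blocks(blocks, 3))
```

With limit = 10 and three blocks, only 30 rows per node reach the merge step instead of every row in the table, which is where the IO and CPU saving comes from.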

In the previous discussion, we reached the below conclusions.

The optimization solution has the below limitations:

1.      It cannot work for dictionary columns, as there is no guarantee that 
dictionary allocation is in sorted order.

2.      It can only work when ordering by one dimension + limit, or by prefix 
columns of the MDK.

3.      It can't optimize order by measures + limit.

 
This time I think we can optimize the below cases; please give your suggestions:

1.      Order by prefix columns of sort columns with no dictionary encoding + limit

2.      Order by a dimension with no dictionary encoding + limit

for example, a date or string type column

3.      Order by a dimension with dictionary encoding + limit, if the table is 
created with pre-defined dictionaries

4.      Order by a number field which is added to sort columns + limit

for example, an int or bigint type column

 
5.      Order by a dimension with dictionary encoding + limit
       In this case, there is no guarantee that 
dictionary allocation is in sorted order.

So maybe we can compute the dictionary ids' order according to the original 
values' sort order in memory, and get the top N according to the original 
values' order.
For example, table t3 has a country field with dictionary encoding:

Country      Dictionary value

Australia    6

Canada       2

China        3

UK           4

USA          5

Blocklet 1 contains dict values from 2 to 3 and has 32000 rows.

Blocklet 2 contains dict values from 3 to 5 and has 32000 rows.

Blocklet 3 contains dict values from 6 to 6 and has 32000 rows.

For the below SQL, apparently we only need to process blocklet 3: Australia 
sorts first among the original values, and its dictionary value 6 appears only 
in blocklet 3, so we avoid the processing time for blocklet 1 and blocklet 2.

Select country, name from t3 order by country limit 100;
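The re-ranking and pruning idea in case 5 could look roughly like the sketch below (a toy Python sketch using the t3 example above; the `prune` helper and the simplified stopping rule are my assumptions, not an actual design):

```python
# Hypothetical sketch of case 5: re-rank dictionary ids by the original
# values' sort order, then visit blocklets whose dict-value ranges can
# contain the smallest-ranked values first.

dictionary = {"Australia": 6, "Canada": 2, "China": 3, "UK": 4, "USA": 5}

# rank[dict_id] = position of the original value in sorted order,
# computed in memory; Australia gets rank 0 even though its dict id is 6
rank = {d: i for i, (_, d) in enumerate(sorted(dictionary.items()))}

# (min_dict_value, max_dict_value, row_count) per blocklet, as would come
# from blocklet metadata; these match the example above
blocklets = [(2, 3, 32000), (3, 5, 32000), (6, 6, 32000)]

def prune(blocklets, rank, limit):
    """Visit blocklets ordered by the smallest original-value rank their
    dict-value range can contain, and stop once enough rows are collected.
    Simplification: assumes every dict id in [min, max] exists, and a real
    implementation would also compare the next blocklet's best rank
    against the current N-th candidate before stopping."""
    order = sorted(
        range(len(blocklets)),
        key=lambda i: min(rank[d] for d in range(blocklets[i][0], blocklets[i][1] + 1)),
    )
    chosen, rows = [], 0
    for i in order:
        chosen.append(i)
        rows += blocklets[i][2]
        if rows >= limit:
            break
    return chosen

print(prune(blocklets, rank, 100))  # → [2], i.e. only Blocklet 3
```

For the example query, blocklet 3's dict range {6} maps to rank 0 (Australia) and already holds 32000 rows, far more than the limit of 100, so blocklets 1 and 2 are never scanned.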

 
 
Thanks

 Jarck
