[ 
https://issues.apache.org/jira/browse/CARBONDATA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarck updated CARBONDATA-754:
-----------------------------
    Request participants:   (was: )
             Description: 
Currently the performance of an ORDER BY dimension query is very bad if there is no 
filter, or if the filtered data is still too large. 
If I am not wrong, it reads all related data at the carbon scan physical level, 
decodes the sort dimension's data, and sorts all of it in the Spark SQL Sort 
physical plan.

I think we can optimize as below:

1. push down sort (+ limit) to the carbon scan 

2. leverage the fact that dimensions are stored in natural sorted order at the 
blocklet level to get sorted data in each partition

3. implement merge-sort/TopN in Spark's Sort physical plan
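Step 3 could be sketched roughly as below. This is a standalone illustration, not CarbonData code: the `topN` helper and the per-partition iterators are hypothetical, and it assumes each partition already yields values sorted ascending by the dimension (step 2), so a bounded max-heap merge replaces the full sort and each sorted partition can be abandoned early:

```scala
import scala.collection.mutable.PriorityQueue

// Hypothetical sketch: TopN merge over per-partition sorted iterators.
object TopNMerge {
  // Returns the n smallest values, ascending, using a bounded max-heap.
  def topN(partitions: Seq[Iterator[Int]], n: Int): Seq[Int] = {
    val heap = PriorityQueue.empty[Int] // max-heap: root is largest of current top-n
    for (part <- partitions) {
      var done = false
      while (part.hasNext && !done) {
        val v = part.next()
        if (heap.size < n) heap.enqueue(v)
        else if (v < heap.head) { heap.dequeue(); heap.enqueue(v) }
        // The partition is sorted ascending, so once v cannot enter the
        // heap, nothing smaller follows: stop scanning this partition.
        else done = true
      }
    }
    heap.dequeueAll.reverse // dequeueAll is largest-first; reverse to ascending
  }
}
```

With limit 10000 this touches at most a few heap entries per partition after warm-up, which is consistent with the sub-second result reported below.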

Actually, I have already optimized "order by only 1 dimension + limit" based on 
branch 0.2, and the performance is much better: sorting by 1 dimension with 
limit 10000 on 100 million rows takes less than 1 second to fetch and print 
the result.







> order by query's performance is very bad
> ----------------------------------------
>
>                 Key: CARBONDATA-754
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-754
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: core, spark-integration
>            Reporter: Jarck
>            Assignee: Jarck
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
