Re: Re: Re: Re: Optimize Order By + Limit Query

2017-03-30 Thread Lu Cao
@Liang, yes, I'm currently working on the limit query optimization.
I get the limited dictionary values and convert them into a filter condition
in the CarbonOptimizer step.
It would definitely improve query performance in some scenarios.
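
To illustrate the idea, here is a rough sketch only; `sortedDictionaryValues`
and the surrounding names are hypothetical stand-ins, not the actual
CarbonOptimizer API. Because a dictionary-encoded dimension keeps its distinct
values in sorted order, an ORDER BY <dictCol> LIMIT n (ascending) can only
return rows whose value is among the n smallest distinct values, so those
values can be turned into a pre-scan filter:

```scala
// Sketch: derive a filter condition from the dictionary for
// "SELECT ... ORDER BY dictCol LIMIT n" (ascending order assumed).
// The filter only prunes data; the final sort + limit still runs on
// the reduced set.
object LimitToFilterSketch {

  // Hypothetical stand-in for a Carbon dictionary lookup: the distinct
  // values of a dimension column, already stored in sorted order.
  def sortedDictionaryValues(column: String): Seq[String] =
    column match {
      case "country" => Seq("china", "france", "india", "uk", "usa")
      case _         => Seq.empty
    }

  // The n smallest rows can only carry the n smallest distinct values,
  // so an IN filter on those values is a safe superset filter.
  def limitFilter(orderByCol: String, limit: Int): String = {
    val candidates = sortedDictionaryValues(orderByCol).take(limit)
    candidates.map(v => s"'$v'").mkString(s"$orderByCol IN (", ", ", ")")
  }

  def main(args: Array[String]): Unit =
    // ORDER BY country LIMIT 2 only needs 'china' and 'france' rows.
    println(limitFilter("country", 2)) // country IN ('china', 'france')
}
```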


Re: Re: Re: Re: Optimize Order By + Limit Query

2017-03-30 Thread Liang Chen
Hi

+1 for simafengyun's optimization, it looks good to me.

I propose to do "limit" pushdown first, similar to filter pushdown. What is
your opinion? @simafengyun

For "order by" pushdown, let us work out an ideal solution that considers all
aggregation push-down cases. Ravindra's comment is reasonable: we need to
consider decoupling Spark and CarbonData, otherwise the maintenance cost might
be high if computing work is done on both sides, because we need to keep
leveraging Spark's computing capability as it evolves from version to version.

Regards
Liang



Re: Re: Re: Optimize Order By + Limit Query

2017-03-29 Thread 马云
Hi Ravindran,

Yes, Carbon does the sorting when the order-by column is not the first column,
but that sort is very efficient, because the dimension data inside a blocklet
is stored already sorted, so Carbon can use merge sort + top-N to get N rows
from each block. In addition, the biggest difference is that it reduces disk
IO, since limit n can be used to cut down the number of blocklets that have to
be read. If you only apply Spark's top-N, I don't think you can reach the
performance shown below; that is impossible without reducing disk IO.

[Attached screenshot with the performance comparison: 未命名2.jpg]
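
A minimal, self-contained sketch of this merge sort + top-N argument follows;
it is plain Scala with Ints standing in for the sorted dimension values inside
each blocklet, and none of these names are CarbonData classes:

```scala
// Sketch of "merge sort + top-N": read at most n values per blocklet,
// then merge the per-blocklet runs, truncated to n.
object BlockletTopN {

  // A blocklet is internally sorted, so its n smallest values are
  // simply its first n entries: reading can stop after n values.
  def topNPerBlocklet(blocklet: Seq[Int], n: Int): Seq[Int] =
    blocklet.take(n)

  // Merge the per-blocklet candidate runs, keeping the global top n.
  def globalTopN(blocklets: Seq[Seq[Int]], n: Int): Seq[Int] =
    blocklets.map(topNPerBlocklet(_, n))
      .foldLeft(Seq.empty[Int])((acc, run) => merge(acc, run, n))

  // Standard two-way merge of sorted runs, truncated to n elements.
  private def merge(a: Seq[Int], b: Seq[Int], n: Int): Seq[Int] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[Int]
    var (i, j) = (0, 0)
    while (out.length < n && (i < a.length || j < b.length)) {
      if (j >= b.length || (i < a.length && a(i) <= b(j))) { out += a(i); i += 1 }
      else { out += b(j); j += 1 }
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    val blocklets = Seq(Seq(1, 4, 9, 12), Seq(2, 3, 10), Seq(5, 6, 7, 8))
    println(globalTopN(blocklets, 5)) // the five smallest: 1, 2, 3, 4, 5
  }
}
```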
At 2017-03-30 03:12:54, "Ravindra Pesala"  wrote:
>Hi,
>
>You mean Carbon does the sorting when the order-by column is not the first
>column and provides only the limit values to Spark. But Spark is already
>doing the same job: it sorts each partition and takes the top values out of
>it. You can reduce table_blocksize to get better sort performance, since
>Spark tries to do the sorting in memory.
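
For reference, a hedged example of the table_blocksize knob mentioned here,
written against CarbonData 1.x-era DDL; the t3 schema is taken from the test
further below, and the exact property syntax may differ by version:

```scala
// Prints a CREATE TABLE statement that sets table_blocksize (in MB) to
// a smaller value, so each block (and hence each Spark partition) is
// small enough to be sorted in memory. Running the DDL would require a
// Carbon-enabled SparkSession; this object only assembles the string.
object BlockSizeExample {
  val ddl: String =
    """CREATE TABLE t3 (
      |  name STRING, serialname STRING, country STRING,
      |  salary INT, id INT, date TIMESTAMP
      |) STORED BY 'carbondata'
      |TBLPROPERTIES ('table_blocksize'='128')""".stripMargin

  def main(args: Array[String]): Unit = println(ddl)
}
```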
>
>I can see we can do some optimizations in the integration layer itself,
>without pushing any logic down to Carbon. For example, if the order-by column
>is the first column, then we can just take the limit values without sorting
>any data.
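
A toy model of that integration-layer rewrite, using a stand-in plan ADT
rather than Spark's Catalyst classes: when the sort key equals the first MDK
column, the scan already emits rows in that order, so the Sort node can be
dropped and only the Limit kept.

```scala
// A stand-in logical-plan ADT, small enough to show the rewrite; these
// are not Spark Catalyst or CarbonData classes.
sealed trait Plan
case class CarbonScan(mdkColumns: Seq[String]) extends Plan
case class Sort(column: String, child: Plan) extends Plan
case class Limit(n: Int, child: Plan) extends Plan

object DropRedundantSort {
  // If the sort key is the first MDK column, the scan already returns
  // rows in that order, so the Sort node is redundant: keep only Limit.
  def apply(plan: Plan): Plan = plan match {
    case Limit(n, Sort(col, scan @ CarbonScan(Seq(first, _*)))) if col == first =>
      Limit(n, scan)
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val plan = Limit(1000, Sort("country", CarbonScan(Seq("country", "name"))))
    println(DropRedundantSort(plan)) // Limit(1000, CarbonScan(...))
  }
}
```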
>
>Regards,
>Ravindra.
>
>On 29 March 2017 at 08:58, 马云  wrote:
>
>> Hi Ravindran,
>> Thanks for your quick response. Please see my answers below.
>>
>> "What if the order by column is not the first column? It needs to scan all
>> blocklets to get the data out of it if the order by column is not the
>> first column of the MDK."
>>
>> Answer: if step 2 doesn't filter out any blocklet, you are right: it needs
>> to scan all blocklets when the order-by column is not the first column of
>> the MDK. But it only scans all of the order-by column's data; for the
>> other columns it uses a lazy-load strategy, which reduces the scan
>> according to the limit value. Hence the performance is much better after
>> my optimization. Currently CarbonData's order by + limit performance is
>> very bad, since it scans all the data: in my test with 20,000,000 rows it
>> takes more than 10s, and if the data is much larger, I think it is hard
>> for users to put up with such bad performance on an order by + limit
>> query.
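
A minimal, self-contained sketch of the lazy-load strategy described in that
answer; plain Scala, with an illustrative Row shape rather than CarbonData
code. Only the order-by column is scanned in full, and the remaining columns
are materialized just for the limit-n winning rows:

```scala
// Plain-Scala sketch of lazy column loading: scan only the order-by
// column to find the top-n row ids, then fetch the other columns for
// just those rows.
object LazyColumnLoad {

  case class Row(country: String, name: String, salary: Int)

  def topNLazy(rows: Vector[Row], n: Int): Vector[Row] = {
    // Pass 1: read only the sort column, keeping (value, rowId) pairs.
    val winners = rows.indices
      .map(i => (rows(i).country, i))
      .sorted            // order by the single dimension
      .take(n)           // only the limit-n smallest survive
      .map(_._2)
    // Pass 2: materialize the remaining columns only for the winners.
    winners.map(rows(_)).toVector
  }

  def main(args: Array[String]): Unit = {
    val data = Vector(
      Row("india", "a", 10), Row("china", "b", 20),
      Row("usa", "c", 30), Row("china", "d", 40))
    // Prints the two rows with the smallest country values (the two
    // "china" rows); rows "a" and "c" are never fully materialized.
    println(topNLazy(data, 2))
  }
}
```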
>>
>>
>> "We used to have multiple push down optimizations from spark to carbon
>> like aggregation, limit, topn etc. But later it was removed because it is
>> very hard to maintain for version to version. I feel it is better that
>> execution engine like spark can do these type of operations."
>>
>> Answer: In my opinion, I don't think "hard to maintain from version to
>> version" is a good reason to give up the order by + limit optimization. We
>> can create a new class that extends the current one, to limit the impact
>> on the existing code; that might keep it easy to maintain. Maybe I am
>> wrong.
>>
>>
>>
>>
>> At 2017-03-29 02:21:58, "Ravindra Pesala"  wrote:
>>
>>
>> Hi Jarck Ma,
>>
>> It is great to try optimizing CarbonData, but I think this solution comes
>> with many limitations. What if the order-by column is not the first
>> column? Then it needs to scan all blocklets to get the data out, since the
>> order-by column is not the first column of the MDK.
>>
>> We used to have multiple push-down optimizations from Spark to Carbon,
>> like aggregation, limit, top-N, etc., but they were later removed because
>> they are very hard to maintain from version to version. I feel it is
>> better that an execution engine like Spark does these types of operations.
>>
>>
>> Regards,
>> Ravindra.
>>
>>
>>
>> On Tue, Mar 28, 2017, 14:28 马云  wrote:
>>
>>
>> Hi Carbon Dev,
>>
>> I have now implemented the optimization for ordering by one dimension.
>>
>> My local performance test results are below. Please give your suggestions.
>>
>> | data count | test sql | limit value in sql | optimized code (ms) | original code (ms) |
>> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY country limit 1000 | 1000 | 677 | 10906 |
>> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY serialname limit 1 | 1 | 1897 | 12108 |
>> | 20,000,000 | SELECT name, serialname, country, salary, id, date FROM t3 ORDER BY serialname limit 5 | 5 | 2814 | 14279 |
>>
>> My optimization solution for order by one dimension + limit is as below.
>>
>> Mainly, it filters out unnecessary blocklets and leverages the fact that
>> dimension data is stored in sorted order to get sorted data from each
>> partition.
>>
>> At last it uses TakeOrderedAndProject to merge the sorted data from the
>> partitions.
>>
>> step1. Change the logical plan to push the order-by and limit information
>> down to the Carbon scan, and change the Sort physical plan to
>> TakeOrderedAndProject
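
A toy sketch of step 1, with hypothetical node names rather than the real
CarbonData or Spark classes: the order-by column and limit are recorded on the
scan node, which can then use them to skip blocklets, while a Limit stays on
top for the final TakeOrderedAndProject-style merge.

```scala
// Stand-in logical nodes for step 1; illustrative only.
sealed trait LogicalNode
case class Scan(table: String,
                pushedOrderBy: Option[String] = None,
                pushedLimit: Option[Int] = None) extends LogicalNode
case class SortNode(column: String, child: LogicalNode) extends LogicalNode
case class LimitNode(n: Int, child: LogicalNode) extends LogicalNode

object PushDownOrderByLimit {
  // Fold Sort + Limit into the scan so it can skip blocklets; a Limit
  // stays on top because the per-partition results still need the
  // final merge (TakeOrderedAndProject in the physical plan).
  def apply(plan: LogicalNode): LogicalNode = plan match {
    case LimitNode(n, SortNode(col, scan: Scan)) =>
      LimitNode(n, scan.copy(pushedOrderBy = Some(col), pushedLimit = Some(n)))
    case other => other
  }

  def main(args: Array[String]): Unit =
    println(apply(LimitNode(1000, SortNode("country", Scan("t3")))))
}
```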