[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-11 Thread GitBox


sundy-li edited a comment on pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#issuecomment-797294233


   Hi, all.  From other user's advice, partial_sort could be 
   ```
   pub fn partial_sort(v:  [T], limit: usize, mut is_less: F)
   where
   F: FnMut(, ) -> Ordering,
   {
   let (before, _mid, _after) = v.select_nth_unstable_by(limit,  
is_less);
   before.sort_unstable_by(is_less);
   }
   ```
   
   So I abandon to use of custom partial_sort. This could fix the issue and 
risk we are arguing about.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-06 Thread GitBox


sundy-li edited a comment on pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#issuecomment-792216219


   > For example, using a priority queue to keep only the top k values in 
memory.
   
   Yes, but lots of codes may duplicate with sort kernel. partial_sort used 
priority queue inside. It maybe good to do sorting in pipeline OLAP systems 
   
   In ClickHouse,  PartialSortingTransform(Each block in each thread) --> 
MergeSortingTransform(Blocks to one block in each thread) --> 
MergingSortedTransform(N Block in N Thread to one block) . 
   
   ```
   ┌─explain┐
   │ (Expression)   │
   │ ExpressionTransform│
   │   (Limit)  │
   │   Limit│
   │ (MergingSorted)│
   │ MergingSortedTransform 16 → 1  │
   │   (MergeSorting)   │
   │   MergeSortingTransform × 16   │
   │ (PartialSorting)   │
   │ LimitsCheckingTransform × 16   │
   │   PartialSortingTransform × 16 │
   │ (Expression)   │
   │ ExpressionTransform × 16   │
   │   (SettingQuotaAndLimits)  │
   │ (ReadFromStorage)  │
   │ NumbersMt × 16 0 → 1   │
   └┘
   ```
   
   > I just feel that the sheer amount of uncommented unsafe on it is an 
indication that there is some opportunity to revisit and further improve the 
crate, thereby reducing the risks of undefined behavior (UB).
   
   
   @alamb @jorgecarleitao  Thanks for all your reviews. I also have the 
consideration about unsafe codes in partial_sort may break `arrow`, because it 
was just created, without any used in production(BTW I am new to rust).
   
   We can keep this MR open currently until you think it's safe enough or must 
have it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-01 Thread GitBox


sundy-li edited a comment on pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#issuecomment-788550187


   I removed `pdqsort`, because the performance didn't show any improvement 
during the benches in my pc.
   Added a separate function `partial_sort `as @jorgecarleitao suggested.
   
   I did not find any other `partial_sort` function in Rust like 
std::partial_sort in c++.  So I created 
[one](https://github.com/sundy-li/partial_sort).
   
   And the [benche results about partial_sort in 
arrow](https://github.com/apache/arrow/blob/dc3167e3826177af1a6feec516a528a4ee38c674/rust/arrow/benches/sort_kernel.rs#L76-L106)
 are:

   
   ```
   sort 2^12 time:   [753.58 us 755.43 us 758.14 us]
   sort nulls 2^12   time:   [633.41 us 635.28 us 637.51 us]
   
   sort 2^12 limit 10time:   [49.246 us 49.820 us 50.667 us]
   sort 2^12 limit 100   time:   [115.11 us 116.26 us 117.76 us]
   sort 2^12 limit 1000  time:   [645.91 us 654.36 us 663.78 us]
   sort nulls 2^12 limit 10  time:   [66.283 us 66.725 us 67.347 us]
   sort nulls 2^12 limit 100 time:   [76.281 us 77.907 us 79.783 us]
   sort nulls 2^12 limit 1000time:   [258.98 us 260.32 us 262.24 us]
   
   
   ```
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-01 Thread GitBox


sundy-li edited a comment on pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#issuecomment-788550187


   I removed `pdqsort`, because the performance didn't show any improvement 
during the benches in my pc.
   Added a separate function `partial_sort `as @jorgecarleitao suggested.
   
   I did not find any other `partial_sort` function in Rust like 
std::partial_sort in c++.  So I created 
[one](https://github.com/sundy-li/partial_sort).
   
   And the [benche results about partial_sort in 
arrow](https://github.com/apache/arrow/blob/f19fc1644a07cefcf584d450ad70c5708926c252/rust/arrow/benches/sort_kernel.rs#L76-L106)
 are:

   
   ```
   sort 2^12 time:   [753.58 us 755.43 us 758.14 us]
   sort nulls 2^12   time:   [633.41 us 635.28 us 637.51 us]
   
   sort 2^12 limit 10time:   [49.246 us 49.820 us 50.667 us]
   sort 2^12 limit 100   time:   [115.11 us 116.26 us 117.76 us]
   sort 2^12 limit 1000  time:   [645.91 us 654.36 us 663.78 us]
   sort nulls 2^12 limit 10  time:   [66.283 us 66.725 us 67.347 us]
   sort nulls 2^12 limit 100 time:   [76.281 us 77.907 us 79.783 us]
   sort nulls 2^12 limit 1000time:   [258.98 us 260.32 us 262.24 us]
   
   
   ```
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-01 Thread GitBox


sundy-li edited a comment on pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#issuecomment-788550187


   I removed `pdqsort`, because the performance didn't show any improvement 
during the benches in my pc.
   Added a separate function `partial_sort `as @jorgecarleitao suggested.
   
   I did not find any other `partial_sort` function in Rust like 
std::partial_sort in c++.  So I created 
[one](https://github.com/sundy-li/partial_sort).
   
   And the [benche results about partial_sort in 
arrow](https://github.com/apache/arrow/blob/f19fc1644a07cefcf584d450ad70c5708926c252/rust/arrow/benches/sort_kernel.rs#L76-L106)
 are:

   
   ```
   sort 2^12 time:   [753.58 us 755.43 us 758.14 us]
   sort nulls 2^12   time:   [633.41 us 635.28 us 637.51 us]
   
   sort 2^12 limit 10time:   [49.246 us 49.820 us 50.667 us]
   sort 2^12 limit 100   time:   [115.11 us 116.26 us 117.76 us]
   sort 2^12 limit 1000  time:   [113.61 us 114.21 us 114.85 us]
   sort nulls 2^12 limit 10  time:   [66.283 us 66.725 us 67.347 us]
   sort nulls 2^12 limit 100 time:   [66.455 us 66.932 us 67.504 us]
   sort nulls 2^12 limit 1000time:   [65.030 us 65.218 us 65.412 us]
   
   
   ```
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-01 Thread GitBox


sundy-li edited a comment on pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#issuecomment-788550187


   I removed `pdqsort`, because the performance didn't show any improvement 
during the benches.
   Added a separate function `partial_sort `as @jorgecarleitao suggested.
   
   I did not find any other `partial_sort` function in Rust like 
std::partial_sort in c++.  So I created 
[one](https://github.com/sundy-li/partial_sort).
   
   And the [benche results about partial_sort in 
arrow](https://github.com/apache/arrow/blob/f19fc1644a07cefcf584d450ad70c5708926c252/rust/arrow/benches/sort_kernel.rs#L76-L106)
 are:

   
   ```
   sort 2^12 time:   [753.58 us 755.43 us 758.14 us]
   sort nulls 2^12   time:   [633.41 us 635.28 us 637.51 us]
   
   sort 2^12 limit 10time:   [49.246 us 49.820 us 50.667 us]
   sort 2^12 limit 100   time:   [115.11 us 116.26 us 117.76 us]
   sort 2^12 limit 1000  time:   [113.61 us 114.21 us 114.85 us]
   sort nulls 2^12 limit 10  time:   [66.283 us 66.725 us 67.347 us]
   sort nulls 2^12 limit 100 time:   [66.455 us 66.932 us 67.504 us]
   sort nulls 2^12 limit 1000time:   [65.030 us 65.218 us 65.412 us]
   
   
   ```
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-01 Thread GitBox


sundy-li edited a comment on pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#issuecomment-788550187


   I removed `pdqsort`, because the performance didn't show any improvement 
during the benches.
   Added a separate function `partial_sort `as @jorgecarleitao suggested.
   
   I did not find any other `partial_sort` function in Rust like 
std::partial_sort in c++.  So I created 
[one](https://github.com/sundy-li/partial_sort).
   
   And the [benche results about partial_sort in 
arrow](https://github.com/apache/arrow/blob/f19fc1644a07cefcf584d450ad70c5708926c252/rust/arrow/benches/sort_kernel.rs#L76-L106)
 are:

   
   ```
   sort 2^10 time:   [156.26 us 156.97 us 157.88 us]
   sort 2^12 time:   [753.58 us 755.43 us 758.14 us]
   sort nulls 2^10   time:   [133.73 us 134.42 us 135.27 us]
   sort nulls 2^12   time:   [633.41 us 635.28 us 637.51 us]
   
   sort 2^12 limit 10time:   [49.246 us 49.820 us 50.667 us]
   sort 2^12 limit 100   time:   [115.11 us 116.26 us 117.76 us]
   sort 2^12 limit 1000  time:   [113.61 us 114.21 us 114.85 us]
   sort nulls 2^12 limit 10  time:   [66.283 us 66.725 us 67.347 us]
   sort nulls 2^12 limit 100 time:   [66.455 us 66.932 us 67.504 us]
   sort nulls 2^12 limit 1000time:   [65.030 us 65.218 us 65.412 us]
   
   
   ```
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org