[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li edited a comment on pull request #9602: URL: https://github.com/apache/arrow/pull/9602#issuecomment-797294233 Hi, all. From other user's advice, partial_sort could be ``` pub fn partial_sort(v: [T], limit: usize, mut is_less: F) where F: FnMut(, ) -> Ordering, { let (before, _mid, _after) = v.select_nth_unstable_by(limit, is_less); before.sort_unstable_by(is_less); } ``` So I abandon to use of custom partial_sort. This could fix the issue and risk we are arguing about. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li edited a comment on pull request #9602: URL: https://github.com/apache/arrow/pull/9602#issuecomment-792216219 > For example, using a priority queue to keep only the top k values in memory. Yes, but lots of codes may duplicate with sort kernel. partial_sort used priority queue inside. It maybe good to do sorting in pipeline OLAP systems In ClickHouse, PartialSortingTransform(Each block in each thread) --> MergeSortingTransform(Blocks to one block in each thread) --> MergingSortedTransform(N Block in N Thread to one block) . ``` ┌─explain┐ │ (Expression) │ │ ExpressionTransform│ │ (Limit) │ │ Limit│ │ (MergingSorted)│ │ MergingSortedTransform 16 → 1 │ │ (MergeSorting) │ │ MergeSortingTransform × 16 │ │ (PartialSorting) │ │ LimitsCheckingTransform × 16 │ │ PartialSortingTransform × 16 │ │ (Expression) │ │ ExpressionTransform × 16 │ │ (SettingQuotaAndLimits) │ │ (ReadFromStorage) │ │ NumbersMt × 16 0 → 1 │ └┘ ``` > I just feel that the sheer amount of uncommented unsafe on it is an indication that there is some opportunity to revisit and further improve the crate, thereby reducing the risks of undefined behavior (UB). @alamb @jorgecarleitao Thanks for all your reviews. I also have the consideration about unsafe codes in partial_sort may break `arrow`, because it was just created, without any used in production(BTW I am new to rust). We can keep this MR open currently until you think it's safe enough or must have it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li edited a comment on pull request #9602: URL: https://github.com/apache/arrow/pull/9602#issuecomment-788550187 I removed `pdqsort`, because the performance didn't show any improvement during the benches in my pc. Added a separate function `partial_sort `as @jorgecarleitao suggested. I did not find any other `partial_sort` function in Rust like std::partial_sort in c++. So I created [one](https://github.com/sundy-li/partial_sort). And the [benche results about partial_sort in arrow](https://github.com/apache/arrow/blob/dc3167e3826177af1a6feec516a528a4ee38c674/rust/arrow/benches/sort_kernel.rs#L76-L106) are: ``` sort 2^12 time: [753.58 us 755.43 us 758.14 us] sort nulls 2^12 time: [633.41 us 635.28 us 637.51 us] sort 2^12 limit 10time: [49.246 us 49.820 us 50.667 us] sort 2^12 limit 100 time: [115.11 us 116.26 us 117.76 us] sort 2^12 limit 1000 time: [645.91 us 654.36 us 663.78 us] sort nulls 2^12 limit 10 time: [66.283 us 66.725 us 67.347 us] sort nulls 2^12 limit 100 time: [76.281 us 77.907 us 79.783 us] sort nulls 2^12 limit 1000time: [258.98 us 260.32 us 262.24 us] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li edited a comment on pull request #9602: URL: https://github.com/apache/arrow/pull/9602#issuecomment-788550187 I removed `pdqsort`, because the performance didn't show any improvement during the benches in my pc. Added a separate function `partial_sort `as @jorgecarleitao suggested. I did not find any other `partial_sort` function in Rust like std::partial_sort in c++. So I created [one](https://github.com/sundy-li/partial_sort). And the [benche results about partial_sort in arrow](https://github.com/apache/arrow/blob/f19fc1644a07cefcf584d450ad70c5708926c252/rust/arrow/benches/sort_kernel.rs#L76-L106) are: ``` sort 2^12 time: [753.58 us 755.43 us 758.14 us] sort nulls 2^12 time: [633.41 us 635.28 us 637.51 us] sort 2^12 limit 10time: [49.246 us 49.820 us 50.667 us] sort 2^12 limit 100 time: [115.11 us 116.26 us 117.76 us] sort 2^12 limit 1000 time: [645.91 us 654.36 us 663.78 us] sort nulls 2^12 limit 10 time: [66.283 us 66.725 us 67.347 us] sort nulls 2^12 limit 100 time: [76.281 us 77.907 us 79.783 us] sort nulls 2^12 limit 1000time: [258.98 us 260.32 us 262.24 us] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li edited a comment on pull request #9602: URL: https://github.com/apache/arrow/pull/9602#issuecomment-788550187 I removed `pdqsort`, because the performance didn't show any improvement during the benches in my pc. Added a separate function `partial_sort `as @jorgecarleitao suggested. I did not find any other `partial_sort` function in Rust like std::partial_sort in c++. So I created [one](https://github.com/sundy-li/partial_sort). And the [benche results about partial_sort in arrow](https://github.com/apache/arrow/blob/f19fc1644a07cefcf584d450ad70c5708926c252/rust/arrow/benches/sort_kernel.rs#L76-L106) are: ``` sort 2^12 time: [753.58 us 755.43 us 758.14 us] sort nulls 2^12 time: [633.41 us 635.28 us 637.51 us] sort 2^12 limit 10time: [49.246 us 49.820 us 50.667 us] sort 2^12 limit 100 time: [115.11 us 116.26 us 117.76 us] sort 2^12 limit 1000 time: [113.61 us 114.21 us 114.85 us] sort nulls 2^12 limit 10 time: [66.283 us 66.725 us 67.347 us] sort nulls 2^12 limit 100 time: [66.455 us 66.932 us 67.504 us] sort nulls 2^12 limit 1000time: [65.030 us 65.218 us 65.412 us] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li edited a comment on pull request #9602: URL: https://github.com/apache/arrow/pull/9602#issuecomment-788550187 I removed `pdqsort`, because the performance didn't show any improvement during the benches. Added a separate function `partial_sort `as @jorgecarleitao suggested. I did not find any other `partial_sort` function in Rust like std::partial_sort in c++. So I created [one](https://github.com/sundy-li/partial_sort). And the [benche results about partial_sort in arrow](https://github.com/apache/arrow/blob/f19fc1644a07cefcf584d450ad70c5708926c252/rust/arrow/benches/sort_kernel.rs#L76-L106) are: ``` sort 2^12 time: [753.58 us 755.43 us 758.14 us] sort nulls 2^12 time: [633.41 us 635.28 us 637.51 us] sort 2^12 limit 10time: [49.246 us 49.820 us 50.667 us] sort 2^12 limit 100 time: [115.11 us 116.26 us 117.76 us] sort 2^12 limit 1000 time: [113.61 us 114.21 us 114.85 us] sort nulls 2^12 limit 10 time: [66.283 us 66.725 us 67.347 us] sort nulls 2^12 limit 100 time: [66.455 us 66.932 us 67.504 us] sort nulls 2^12 limit 1000time: [65.030 us 65.218 us 65.412 us] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sundy-li edited a comment on pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel
sundy-li edited a comment on pull request #9602: URL: https://github.com/apache/arrow/pull/9602#issuecomment-788550187 I removed `pdqsort`, because the performance didn't show any improvement during the benches. Added a separate function `partial_sort `as @jorgecarleitao suggested. I did not find any other `partial_sort` function in Rust like std::partial_sort in c++. So I created [one](https://github.com/sundy-li/partial_sort). And the [benche results about partial_sort in arrow](https://github.com/apache/arrow/blob/f19fc1644a07cefcf584d450ad70c5708926c252/rust/arrow/benches/sort_kernel.rs#L76-L106) are: ``` sort 2^10 time: [156.26 us 156.97 us 157.88 us] sort 2^12 time: [753.58 us 755.43 us 758.14 us] sort nulls 2^10 time: [133.73 us 134.42 us 135.27 us] sort nulls 2^12 time: [633.41 us 635.28 us 637.51 us] sort 2^12 limit 10time: [49.246 us 49.820 us 50.667 us] sort 2^12 limit 100 time: [115.11 us 116.26 us 117.76 us] sort 2^12 limit 1000 time: [113.61 us 114.21 us 114.85 us] sort nulls 2^12 limit 10 time: [66.283 us 66.725 us 67.347 us] sort nulls 2^12 limit 100 time: [66.455 us 66.932 us 67.504 us] sort nulls 2^12 limit 1000time: [65.030 us 65.218 us 65.412 us] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org