ZhangHuiGui commented on PR #40998: URL: https://github.com/apache/arrow/pull/40998#issuecomment-2060699644
> are_cols_in_encoding_order=true It's nice question! Use `are_cols_in_encoding_order=true` in Grouper to perform correct comparison. This process can be broken down into the following three steps: 1. **Column sort** After the column sorting in `FromColumnMetadataVector` is completed, an `inverse_column_order` will be generated to identify the column order after sorting. For example, [0,1,2] before sorting, and `inverse_column_order` will save [2, 1, 0] after sorting. At the same time, a `column_offsets` array will also be generated, used to identify the offset position of the sorted column in the row, so that the column data can be accessed in `RowTableImpl`, it should be noted that the order of this array is constructed according to the **input column order.** 2. **Encode columns into RowTableEncoder in sorted order** https://github.com/apache/arrow/blob/63c91ff32b8547f0bfd6ff827d7a6901d9e7ca5c/cpp/src/arrow/compute/row/grouper.cc#L680 `column_order` has been changed in step 1. https://github.com/apache/arrow/blob/63c91ff32b8547f0bfd6ff827d7a6901d9e7ca5c/cpp/src/arrow/compute/row/encode_internal.cc#L34-L44 3. **Compare the data in RowTableEncoder and `RowTableImpl`** (at this time, the RowTableEncoder data in the grouper is already sorted). Compare through KeyCompare::CompareColumnsToRows, there are two main places where `are_cols_in_encoding_order` is used a. Get the column_offsets in step1. If are_cols_in_encoding_order = true in this process, you can directly use the sorting column to get the offset. The current column is already a sorted column; otherwise, you need to use inverse_column_order to access the offset. https://github.com/apache/arrow/blob/63c91ff32b8547f0bfd6ff827d7a6901d9e7ca5c/cpp/src/arrow/compute/row/compare_internal.cc#L366-L369 b. Get the position of the current column in null_mask. You also need to judge whether you need to take col-idx from inverse_column_order or directly use the input column as col-idx according to `are_cols_in_encoding_order`. The problem to be solved by the current PR is actually the loss in `NullUpdateColumnToRowImp_avx2` The check for are_cols_in_encoding_order = true resulted in getting a wrong bit_id as a judgment condition for whether the corresponding column is empty. `NullUpdateColumnToRow` https://github.com/apache/arrow/blob/63c91ff32b8547f0bfd6ff827d7a6901d9e7ca5c/cpp/src/arrow/compute/row/compare_internal.cc#L53-L54 `NullUpdateColumnToRowImp_avx2` https://github.com/apache/arrow/blob/63c91ff32b8547f0bfd6ff827d7a6901d9e7ca5c/cpp/src/arrow/compute/row/compare_internal_avx2.cc#L47 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org