ZhangHuiGui commented on PR #40998:
URL: https://github.com/apache/arrow/pull/40998#issuecomment-2060699644

   > are_cols_in_encoding_order=true
   
   It's nice question!
   
   Use `are_cols_in_encoding_order=true` in Grouper to perform correct 
comparison. This process can be broken down into the following three steps:
   
   1. **Column sort**
   After the column sorting in `FromColumnMetadataVector` is completed, an 
`inverse_column_order` will be generated to identify the column order after 
sorting. For example, [0,1,2] before sorting, and `inverse_column_order` will 
save [2, 1, 0] after sorting. At the same time, a `column_offsets` array will 
also be generated, used to identify the offset position of the sorted column in 
the row, so that the column data can be accessed in `RowTableImpl`, it should 
be noted that the order of this array is constructed according to the **input 
column order.**
   
   2. **Encode columns into RowTableEncoder in sorted order**
   
https://github.com/apache/arrow/blob/63c91ff32b8547f0bfd6ff827d7a6901d9e7ca5c/cpp/src/arrow/compute/row/grouper.cc#L680
     `column_order` has been changed in step 1.
    
https://github.com/apache/arrow/blob/63c91ff32b8547f0bfd6ff827d7a6901d9e7ca5c/cpp/src/arrow/compute/row/encode_internal.cc#L34-L44
   
   3. **Compare the data in RowTableEncoder and `RowTableImpl`** (at this time, 
the RowTableEncoder data in the grouper is already sorted).
   Compare through KeyCompare::CompareColumnsToRows, there are two main places 
where `are_cols_in_encoding_order` is used
   a. Get the column_offsets in step1. If are_cols_in_encoding_order = true in 
this process, you can directly use the sorting column to get the offset. The 
current column is already a sorted column; otherwise, you need to use 
inverse_column_order to access the offset.
   
https://github.com/apache/arrow/blob/63c91ff32b8547f0bfd6ff827d7a6901d9e7ca5c/cpp/src/arrow/compute/row/compare_internal.cc#L366-L369
    b. Get the position of the current column in null_mask. You also need to 
judge whether you need to take col-idx from inverse_column_order or directly 
use the input column as col-idx according to `are_cols_in_encoding_order`. The 
problem to be solved by the current PR is actually the loss in 
`NullUpdateColumnToRowImp_avx2` The check for are_cols_in_encoding_order = true 
resulted in getting a wrong bit_id as a judgment condition for whether the 
corresponding column is empty.
     `NullUpdateColumnToRow` 
https://github.com/apache/arrow/blob/63c91ff32b8547f0bfd6ff827d7a6901d9e7ca5c/cpp/src/arrow/compute/row/compare_internal.cc#L53-L54
    `NullUpdateColumnToRowImp_avx2`
   
https://github.com/apache/arrow/blob/63c91ff32b8547f0bfd6ff827d7a6901d9e7ca5c/cpp/src/arrow/compute/row/compare_internal_avx2.cc#L47
   
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to