llama90 opened a new issue, #38074:
URL: https://github.com/apache/arrow/issues/38074

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   > Apologies in advance if re-filing a previously submitted issue like this is 
considered a mistake.
   >
   > * #37729 
   
   
   ### Overview
   
   The previous issue (#37729) reports incorrect results when performing an 
inner join between two tables, `table_1` and `table_2`. It was well explained 
there; this issue restates the problem and records what I have found so far.
   
   Both tables have the columns `col_1`, `col_2`, and `col_3`; `table_2` has an 
additional `col_4`. From my investigation, `table_2.parquet` has 7 more records 
than `table_1`, but otherwise contains the same `col_1`, `col_2`, and `col_3` 
values as `table_1`.
   
   `table_1` has 6282 records and `table_2` has 6289.
   
   So, an inner join using `col_1`, `col_2`, and `col_3` as the join keys 
should return 6282 rows, regardless of the order of the tables.
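
   The expected count can be cross-checked without Arrow. The following is a 
minimal sketch, not Arrow code: `Key`, `InnerJoinCount`, `CheckJoinIsSymmetric`, 
and `MakeKeys` are hypothetical names, the key columns are assumed to hold 
integers, and the synthetic keys are assumed to be unique, with the 7 extra 
rows of the larger table having no partner in the smaller one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// A row's join key: the values of col_1, col_2, col_3 (assumed integers here).
using Key = std::array<std::int64_t, 3>;

// Reference inner-join row count: each left row contributes one output row
// per right row that carries the same key.
std::size_t InnerJoinCount(const std::vector<Key>& left,
                           const std::vector<Key>& right) {
  std::map<Key, std::size_t> right_counts;
  for (const Key& k : right) ++right_counts[k];

  std::size_t rows = 0;
  for (const Key& k : left) {
    auto it = right_counts.find(k);
    if (it != right_counts.end()) rows += it->second;
  }
  return rows;
}

// The row count of an inner join must not depend on the table order.
void CheckJoinIsSymmetric(const std::vector<Key>& a, const std::vector<Key>& b) {
  assert(InnerJoinCount(a, b) == InnerJoinCount(b, a));
}

// Hypothetical stand-in for the Parquet data: n rows with unique keys.
std::vector<Key> MakeKeys(std::size_t n) {
  std::vector<Key> keys;
  keys.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    const auto v = static_cast<std::int64_t>(i);
    keys.push_back(Key{v, v * 2, v * 3});
  }
  return keys;
}
```

   Under these assumptions, `InnerJoinCount(MakeKeys(6282), MakeKeys(6289))` 
and the swapped call both count 6282 rows, which is the number the real join 
is expected to return.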
   
   ### Reason
   
   The cause is an issue in the **BloomFilter** logic.
   
   When testing in C++, setting the `disable_bloom_filter` option to `true` 
makes the join produce the expected result.
   
   ```cpp
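   // With the Bloom filter (default: disable_bloom_filter = false)
   HashJoinNodeOptions join_opts_bloom{
       JoinType::INNER,
       {"col_1", "col_2", "col_3"},  // left keys
       {"col_1", "col_2", "col_3"},  // right keys
       literal(true),                // filter
       "_l",                         // output suffix for left
       "_r"                          // output suffix for right
   };
   
   // Without the Bloom filter
   HashJoinNodeOptions join_opts_no_bloom{
       JoinType::INNER,
       {"col_1", "col_2", "col_3"},  // left keys
       {"col_1", "col_2", "col_3"},  // right keys
       literal(true),                // filter
       "_l",                         // output suffix for left
       "_r",                         // output suffix for right
       true                          // disable_bloom_filter
   };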
   ```
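
   For context on why toggling this option should not change the result: a 
Bloom filter may report false positives but never false negatives, so using it 
as a pre-filter should only skip rows that cannot match. The sketch below is a 
toy filter written for illustration, not Arrow's implementation; 
`ToyBloomFilter`, `CountMatches`, the sizing, and the hash scheme are all 
assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Toy Bloom filter over 64-bit keys (illustrative only).
class ToyBloomFilter {
 public:
  ToyBloomFilter(std::size_t num_bits, std::size_t num_hashes)
      : bits_(num_bits, false), num_hashes_(num_hashes) {
    assert(num_bits > 0 && num_hashes > 0);
  }

  void Insert(std::uint64_t key) {
    for (std::size_t i = 0; i < num_hashes_; ++i) bits_[Position(key, i)] = true;
  }

  // false: the key was definitely never inserted.
  // true:  the key may have been inserted (false positives are possible).
  bool MayContain(std::uint64_t key) const {
    for (std::size_t i = 0; i < num_hashes_; ++i) {
      if (!bits_[Position(key, i)]) return false;
    }
    return true;
  }

 private:
  // Double hashing: position_i = (h1 + i * h2) mod num_bits.
  std::size_t Position(std::uint64_t key, std::size_t i) const {
    const std::uint64_t h1 = std::hash<std::uint64_t>{}(key);
    const std::uint64_t h2 =
        std::hash<std::uint64_t>{}(key ^ 0x9e3779b97f4a7c15ULL) | 1;
    return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
  }

  std::vector<bool> bits_;
  std::size_t num_hashes_;
};

// Counts probe keys that have a partner on the build side, optionally
// rejecting keys early through the Bloom filter before the exact lookup.
std::size_t CountMatches(const std::vector<std::uint64_t>& build,
                         const std::vector<std::uint64_t>& probe,
                         bool use_bloom_filter) {
  std::unordered_set<std::uint64_t> table(build.begin(), build.end());
  ToyBloomFilter filter(build.size() * 8 + 1, 3);
  for (std::uint64_t k : build) filter.Insert(k);

  std::size_t matches = 0;
  for (std::uint64_t k : probe) {
    if (use_bloom_filter && !filter.MayContain(k)) continue;  // early reject
    if (table.count(k) > 0) ++matches;
  }
  return matches;
}
```

   With a correct filter, `CountMatches(build, probe, true)` and 
`CountMatches(build, probe, false)` return the same value for any input. A 
join whose row count changes when the filter is disabled therefore points to 
rows being rejected that should have passed.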
   
   
   Further investigation turned up the following:
   
   1. The bug only appears when the number of matching records between the left 
and right tables exceeds 1024.
   2. Errors occur when the number of matching records is close to the number 
of records in the right table.
   3. The number of matching records does not appear to be correlated with the 
number of records in the left table.
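
   To experiment with these conditions outside of the original Parquet files, 
synthetic key sets with a controlled number of matches may help. This is only 
a sketch: `KeySets` and `MakeKeySets` are hypothetical helpers, and a single 
integer key stands in for the three key columns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct KeySets {
  std::vector<std::int64_t> left;
  std::vector<std::int64_t> right;
};

// Builds unique keys so that exactly n_match keys appear on both sides.
KeySets MakeKeySets(std::size_t n_left, std::size_t n_right,
                    std::size_t n_match) {
  assert(n_match <= n_left && n_match <= n_right);
  KeySets out;
  out.left.reserve(n_left);
  out.right.reserve(n_right);
  // Shared keys: 0 .. n_match-1 appear on both sides.
  for (std::size_t i = 0; i < n_match; ++i) {
    out.left.push_back(static_cast<std::int64_t>(i));
    out.right.push_back(static_cast<std::int64_t>(i));
  }
  // Left-only keys: negative values, which never occur on the right.
  for (std::size_t i = n_match; i < n_left; ++i) {
    out.left.push_back(-static_cast<std::int64_t>(i) - 1);
  }
  // Right-only keys: values >= n_match, which never occur on the left.
  for (std::size_t i = n_match; i < n_right; ++i) {
    out.right.push_back(static_cast<std::int64_t>(i));
  }
  return out;
}
```

   For example, `MakeKeySets(6282, 6289, 6282)` mirrors the sizes reported in 
this issue, and varying `n_match` around 1024 would be one way to probe the 
first finding.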
   
   This is what I have gathered about the issue so far, and I am working on a 
fix.
   
   However, I am new to Arrow, so it is taking longer than expected. I will 
keep working on it regardless.
   
   Any advice or insights from more experienced contributors would be greatly 
appreciated. I would also be glad if someone with more expertise picked this up 
first, which is why I am sharing it here.
   
   
   ### Component(s)
   
   C++

