llama90 opened a new issue, #38074:
URL: https://github.com/apache/arrow/issues/38074
### Describe the bug, including details regarding any error messages, version, and platform.
> Before explaining, if recreating previously submitted issues like this is
considered a mistake, I apologize.
>
> * #37729
### Overview
The previous issue reports incorrect results from an inner join between two
tables, `table_1` and `table_2`. It is well explained there, but to reiterate:
both tables have columns `col_1`, `col_2`, and `col_3`, and `table_2` has an
additional `col_4`. From my investigation, `table_2.parquet` has 7 more
records than `table_1`, but for `col_1`, `col_2`, and `col_3` it contains the
same values as `table_1`.
`table_1` has 6282 records and `table_2` has 6289.
So an inner join using `col_1`, `col_2`, and `col_3` as the join keys should
return 6282 records, regardless of the order of the tables.
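As a sanity check on that expectation, here is a minimal standard-library sketch of a reference inner-join row count. It does not use Arrow at all. The `Key` alias, `InnerJoinRowCount`, and `MakeTable` are hypothetical names, the `int64_t` column type is an assumption, and the data is synthetic: it assumes each key tuple is unique within a table, which the row counts above suggest but which I have not verified against the Parquet files.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical row key: the three join columns (col_1, col_2, col_3).
// The real column types are not stated in the issue; int64 is an assumption.
using Key = std::tuple<std::int64_t, std::int64_t, std::int64_t>;

// Reference inner-join row count: for every probe row, add the number of
// build rows that carry the same key.
std::size_t InnerJoinRowCount(const std::vector<Key>& build,
                              const std::vector<Key>& probe) {
  std::map<Key, std::size_t> build_counts;
  for (const Key& k : build) ++build_counts[k];

  std::size_t rows = 0;
  for (const Key& k : probe) {
    auto it = build_counts.find(k);
    if (it != build_counts.end()) rows += it->second;
  }
  return rows;
}

// Synthetic stand-in for the issue's tables: n rows with distinct keys,
// so a larger table contains every key of a smaller one.
std::vector<Key> MakeTable(std::size_t n) {
  std::vector<Key> table;
  table.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    const auto v = static_cast<std::int64_t>(i);
    table.emplace_back(v, v * 2, v * 3);
  }
  return table;
}
```

Under those assumptions, `InnerJoinRowCount(MakeTable(6282), MakeTable(6289))` and the same call with the arguments swapped both give 6282, which is the count a correct hash join should report for either table order.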
### Reason
The cause is an issue in the **BloomFilter** logic.
When testing in C++, setting the `disable_bloom_filter` option to `true` makes
the join return the correct result.
```cpp
// With the Bloom filter (default)
HashJoinNodeOptions join_opts_with_bloom{
    JoinType::INNER,
    {"col_1", "col_2", "col_3"},  // left keys
    {"col_1", "col_2", "col_3"},  // right keys
    literal(true),                // residual filter
    "_l",                         // output suffix for left columns
    "_r"                          // output suffix for right columns
};

// Without the Bloom filter
HashJoinNodeOptions join_opts_without_bloom{
    JoinType::INNER,
    {"col_1", "col_2", "col_3"},
    {"col_1", "col_2", "col_3"},
    literal(true),
    "_l",
    "_r",
    true  // disable_bloom_filter
};
```
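For context on why missing rows point at a bug rather than at normal Bloom filter behaviour: a correct Bloom filter can report false positives, but it never rejects a key that was inserted. The sketch below is a toy filter written only to illustrate that property. It is not Arrow's implementation, and `ToyBloomFilter`, `CountFalseNegatives`, the two seeds, and the mixing function are all made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy Bloom filter over 64-bit keys using two hash probes per key.
// Illustrative only; this is not how Arrow implements its filter.
class ToyBloomFilter {
 public:
  explicit ToyBloomFilter(std::size_t num_bits) : bits_(num_bits, false) {
    assert(num_bits > 0);
  }

  void Insert(std::uint64_t key) {
    bits_[Mix(key, kSeed1) % bits_.size()] = true;
    bits_[Mix(key, kSeed2) % bits_.size()] = true;
  }

  // May return true for a key that was never inserted (false positive),
  // but never returns false for a key that was inserted.
  bool MayContain(std::uint64_t key) const {
    return bits_[Mix(key, kSeed1) % bits_.size()] &&
           bits_[Mix(key, kSeed2) % bits_.size()];
  }

 private:
  static constexpr std::uint64_t kSeed1 = 0x9E3779B97F4A7C15ULL;
  static constexpr std::uint64_t kSeed2 = 0xC2B2AE3D27D4EB4FULL;

  // Simple deterministic bit mixer; any deterministic hash would do here.
  static std::uint64_t Mix(std::uint64_t x, std::uint64_t seed) {
    x ^= seed;
    x ^= x >> 33;
    x *= 0xFF51AFD7ED558CCDULL;
    x ^= x >> 33;
    return x;
  }

  std::vector<bool> bits_;
};

// Inserts keys 0..num_keys-1, then counts how many of them the filter
// rejects. For a correct filter this is always 0, even when it is saturated.
std::size_t CountFalseNegatives(std::size_t num_keys, std::size_t num_bits) {
  ToyBloomFilter filter(num_bits);
  for (std::uint64_t k = 0; k < num_keys; ++k) filter.Insert(k);

  std::size_t missed = 0;
  for (std::uint64_t k = 0; k < num_keys; ++k) {
    if (!filter.MayContain(k)) ++missed;
  }
  return missed;
}
```

Because the filter is only a pre-check before the hash table lookup, false positives cost extra work but cannot change the join result. Rows dropped from an inner join therefore suggest that the filter is being built or probed inconsistently somewhere.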
Further investigation turned up the following:
1. The error occurs only when the number of matching records between the left
and right tables exceeds 1024.
2. The error occurs when the number of matching records is close to the number
of records in the right table.
3. The number of matching records does not appear to have any correlation
with the number of records in the left table.
This is what I have found so far, and I am working on a fix.
However, I am new to Arrow, so it is taking longer than expected. I will keep
working on it.
Any advice or insights from more experienced contributors would be greatly
appreciated, and I would be glad if someone with more expertise wanted to take
this on first. That is why I am sharing it here.
### Component(s)
C++