llama90 commented on issue #38074:
URL: https://github.com/apache/arrow/issues/38074#issuecomment-1750945758
Hello, I am creating the schema and generating data as follows for testing.
It seems like a different issue from a Swiss join. Even in the smallest
case, when both table_1 and table_2 have 1025 identical records each, the
result still shows 1024 join results. When you change the number of right
records to 1537, you will still get 1025 values.
Based on the information you mentioned, I became interested and conducted a
test by increasing the number of records to 100,000.
* Left: 100,000
* R: 100,000
* Match: 100,000
* without BloomFilter: 100,000
* with BloomFilter: 84,334
In this case, I was able to confirm that the results were coming out in
batches of 32,768 each. Maybe it is a morsel unit.
* Left: 32,768
* R: 32,768
* Match: 32,768
* without BloomFilter: 32,768
* with BloomFilter: 17,010
In my opinion, it seems clear that there might be an issue with the Bloom
Filter. Cases where accurate results were obtained, whether using BloomFilter
or not, were as follows:
* When the number of match records is 1024 or less, regardless of the number
of records in the Left and Right tables.
* When the number of match records exceeds 1024, but the number of records
in the Right table is greater than the number of match records.
* For example, R - 5000, M - 5000, L - 10000 (In this case, reducing it
to 9000 records results in 4904 as the output).
Below is an example of the tables.
#### table_1 (`left`)
* schema (`{timestamp(TimeUnit::MICRO, "UTC"), large_utf8(), large_utf8()}`)
* col_1: All values are `2023-09-14 21:43:18.678917`
* col_2: 1, 2, 3, 4, 5 ... N, -1, -1, -1, -1, -1, ... -1
* col_3: All values are `foo`
* `N` is the number of matching records between `table 1` and `table 2`
#### table_2 (`right`)
* schema (`{timestamp(TimeUnit::MICRO, "UTC"), large_utf8(), large_utf8(),
large_utf8()}`)
* col_1: All values are `2023-09-14 21:43:18.678917`
* col_2: 1, 2, 3, 4, 5 ... M
* col_3: All values are `foo`
* col_4: All values are `bar`
* `M` is the number of records for `table 2`
#### Example
* The number of records for `table 1`: 5
* N: 3
* M: 4
table 1
col_1 | col_2 | col_3
--- | --- | ---
2023-09-14 21:43:18.678917 | 1 | foo
2023-09-14 21:43:18.678917 | 2 | foo
2023-09-14 21:43:18.678917 | 3 | foo
2023-09-14 21:43:18.678917 | -1 | foo
2023-09-14 21:43:18.678917 | -1 | foo
table 2
col_1 | col_2 | col_3 | col_4
--- | --- | --- | ---
2023-09-14 21:43:18.678917 | 1 | foo | bar
2023-09-14 21:43:18.678917 | 2 | foo | bar
2023-09-14 21:43:18.678917 | 3 | foo | bar
2023-09-14 21:43:18.678917 | 4 | foo | bar
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]