llama90 commented on issue #38074:
URL: https://github.com/apache/arrow/issues/38074#issuecomment-1750945758

   Hello, I am creating the schema and generating data as follows for testing.
   
   It seems like a different issue from a Swiss join. Even in the smallest 
case, when both table_1 and table_2 have 1025 identical records each, the 
result still shows 1024 join results. When you change the number of right 
records to 1537, you will still get 1025 values.
   
   Based on the information you mentioned, I became interested and conducted a 
test by increasing the number of records to 100,000.
   
   * Left: 100,000
   * R: 100,000
   * Match: 100,000
       * without BloomFilter: 100,000
       * with BloomFilter: 84,334
   
   In this case, I was able to confirm that the results were coming out in 
batches of 32,768 each. Maybe it is a morsel unit.
   
   * Left: 32,768
   * R: 32,768
   * Match: 32,768
       * without BloomFilter: 32,768
       * with BloomFilter: 17,010
   
   In my opinion, it seems clear that there might be an issue with the Bloom 
Filter. Cases where accurate results were obtained, whether using BloomFilter 
or not, were as follows:
   
   * When the number of match records is 1024 or less, regardless of the number 
of records in the Left and Right tables.
   * When the number of match records exceeds 1024, but the number of records 
in the Right table is greater than the number of match records.
       * For example, R - 5000, M - 5000, L - 10000 (In this case, reducing it 
to 9000 records results in 4904 as the output).
   
   Below is an example of the tables.
   
   #### table_1 (`left`)
   
   * schema (`{timestamp(TimeUnit::MICRO, "UTC"), large_utf8(), large_utf8()}`)
       * col_1: All values are `2023-09-14 21:43:18.678917`
       * col_2: 1, 2, 3, 4, 5 ... N, -1, -1, -1, -1, -1, ... -1
       * col_3: All values are `foo` 
   * `N` is the number of matching records between `table 1` and `table 2`
   
   #### table_2 (`right`)
   
   * schema (`{timestamp(TimeUnit::MICRO, "UTC"), large_utf8(), large_utf8(), 
large_utf8()}`)
       * col_1: All values are `2023-09-14 21:43:18.678917`
       * col_2: 1, 2, 3, 4, 5 ... M
       * col_3: All values are `foo` 
       * col_4: All values are `bar`
   * `M` is the number of records for `table 2`
   
   #### Example
   
   * The number of records for `table 1`: 5
   * N: 3
   * M: 4
   
   table 1 
   
   col_1 | col_2 | col_3
   --- | --- | ---
   2023-09-14 21:43:18.678917 | 1 | foo
   2023-09-14 21:43:18.678917 | 2 | foo
   2023-09-14 21:43:18.678917 | 3 | foo
   2023-09-14 21:43:18.678917 | -1 | foo
   2023-09-14 21:43:18.678917 | -1 | foo
   
   table 2
   
   col_1 | col_2 | col_3 | col_4
   --- | --- | --- | ---
   2023-09-14 21:43:18.678917 | 1 | foo | bar
   2023-09-14 21:43:18.678917 | 2 | foo | bar
   2023-09-14 21:43:18.678917 | 3 | foo | bar
   2023-09-14 21:43:18.678917 | 4 | foo | bar
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to