maxburke opened a new issue, #5470: URL: https://github.com/apache/arrow-datafusion/issues/5470
(Stemming from #5456) I've attached two parquet files. Both files contain a single column with 131072 rows, generated from Arrow with a single record batch. The `fsb16.parquet` file contains a column of type `FixedSizeBinary(16)`; `ints.parquet` contains a column of type `Int64`.

If I do an inner join on the ints with itself, I get a result set of the expected 131072 rows:

```
❯ create external table t0 stored as parquet location 'ints.parquet';
❯ select * from t0 inner join t0 as t1 on t0.ints = t1.ints;
+--------+--------+
...[snip]...
+--------+--------+
131072 rows in set. Query took 0.530 seconds.
```

But if I do the same query with the `FixedSizeBinary(16)` inputs, it returns 358946 rows (?):

```
❯ create external table t0 stored as parquet location 'fsb16.parquet';
❯ select * from t0 inner join t0 as t1 on t0.journey_id = t1.journey_id;
+----------------------------------+----------------------------------+
...[snip]...
+----------------------------------+----------------------------------+
358946 rows in set. Query took 2.073 seconds.
```

In this particular case, all the `FixedSizeBinary(16)` values are non-null, though I don't think that should make a difference.

[fsb16.parquet.gz](https://github.com/apache/arrow-datafusion/files/10875507/fsb16.parquet.gz)
[ints.parquet.gz](https://github.com/apache/arrow-datafusion/files/10875508/ints.parquet.gz)
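For anyone who wants to reproduce without the attachments: the original generation code isn't included in the issue, but a minimal sketch along these lines (using the Rust `arrow` and `parquet` crates) should produce a comparable `fsb16.parquet`. The column name `journey_id` matches the query above; the specific 16-byte values (distinct little-endian counters here) are an assumption, since the actual contents of the attached file aren't shown.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::FixedSizeBinaryArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Single non-nullable FixedSizeBinary(16) column, mirroring the attached file's schema.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "journey_id",
        DataType::FixedSizeBinary(16),
        false,
    )]));

    // 131072 distinct, non-null 16-byte values (assumed: counter in the first 8 bytes).
    let values: Vec<[u8; 16]> = (0u64..131_072)
        .map(|i| {
            let mut v = [0u8; 16];
            v[..8].copy_from_slice(&i.to_le_bytes());
            v
        })
        .collect();
    let array = FixedSizeBinaryArray::try_from_iter(values.into_iter())?;

    // A single record batch, written out through the parquet ArrowWriter.
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)])?;
    let file = File::create("fsb16.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```

Running the self-join from the issue against a file generated this way in `datafusion-cli` should be enough to check whether the row count comes back as 131072 or inflated.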
