maxburke opened a new issue, #5470: URL: https://github.com/apache/arrow-datafusion/issues/5470
(Stemming from #5456) I've attached two parquet files. Both files contain a single column with 131072 rows, generated from Arrow with a single record batch. The `fsb16.parquet` file contains a column of type `FixedSizeBinary(16)`; `ints.parquet` contains a column of type `Int64`.

If I do an inner join on the ints with itself, I get a result set of the expected 131072 rows:

```
❯ create external table t0 stored as parquet location 'ints.parquet';
❯ select * from t0 inner join t0 as t1 on t0.ints = t1.ints;
+--------+--------+
...[snip]...
+--------+--------+
131072 rows in set. Query took 0.530 seconds.
```

But if I do the same query with the `FixedSizeBinary(16)` inputs, it returns 358946 rows (?):

```
❯ create external table t0 stored as parquet location 'fsb16.parquet';
❯ select * from t0 inner join t0 as t1 on t0.journey_id = t1.journey_id;
+----------------------------------+----------------------------------+
...[snip]...
+----------------------------------+----------------------------------+
358946 rows in set. Query took 2.073 seconds.
```

In this particular case, all the `FixedSizeBinary(16)` values are non-null, though I don't think that should make a difference.

[fsb16.parquet.gz](https://github.com/apache/arrow-datafusion/files/10875507/fsb16.parquet.gz)
[ints.parquet.gz](https://github.com/apache/arrow-datafusion/files/10875508/ints.parquet.gz)
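For anyone who wants to reproduce without the attachments: the original generation code isn't included in the issue, but a minimal sketch along these lines (using the Rust `arrow` and `parquet` crates) should produce a comparable `fsb16.parquet`. The column name `journey_id` matches the query above; the specific 16-byte values (distinct little-endian counters here) are an assumption, since the actual contents of the attached file aren't shown.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::FixedSizeBinaryArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Single non-nullable FixedSizeBinary(16) column, mirroring the attached file's schema.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "journey_id",
        DataType::FixedSizeBinary(16),
        false,
    )]));

    // 131072 distinct, non-null 16-byte values (assumed: counter in the first 8 bytes).
    let values: Vec<[u8; 16]> = (0u64..131_072)
        .map(|i| {
            let mut v = [0u8; 16];
            v[..8].copy_from_slice(&i.to_le_bytes());
            v
        })
        .collect();
    let array = FixedSizeBinaryArray::try_from_iter(values.into_iter())?;

    // A single record batch, written out through the parquet ArrowWriter.
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)])?;
    let file = File::create("fsb16.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```

Running the self-join from the issue against a file generated this way in `datafusion-cli` should be enough to check whether the row count comes back as 131072 or inflated.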
