franeklubi opened a new issue #1441: URL: https://github.com/apache/arrow-datafusion/issues/1441
**Describe the bug**

I came upon a bug while querying my custom Parquet dataset, which causes DataFusion to produce incoherent and incorrect results.

I tested my dataset in various ways, all of which produced the desired results:
- reading the Parquet files using Python pandas, then merging and filtering the data there
- encoding into CSV, and reading the data with DataFusion
- creating an SQLite database using the provided CSV files, and using the same queries there

**To Reproduce**

Steps to reproduce the behavior:
1. Download all the code and data I used for testing: [issue_data.zip](https://github.com/apache/arrow-datafusion/files/7703604/issue_data.zip)
   Inside there are the Parquet files and CSVs with exactly the same data (also, there's an SQLite database created from the provided CSV files).
2. Use the instructions included in `README.md` to reproduce the issue.

The query that fails when querying Parquet files with datafusion-cli:

```sql
-- 1. Distinct stop names
SELECT DISTINCT stop_name
FROM stop
INNER JOIN trip ON tid = trip_tid
WHERE line = '176'
ORDER BY stop_name NULLS LAST;
```

Changing only the column in the `WHERE` clause from `line` to `trip_line` produces the desired results.
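For reference, the SQLite cross-check described above can be sketched roughly as follows. This is only an illustration: the table layout (`trip(tid, line)`, `stop(trip_tid, trip_line, stop_name)`) and the sample rows are assumptions made up for this sketch and are not taken from `issue_data.zip`. The NULLS LAST ordering is done in Python so the sketch does not depend on the SQLite version.

```python
import sqlite3

# Hypothetical schema and toy rows; the real data lives in issue_data.zip.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trip (tid INTEGER, line TEXT);
    CREATE TABLE stop (trip_tid INTEGER, trip_line TEXT, stop_name TEXT);
    INSERT INTO trip VALUES (1, '176'), (2, '213');
    INSERT INTO stop VALUES
        (1, '176', 'Bartnicza'),
        (1, '176', 'Marcelin'),
        (1, '176', NULL),
        (2, '213', 'PKP Falenica');
""")


def distinct_stop_names(filter_column):
    """Run the issue's query, filtering on either `line` or `trip_line`."""
    # Only two known column names are accepted, so formatting into SQL is safe here.
    if filter_column not in ("line", "trip_line"):
        raise ValueError(f"unexpected column: {filter_column}")
    query = f"""
        SELECT DISTINCT stop_name
        FROM stop INNER JOIN trip ON tid = trip_tid
        WHERE {filter_column} = '176'
    """
    names = [row[0] for row in conn.execute(query)]
    # Emulate ORDER BY stop_name NULLS LAST: non-NULL names first, NULL at the end.
    return sorted(names, key=lambda name: (name is None, name or ""))


by_line = distinct_stop_names("line")
by_trip_line = distinct_stop_names("trip_line")
print(by_line)
print(by_trip_line)
print("results match:", by_line == by_trip_line)
```

With consistent data, both filters should return the same list; a mismatch between the two would point at the engine rather than the dataset.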
**Expected behavior**

Should produce these 27 rows:

```
Bartnicza
Bazyliańska
Bolesławicka
Brzezińska
Budowlana
Choszczówka
Chłodnia
Daniszewska
Fabryka Pomp
Insurekcji
Marcelin
Marywilska-Las
Ołówkowa
PKP Płudy
PKP Żerań
Parowozowa
Pelcowizna
Polnych Kwiatów
Raciborska
Rembielińska
Sadkowska
Smugowa
Starego Dębu
Zyndrama z Maszkowic
os.Marywilska
Śpiewaków
None
```

Query 1 from `README.md` (mentioned above) produces this incorrect set of 33 rows:

```
+----------------------+
| stop_name            |
+----------------------+
| Bartnicza            |
| Bazyliańska          |
| Bolesławicka         |
| Brzezińska           |
| Budowlana            |
| Choszczówka          |
| Chłodnia             |
| Cygańska             |
| Czołgistów           |
| Daniszewska          |
| Fabryka Pomp         |
| Insurekcji           |
| Majerankowa          |
| Marcelin             |
| Marywilska-Las       |
| Ołówkowa             |
| PKP Falenica         |
| PKP Płudy            |
| PKP Żerań            |
| Parowozowa           |
| Pelcowizna           |
| Polnych Kwiatów      |
| Raciborska           |
| Rembielińska         |
| Rokosowska           |
| Sadkowska            |
| Smugowa              |
| Starego Dębu         |
| Zbójna Góra          |
| Zyndrama z Maszkowic |
| os.Marywilska        |
| Śpiewaków            |
|                      |
+----------------------+
```

**Additional context**

DataFusion version:

```sh
$ datafusion-cli --version
DataFusion 5.1.0
```

**My guess**

Since the Parquet files have encoded NULLs, and reading the CSV files with `datafusion-cli` gets rid of those, my best bet is on the handling of NULLs and some weird behavior when joining.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
