franeklubi opened a new issue #1441:
URL: https://github.com/apache/arrow-datafusion/issues/1441


   **Describe the bug**
   I came upon a bug while querying my custom Parquet dataset, which causes 
DataFusion to produce incoherent and incorrect results.
   
   I tested my dataset in various ways, all of which produced the desired 
results:
   - reading parquet files using python pandas, then merging and filtering the 
data there
   - encoding into CSV, and reading the data with DataFusion
   - creating an SQLite database using the provided CSV files, and using the 
same queries there
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. Download all the code and data I used for testing:
   
   
[issue_data.zip](https://github.com/apache/arrow-datafusion/files/7703604/issue_data.zip)
   
   Inside there are the Parquet files and CSVs with exactly the same data 
(also, there's an sqlite database created from the provided CSV files).
   
   2. Use the instructions included in `README.md` to reproduce the issue:
   
   The query, that fails when querying Parquet files with datafusion-cli:
   ```sql
   -- 1. Distinct stop names
   SELECT DISTINCT stop_name FROM stop INNER JOIN trip ON tid = trip_tid WHERE 
line = '176' ORDER BY stop_name NULLS LAST;
   ```
   Change only in `where` from `line` to `trip_line` produces the desired 
results.
   
   **Expected behavior**
   Should produce these 27 rows:
   ```
   Bartnicza
   Bazyliańska
   Bolesławicka
   Brzezińska
   Budowlana
   Choszczówka
   Chłodnia
   Daniszewska
   Fabryka Pomp
   Insurekcji
   Marcelin
   Marywilska-Las
   Ołówkowa
   PKP Płudy
   PKP Żerań
   Parowozowa
   Pelcowizna
   Polnych Kwiatów
   Raciborska
   Rembielińska
   Sadkowska
   Smugowa
   Starego Dębu
   Zyndrama z Maszkowic
   os.Marywilska
   Śpiewaków
   None
   ```
   
   Query 1 from `README.md` (mentioned above) produces this incorrect set of 33 
rows:
   ```
   +----------------------+
   | stop_name            |
   +----------------------+
   | Bartnicza            |
   | Bazyliańska          |
   | Bolesławicka         |
   | Brzezińska           |
   | Budowlana            |
   | Choszczówka          |
   | Chłodnia             |
   | Cygańska             |
   | Czołgistów           |
   | Daniszewska          |
   | Fabryka Pomp         |
   | Insurekcji           |
   | Majerankowa          |
   | Marcelin             |
   | Marywilska-Las       |
   | Ołówkowa             |
   | PKP Falenica         |
   | PKP Płudy            |
   | PKP Żerań            |
   | Parowozowa           |
   | Pelcowizna           |
   | Polnych Kwiatów      |
   | Raciborska           |
   | Rembielińska         |
   | Rokosowska           |
   | Sadkowska            |
   | Smugowa              |
   | Starego Dębu         |
   | Zbójna Góra          |
   | Zyndrama z Maszkowic |
   | os.Marywilska        |
   | Śpiewaków            |
   |                      |
   +----------------------+
   ```
   
   **Additional context**
   Datafusion version:
   ```sh
   $ datafusion-cli --version
   DataFusion 5.1.0
   ```
   
   **My guess**
   Since the Parquet files have encoded NULLs, and reading the CSV files with 
`datafusion-cli` gets rid of those, my best bet is on the usage of NULLs and 
some weir behavior when joining.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to