alamb commented on issue #1441: URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-996265753
I spent some time looking at the parquet data and the csv data, and it looks to me like there may be something wrong with the parquet reader.

Specifically, I just ran a `select *` query against each format, using `dump_parquet.sql`:

```sql
CREATE EXTERNAL TABLE stops_parquet STORED AS PARQUET LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/parquets/stops';
show columns from stops_parquet;
CREATE EXTERNAL TABLE trips_parquet STORED AS PARQUET LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/parquets/trips';
show columns from trips_parquet;
select * from stops_parquet order by time, trip_tid, trip_line, stop_name;
select * from trips_parquet order by tid, line, base_day;
```

and `dump_csv.sql`:

```sql
CREATE EXTERNAL TABLE stops_csv (time timestamp, trip_tid bigint, trip_line TEXT, stop_name TEXT) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/csvs/stop.csv';
show columns from stops_csv;
CREATE EXTERNAL TABLE trips_csv (tid bigint, line TEXT, base_day date) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/csvs/trip.csv';
show columns from trips_csv;
select * from stops_csv order by time, trip_tid, trip_line, stop_name;
select * from trips_csv order by tid, line, base_day;
```

Like this:

```shell
~/Software/arrow-datafusion/target/debug/datafusion-cli -f dump_csv.sql > dump_csv.txt
~/Software/arrow-datafusion/target/debug/datafusion-cli -f dump_parquet.sql > dump_parquet.txt
```

The results are here:
[dump_csv.txt](https://github.com/apache/arrow-datafusion/files/7730874/dump_csv.txt)
[dump_parquet.txt](https://github.com/apache/arrow-datafusion/files/7730875/dump_parquet.txt)

And a quick visual diff shows they aren't the same.

The first few lines of `dump_csv.txt` look like:

```
+---------------------+----------+-----------+------------------------------+
| time                | trip_tid | trip_line | stop_name                    |
+---------------------+----------+-----------+------------------------------+
| 2021-11-15 05:00:00 | 54761677 | N64       |                              |
| 2021-11-15 05:00:00 | 54778942 | 204       |                              |
| 2021-11-15 05:00:00 | 54788307 | 186       | RONDO ZESŁAŃCÓW SYBERYJSKICH |
| 2021-11-15 05:00:00 | 54788967 | N41       |                              |
| 2021-11-15 05:00:00 | 54788988 | N41       |                              |
| 2021-11-15 05:00:00 | 54802937 | 104       |                              |
```

While the first few lines of `dump_parquet.txt` look like:

```
+---------------------+----------+-----------+------------------------------+
| time                | trip_tid | trip_line | stop_name                    |
+---------------------+----------+-----------+------------------------------+
| 2021-11-15 00:00:00 | 54761677 | N64       |                              |
| 2021-11-15 00:00:00 | 54778942 | 204       |                              |
| 2021-11-15 00:00:00 | 54788307 | 186       | Armatnia                     |
| 2021-11-15 00:00:00 | 54788967 | N41       |                              |
| 2021-11-15 00:00:00 | 54788988 | N41       |                              |
| 2021-11-15 00:00:00 | 54802937 | 104       |                              |
```

(note that the stop name is different)
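(Aside: rather than eyeballing the two dumps, the same comparison can be made mechanically. The sketch below is not part of the session above and I have not run this exact snippet; it only assumes that the `dump_csv.txt` and `dump_parquet.txt` files linked above are in the current directory.)

```python
import difflib

# Print a unified diff of the two dump files instead of comparing by eye.
with open('dump_csv.txt') as f:
    csv_lines = f.readlines()
with open('dump_parquet.txt') as f:
    parquet_lines = f.readlines()

diff = difflib.unified_diff(
    csv_lines, parquet_lines,
    fromfile='dump_csv.txt', tofile='dump_parquet.txt',
    n=0,  # no context lines, just the rows that differ
)
print(''.join(diff), end='')
```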
However, when I look for that mismatched line `trip_tid=54788307` in the data using pandas, it does match the csv.

Here is the raw data in csv:

```shell
$ grep 54788307 issue_data/csvs/stop.csv
2021-11-15 00:00:00,54788307,186,RONDO ZESŁAŃCÓW SYBERYJSKICH
```

Here is what comes out when using pandas:

```python
Python 3.8.12 (default, Oct 13 2021, 06:42:42)
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_parquet('issue_data/parquets/stops/2021-11.parquet')
>>> df.to_csv('/tmp/2021-11.csv')
>>>
```

```shell
$ grep 54788307 /tmp/2021-11.csv
16591,2021-11-15 00:00:00,54788307.0,186,RONDO ZESŁAŃCÓW SYBERYJSKICH
```
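(For what it's worth, pandas typically delegates to pyarrow's C++ parquet reader, while `datafusion-cli` uses the Rust parquet implementation, so reading the file directly with pyarrow is another way to confirm what the file itself contains. This is only a sketch that I have not run as-is; it assumes the same file path and column name shown above.)

```python
import pyarrow.parquet as pq

# Read the file with pyarrow's own reader (no DataFusion involved) and
# pull out the row whose stop_name differed in the dumps above.
table = pq.read_table('issue_data/parquets/stops/2021-11.parquet')
df = table.to_pandas()
print(df[df['trip_tid'] == 54788307])
```

If pyarrow also returns `RONDO ZESŁAŃCÓW SYBERYJSKICH` for that row, that would further suggest the discrepancy is in the DataFusion/Rust parquet read path rather than in the file itself.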
