alamb commented on issue #1441: URL: https://github.com/apache/arrow-datafusion/issues/1441#issuecomment-996265753
I spent some time looking at the parquet data and the csv data, and it looks to me like there may be something wrong with the parquet reader.

Specifically, I just ran a `select *` query against each format, using `dump_parquet.sql`:

```sql
CREATE EXTERNAL TABLE stops_parquet STORED AS PARQUET LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/parquets/stops';
show columns from stops_parquet;
CREATE EXTERNAL TABLE trips_parquet STORED AS PARQUET LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/parquets/trips';
show columns from trips_parquet;
select * from stops_parquet order by time, trip_tid, trip_line, stop_name;
select * from trips_parquet order by tid, line, base_day;
```

and `dump_csv.sql`:

```sql
CREATE EXTERNAL TABLE stops_csv (time timestamp, trip_tid bigint, trip_line TEXT, stop_name TEXT) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/csvs/stop.csv';
show columns from stops_csv;
CREATE EXTERNAL TABLE trips_csv (tid bigint, line TEXT, base_day date) STORED AS CSV WITH HEADER ROW LOCATION '/Users/alamb/Documents/Wrong-answer-datafusion-1141/issue_data/csvs/trip.csv';
show columns from trips_csv;
select * from stops_csv order by time, trip_tid, trip_line, stop_name;
select * from trips_csv order by tid, line, base_day;
```

Like this:

```shell
~/Software/arrow-datafusion/target/debug/datafusion-cli -f dump_csv.sql > dump_csv.txt
~/Software/arrow-datafusion/target/debug/datafusion-cli -f dump_parquet.sql > dump_parquet.txt
```

The results are here:
[dump_csv.txt](https://github.com/apache/arrow-datafusion/files/7730874/dump_csv.txt)
[dump_parquet.txt](https://github.com/apache/arrow-datafusion/files/7730875/dump_parquet.txt)

And a quick visual diff shows they aren't the same.

The first few lines of `dump_csv.txt` look like:

```
+---------------------+----------+-----------+------------------------------+
| time                | trip_tid | trip_line | stop_name                    |
+---------------------+----------+-----------+------------------------------+
| 2021-11-15 05:00:00 | 54761677 | N64       |                              |
| 2021-11-15 05:00:00 | 54778942 | 204       |                              |
| 2021-11-15 05:00:00 | 54788307 | 186       | RONDO ZESŁAŃCÓW SYBERYJSKICH |
| 2021-11-15 05:00:00 | 54788967 | N41       |                              |
| 2021-11-15 05:00:00 | 54788988 | N41       |                              |
| 2021-11-15 05:00:00 | 54802937 | 104       |                              |
```

While the first few lines of `dump_parquet.txt` look like:

```
+---------------------+----------+-----------+------------------------------+
| time                | trip_tid | trip_line | stop_name                    |
+---------------------+----------+-----------+------------------------------+
| 2021-11-15 00:00:00 | 54761677 | N64       |                              |
| 2021-11-15 00:00:00 | 54778942 | 204       |                              |
| 2021-11-15 00:00:00 | 54788307 | 186       | Armatnia                     |
| 2021-11-15 00:00:00 | 54788967 | N41       |                              |
| 2021-11-15 00:00:00 | 54788988 | N41       |                              |
| 2021-11-15 00:00:00 | 54802937 | 104       |                              |
```

(note that the stop name is different)
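(Aside: rather than eyeballing the two dumps, the same comparison can be made mechanically. The sketch below is not part of the session above and I have not run this exact snippet; it only assumes that the `dump_csv.txt` and `dump_parquet.txt` files linked above are in the current directory.)

```python
import difflib

# Print a unified diff of the two dump files instead of comparing by eye.
with open('dump_csv.txt') as f:
    csv_lines = f.readlines()
with open('dump_parquet.txt') as f:
    parquet_lines = f.readlines()

diff = difflib.unified_diff(
    csv_lines, parquet_lines,
    fromfile='dump_csv.txt', tofile='dump_parquet.txt',
    n=0,  # no context lines, just the rows that differ
)
print(''.join(diff), end='')
```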
However, when I look for that mismatched line `trip_tid=54788307` in the data using pandas, it does match the csv.

Here is the raw data in csv:

```shell
$ grep 54788307 issue_data/csvs/stop.csv
2021-11-15 00:00:00,54788307,186,RONDO ZESŁAŃCÓW SYBERYJSKICH
```

Here is what comes out when using pandas:

```python
Python 3.8.12 (default, Oct 13 2021, 06:42:42)
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_parquet('issue_data/parquets/stops/2021-11.parquet')
>>> df.to_csv('/tmp/2021-11.csv')
>>>
```

```shell
$ grep 54788307 /tmp/2021-11.csv
16591,2021-11-15 00:00:00,54788307.0,186,RONDO ZESŁAŃCÓW SYBERYJSKICH
```
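(For what it's worth, pandas typically delegates to pyarrow's C++ parquet reader, while `datafusion-cli` uses the Rust parquet implementation, so reading the file directly with pyarrow is another way to confirm what the file itself contains. This is only a sketch that I have not run as-is; it assumes the same file path and column name shown above.)

```python
import pyarrow.parquet as pq

# Read the file with pyarrow's own reader (no DataFusion involved) and
# pull out the row whose stop_name differed in the dumps above.
table = pq.read_table('issue_data/parquets/stops/2021-11.parquet')
df = table.to_pandas()
print(df[df['trip_tid'] == 54788307])
```

If pyarrow also returns `RONDO ZESŁAŃCÓW SYBERYJSKICH` for that row, that would further suggest the discrepancy is in the DataFusion/Rust parquet read path rather than in the file itself.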
