jwimberl commented on issue #9797: URL: https://github.com/apache/arrow-datafusion/issues/9797#issuecomment-2038057125
Perhaps the issue was with the syntax for the partitioned parquet file location(s). I did not change the wildcard based syntax I had been using before, shown at the top of this issue. However, I see in the docs now that the location only need be the parent directory of the top-level partition: https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clause Using a version of that example NYC taxi data (I actually just downloaded one file), I did verify that the wildcard based location doesn't produce an error during table creation but does not work. The error is slightly different -- the table is completely empty -- but maybe it's the same issue ``` ## $ wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet --2024-04-04 18:38:48-- https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 13.32.192.124, 13.32.192.116, 13.32.192.190, ... Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|13.32.192.124|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 49961641 (48M) [binary/octet-stream] Saving to: ‘yellow_tripdata_2024-01.parquet’ yellow_tripdata_2024-01.parquet 100%[=====================================================================================>] 47.65M 79.9MB/s in 0.6s 2024-04-04 18:38:48 (79.9 MB/s) - ‘yellow_tripdata_2024-01.parquet’ saved [49961641/49961641] ## $ mkdir -p year=2024/month=01 ## $ mv yellow_tripdata_2024-01.parquet year\=2024/month\=01/tripdata.parquet ## $ python3 Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import datafusion as df >>> df.__version__ '36.0.0' >>> ctx = df.SessionContext() >>> ctx.sql(""" ... CREATE EXTERNAL TABLE taxi ... STORED AS PARQUET ... PARTITIONED BY (year, month) ... LOCATION '/path/to/nyctaxi'; ... """) DataFrame() ++ ++ >>> ctx.sql("SELECT COUNT(*) FROM taxi;") DataFrame() +----------+ | COUNT(*) | +----------+ | 2964624 | +----------+ >>> ctx.sql(""" ... CREATE EXTERNAL TABLE taxi2 ... STORED AS PARQUET ... PARTITIONED BY (year, month) ... LOCATION '/path/to/nyctaxi/*/*/tripdata.parquet'; ... """) DataFrame() ++ ++ >>> ctx.sql("SELECT COUNT(*) FROM taxi2;") DataFrame() +----------+ | COUNT(*) | +----------+ | 0 | +----------+ >>> ``` I don't know if the wildcard syntax I used was what this DDL documentation page used to show or if I got the syntax wrong from the get-go, and it just happened to be an unsupported format that worked up until now? Naturally I'm attempting the query again with my original dataset, but since its a rather large amount of data re-recreating the external table will take some time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
