Re: [I] inner join involving hive-partitioned parquet dataset and filters on LHS and RHS causes panic [arrow-datafusion]

via GitHub Thu, 04 Apr 2024 12:32:31 -0700


jwimberl commented on issue #9797:
URL: 
https://github.com/apache/arrow-datafusion/issues/9797#issuecomment-2038057125


   Perhaps the issue was with the syntax for the partitioned parquet file 
location(s). I did not change the wildcard based syntax I had been using 
before, shown at the top of this issue. However, I see in the docs now that the 
location only need be the parent directory of the top-level partition:
   
   
https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clause
   
   Using a version of that example NYC taxi data (I actually just downloaded 
one file), I did verify that the wildcard based location doesn't produce an 
error during table creation but does not work. The error is slightly different 
-- the table is completely empty -- but maybe it's the same issue
   
   ```
   ## $ wget 
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet
   --2024-04-04 18:38:48--  
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet
   Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 
13.32.192.124, 13.32.192.116, 13.32.192.190, ...
   Connecting to d37ci6vzurychx.cloudfront.net 
(d37ci6vzurychx.cloudfront.net)|13.32.192.124|:443... connected.
   HTTP request sent, awaiting response... 200 OK
   Length: 49961641 (48M) [binary/octet-stream]
   Saving to: ‘yellow_tripdata_2024-01.parquet’
   
   yellow_tripdata_2024-01.parquet           
100%[=====================================================================================>]
  47.65M  79.9MB/s    in 0.6s
   
   2024-04-04 18:38:48 (79.9 MB/s) - ‘yellow_tripdata_2024-01.parquet’ saved 
[49961641/49961641]
   
   ## $ mkdir -p year=2024/month=01
   ## $ mv yellow_tripdata_2024-01.parquet year\=2024/month\=01/tripdata.parquet
   ## $ python3
   Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 
12.3.0] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import datafusion as df
   >>> df.__version__
   '36.0.0'
   >>> ctx = df.SessionContext()
   >>> ctx.sql("""
   ... CREATE EXTERNAL TABLE taxi
   ... STORED AS PARQUET
   ... PARTITIONED BY (year, month)
   ... LOCATION '/path/to/nyctaxi';
   ... """)
   DataFrame()
   ++
   ++
   >>> ctx.sql("SELECT COUNT(*) FROM taxi;")
   DataFrame()
   +----------+
   | COUNT(*) |
   +----------+
   | 2964624  |
   +----------+
   >>> ctx.sql("""
   ... CREATE EXTERNAL TABLE taxi2
   ... STORED AS PARQUET
   ... PARTITIONED BY (year, month)
   ... LOCATION '/path/to/nyctaxi/*/*/tripdata.parquet';
   ... """)
   DataFrame()
   ++
   ++
   >>> ctx.sql("SELECT COUNT(*) FROM taxi2;")
   DataFrame()
   +----------+
   | COUNT(*) |
   +----------+
   | 0        |
   +----------+
   >>>
   ```
   
   I don't know if the wildcard syntax I used was what this DDL documentation 
page used to show or if I got the syntax wrong from the get-go, and it just 
happened to be an unsupported format that worked up until now? Naturally I'm 
attempting the query again with my original dataset, but since its a rather 
large amount of data re-recreating the external table will take some time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] inner join involving hive-partitioned parquet dataset and filters on LHS and RHS causes panic [arrow-datafusion]

Reply via email to