[GitHub] [arrow-datafusion] andygrove opened a new issue #1648: Cannot query parquet files generated by Apache Spark from datafusion-cli

GitBox Sat, 22 Jan 2022 16:29:11 -0800


andygrove opened a new issue #1648:
URL: https://github.com/apache/arrow-datafusion/issues/1648



   **Describe the bug**
   
   I have a data set created by Apache Spark and I tried to query it from the 
DataFusion CLI. It failed, saying that a parquet file was corrupt.
   
   ```
   ❯ CREATE EXTERNAL TABLE store_sales STORED AS PARQUET LOCATION 'store_sales.dat';
   0 rows in set. Query took 0.002 seconds.
   ❯ select count(*) from store_sales;
   Parquet reader thread terminated due to error: ParquetError(General("Invalid Parquet file. Corrupt footer"))
   ```
   
   I added some debug logging and found that it was actually trying to read the 
following file, which is not a Parquet file.
   
   ```
   store_sales.dat/.part-00005-5142b177-bacb-499d-b14f-12de4b94d9d9-c000.snappy.parquet.crc
   ```
   
   **To Reproduce**
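   Create a file that is not a Parquet file and has a non-Parquet extension 
(for example `.crc`), and put it in a directory along with some valid Parquet 
files. Then register that directory as an external table and query it.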
   
   **Expected behavior**
   DataFusion should only try to read files with the `.parquet` file extension.
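
A minimal sketch of the filtering described above, using only the Rust standard library. This is not DataFusion's actual file-listing code; `has_extension` and `list_files_with_extension` are hypothetical helper names chosen for illustration, and the sketch assumes the table location is a local directory that is scanned non-recursively.

```rust
use std::ffi::OsStr;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns true if the final extension of `path` equals `ext`
/// (given without the leading dot, e.g. "parquet").
/// Hypothetical helper, not part of DataFusion.
fn has_extension(path: &Path, ext: &str) -> bool {
    path.extension().and_then(OsStr::to_str) == Some(ext)
}

/// Lists the regular files directly under `dir` whose extension is `ext`.
/// Files such as Spark's `.part-*.snappy.parquet.crc` checksum files are
/// skipped because their final extension is `crc`, not `parquet`.
/// Hypothetical helper, not part of DataFusion.
fn list_files_with_extension(dir: &Path, ext: &str) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() && has_extension(&path, ext) {
            files.push(path);
        }
    }
    // Sort so the result does not depend on directory iteration order.
    files.sort();
    Ok(files)
}
```

With a filter like this, `list_files_with_extension(Path::new("store_sales.dat"), "parquet")` would return only the `part-*.parquet` data files, so the `.crc` file would never reach the Parquet reader.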
   
   **Additional context**
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
