andygrove opened a new issue, #4782:
URL: https://github.com/apache/arrow-datafusion/issues/4782

   **Describe the bug**
   I generated TPC-H data and converted it to Parquet using DataFusion. Here is the `nation` table:
   
   ```
   $ ls -l /tmp/tpch-parquet/nation.parquet/
   total 4
   drwxrwxr-x 2 andy andy 4096 Dec 31 09:25 part-0.parquet
   ```
   
   I can read the schema fine with [bdt](https://github.com/andygrove/bdt) (which uses DataFusion):
   
   ```
   $ bdt schema /tmp/tpch-parquet/nation.parquet
   +-------------+-----------+-------------+
   | column_name | data_type | is_nullable |
   +-------------+-----------+-------------+
   | n_nationkey | Int64     | NO          |
   | n_name      | Utf8      | NO          |
   | n_regionkey | Int64     | NO          |
   | n_comment   | Utf8      | NO          |
   +-------------+-----------+-------------+
   ```
   
   Spark fails with:
   
   ```
   val df = spark.read.parquet("/tmp/tpch-parquet/nation.parquet")
   org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
   ```
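   
   The error message says the schema must be specified manually, so one possible workaround is to pass it explicitly, which bypasses Spark's inference step. A minimal, unverified sketch, with the field types taken from the bdt output above (Int64 -> LongType, Utf8 -> StringType):
   
   ```
   import org.apache.spark.sql.types._
   
   // Schema copied from the bdt output above.
   val nationSchema = StructType(Seq(
     StructField("n_nationkey", LongType, nullable = false),
     StructField("n_name", StringType, nullable = false),
     StructField("n_regionkey", LongType, nullable = false),
     StructField("n_comment", StringType, nullable = false)
   ))
   
   // Supplying the schema explicitly skips inference entirely.
   val df = spark.read.schema(nationSchema).parquet("/tmp/tpch-parquet/nation.parquet")
   ```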
   
   However, if I ask Spark to read the single partition file directly, rather than the directory, it works, which confuses me:
   
   ```
   scala> val df = spark.read.parquet("/tmp/tpch-parquet/nation.parquet/part-0.parquet")
   df: org.apache.spark.sql.DataFrame = [n_nationkey: bigint, n_name: string ... 2 more fields]
   
   scala> df.schema
   res0: org.apache.spark.sql.types.StructType = StructType(StructField(n_nationkey,LongType,true), StructField(n_name,StringType,true), StructField(n_regionkey,LongType,true), StructField(n_comment,StringType,true))
   ```
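   
   Pointing Spark at a glob over the partition files, rather than at the bare directory, may also sidestep the directory-level inference, since it expands to the same file that works above. This is just a sketch and I have not verified it:
   
   ```
   // Hypothetical workaround: glob over the partition files instead of the directory.
   val df = spark.read.parquet("/tmp/tpch-parquet/nation.parquet/*.parquet")
   ```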
   
   **To Reproduce**
   
   1. Generate TPC-H data and convert it to Parquet with DataFusion, producing a directory such as `/tmp/tpch-parquet/nation.parquet/` containing `part-0.parquet`.
   2. In the Spark shell, run `spark.read.parquet("/tmp/tpch-parquet/nation.parquet")`.
   3. Spark raises `AnalysisException: Unable to infer schema for Parquet. It must be specified manually.`
   
   **Expected behavior**
   
   Spark should infer the schema from the directory of Parquet files, just as it does when reading `part-0.parquet` directly.
   
   **Additional context**
   

