[GitHub] [hudi] raj638111 opened a new issue #3473: Duplicate field: Partition field also available in parquet file

GitBox Fri, 13 Aug 2021 13:59:42 -0700


raj638111 opened a new issue #3473:
URL: https://github.com/apache/hudi/issues/3473



   **Describe the problem you faced**
   
   When I query the Hudi table from spark-shell (reading it in `parquet` 
format), I get the following warning:
   ```
   spark.read.parquet("s3://bucket1/huditable1").where("date = '20210101' and hour = '01' and field1 = 'somevalue'")
   WARN DataSource: Found duplicate column(s) in the data schema and the
     partition schema: `date`, `hour`
   ```
   On closer inspection, I found that the parquet files themselves also 
contain the same fields (i.e. `date` and `hour`):
   ```
   println(spark.read.parquet("s3://bucket1/huditable1/date=20210101/hour=01/file1.parquet").schema.treeString)
    |-- field1: string (nullable = true)
    |-- field2: string (nullable = true)
    |-- date: string (nullable = true)
    |-- hour: string (nullable = true)
   ```
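   To make the overlap concrete, here is a minimal sketch in plain Scala (no 
Spark required) that compares the column names implied by the `key=value` 
directory segments with the fields stored in the file. The object 
`DuplicateColumnSketch` and its helpers `partitionColumns` and 
`duplicateColumns` are hypothetical names used only for illustration; this 
mimics Hive-style path parsing and is not how Spark or Hudi implement the 
check internally.

   ```scala
   // Hypothetical helpers, for illustration only (not part of Spark or Hudi).
   object DuplicateColumnSketch {

     // Collect column names from Hive-style `key=value` path segments.
     def partitionColumns(path: String): Seq[String] =
       path.split("/").toSeq
         .filter(_.contains("="))
         .map(_.takeWhile(_ != '='))

     // Columns that appear both in the file's own schema and in the directory layout.
     def duplicateColumns(dataFields: Seq[String], path: String): Seq[String] =
       partitionColumns(path).filter(c => dataFields.contains(c))

     def main(args: Array[String]): Unit = {
       val path = "s3://bucket1/huditable1/date=20210101/hour=01/file1.parquet"
       // Field names taken from the schema printed above.
       val dataFields = Seq("field1", "field2", "date", "hour")
       println(duplicateColumns(dataFields, path).mkString(", ")) // prints "date, hour"
     }
   }
   ```

   The two names it reports are the same ones listed in the Spark warning: 
each appears once in the file schema and once in the partition path.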
   Is there a way to remove the duplicate fields `date` and `hour` from the 
parquet files? 
   It seems that during ingestion Hudi also writes the partition fields into 
the parquet files themselves.
   
   **Environment Description**
   
   * Hudi version : 0.8.0
   
   * Spark version : 3.1.1
   
   * Hive version : _
   
   * Hadoop version : _
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : no
   
   * EMR: emr-6.3.0 
   

