raj638111 opened a new issue #3473:
URL: https://github.com/apache/hudi/issues/3473
**Describe the problem you faced**
When I query the Hudi table from spark-shell (reading it as plain `parquet`),
I get the following warning:
```
spark.read.parquet("s3://bucket1/huditable1").where("date = '20210101' and hour = '01' and field1 = 'somevalue'")
WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `date`, `hour`
```
On closer inspection, I found that the parquet file itself also contains the
same fields (i.e. `date` and `hour`):
```
println(spark.read.parquet("s3://bucket1/huditable1/date=20210101/hour=01/file1.parquet").schema.treeString)
|-- field1: string (nullable = true)
|-- field2: string (nullable = true)
|-- date: string (nullable = true)
|-- hour: string (nullable = true)
```
Is there a way to remove the duplicate `date` and `hour` fields from the
parquet file?
It seems that during ingestion Hudi also writes the partition fields into the
parquet file.
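One thing that may be worth trying is reading the table through the Hudi Spark datasource instead of as raw parquet. The sketch below only builds the path for such a read. It assumes that Hudi 0.8.0 expects a glob covering every partition level plus the file level, which should be checked against the Hudi docs for this version. The helper name `hudiGlobPath` is hypothetical and not part of any Hudi API; the two partition levels (`date`, `hour`) are taken from the layout shown above. Whether this avoids the warning is not verified here.

```scala
// Hypothetical helper (not a Hudi API): builds a glob path for a table
// with `partitionLevels` nested partition directories under `basePath`.
def hudiGlobPath(basePath: String, partitionLevels: Int): String = {
  require(partitionLevels >= 0, "partitionLevels must be non-negative")
  // One wildcard segment per partition directory, plus one for the data files.
  basePath.stripSuffix("/") + ("/*" * (partitionLevels + 1))
}

// The date=.../hour=... layout has two partition levels.
val globPath = hudiGlobPath("s3://bucket1/huditable1", 2)
```

In a spark-shell started with the Hudi bundle on the classpath, the read would then look like `spark.read.format("hudi").load(globPath)`, followed by the same `where` clause as before.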
**Environment Description**
* Hudi version : 0.8.0
* Spark version : 3.1.1
* Hive version : _
* Hadoop version : _
* Storage (HDFS/S3/GCS..) : s3
* Running on Docker? (yes/no) : no
* EMR: emr-6.3.0