jainpriyansh786 opened a new issue #3714:
URL: https://github.com/apache/hudi/issues/3714
**Describe the problem you faced**
The DataFrame created by the Spark read method with the
`hoodie.datasource.read.paths` option contains duplicate records when the
same path is passed multiple times in the option value.
This happens because Spark re-reads the same paths and appends the data to
the DataFrame. Before reading the data, the read method should either keep
only the unique paths or error out when duplicate paths are given in
`hoodie.datasource.read.paths`.
Strictly speaking, this can be handled at the user-input level (the user is
responsible for passing unique paths only).
However, it would be good to add a validation so that only unique paths are read.
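As a caller-side workaround until such a validation exists, the option value can be deduplicated before it is handed to the reader. A minimal sketch (the helper name `unique_read_paths` is hypothetical, not a Hudi or Spark API):

```python
def unique_read_paths(paths_option: str) -> str:
    """Deduplicate a comma-separated hoodie.datasource.read.paths value,
    preserving the first-seen order of the paths."""
    paths = [p.strip() for p in paths_option.split(",") if p.strip()]
    # dict.fromkeys keeps insertion order (Python 3.7+), so the first
    # occurrence of each path wins and later duplicates are dropped.
    return ",".join(dict.fromkeys(paths))

# The duplicated partition path from this report collapses to one path:
opt = "s3a://test-temp-bucket/unit=123,s3a://test-temp-bucket/unit=123"
print(unique_read_paths(opt))  # s3a://test-temp-bucket/unit=123
```

The cleaned string can then be passed to `.option('hoodie.datasource.read.paths', ...)` as usual.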
Steps to reproduce the behavior:

```python
hudi_partition_path = 's3a://test-temp-bucket/unit=123,s3a://test-temp-bucket/unit=123'
df = spark.read.format("hudi").option('hoodie.datasource.read.paths', hudi_partition_path).load()
```

This produces duplicate records in `df` because the read method does not
deduplicate the paths.
**Expected behavior**
When duplicate paths are provided, the read method should keep only the
unique paths and then read the records into the DataFrame.
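If failing fast is preferred over silently deduplicating, the validation suggested above could look roughly like this on the caller side (the function name `validate_read_paths` is hypothetical, not a Hudi API):

```python
def validate_read_paths(paths_option: str) -> str:
    """Raise if a comma-separated hoodie.datasource.read.paths value
    contains the same path more than once; otherwise return it unchanged."""
    paths = [p.strip() for p in paths_option.split(",") if p.strip()]
    seen, dupes = set(), []
    for p in paths:
        if p in seen:
            dupes.append(p)  # collect every repeated path for the error message
        seen.add(p)
    if dupes:
        raise ValueError(
            f"Duplicate entries in hoodie.datasource.read.paths: {dupes}")
    return paths_option
```

Calling this before `spark.read...load()` would surface the duplicated `s3a://test-temp-bucket/unit=123` path from the reproduction immediately instead of producing duplicate rows.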
**Environment Description**
* Hudi version : 0.8.0
* Storage (HDFS/S3/GCS..) : s3