jainpriyansh786 opened a new issue #3714:
URL: https://github.com/apache/hudi/issues/3714


   **Describe the problem you faced**
   
   The dataframe created by the Spark read method with the 
`hoodie.datasource.read.paths` option contains duplicate records when the same 
path is passed more than once.
   
   This happens because Spark re-reads each repeated path and appends its data 
to the dataframe. Before reading data from the paths, the read method should 
either deduplicate them or error out when duplicate paths are given in the 
`hoodie.datasource.read.paths` option.
   
   Strictly speaking, this can be handled on the user side (the user should be 
responsible for passing unique paths only).
   However, it would be good to add a validation so that only unique paths are read.
   
   Steps to reproduce the behavior:
   
   ```python
   hudi_partition_path = 's3a://test-temp-bucket/unit=123,s3a://test-temp-bucket/unit=123'
   df = spark.read.format("hudi").option('hoodie.datasource.read.paths', hudi_partition_path).load()
   ```
   
   This produces duplicate records in the df because the read method does not 
deduplicate the paths.
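   Until such a validation exists, the duplication can be avoided on the user side by deduplicating the comma-separated path list before passing it to the reader. The sketch below is a workaround under that assumption, not Hudi's internal behavior; the bucket and path are the hypothetical ones from the reproduction above.
   
   ```python
   # Workaround sketch: drop duplicate paths before handing the option to Spark.
   # The paths here are the example values from the reproduction, not real data.
   raw_paths = "s3a://test-temp-bucket/unit=123,s3a://test-temp-bucket/unit=123"
   
   # dict.fromkeys preserves first-seen order while removing duplicates.
   unique_paths = ",".join(dict.fromkeys(p.strip() for p in raw_paths.split(",")))
   print(unique_paths)  # s3a://test-temp-bucket/unit=123
   
   # Then read with the cleaned option value, e.g.:
   # df = spark.read.format("hudi").option("hoodie.datasource.read.paths", unique_paths).load()
   ```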
   
   **Expected behavior**
   
   When duplicate paths are provided, the read method should first find the 
unique paths and only then append/create records in the dataframe.
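   The suggested validation could alternatively fail fast instead of silently deduplicating. The helper below is a hypothetical sketch of that behavior, not actual Hudi code; the function name and error message are illustrative only.
   
   ```python
   # Hypothetical validation sketch: reject a hoodie.datasource.read.paths
   # value that contains the same path more than once.
   def validate_read_paths(paths_option: str) -> list:
       """Split the comma-separated option value and raise if any path repeats."""
       paths = [p.strip() for p in paths_option.split(",") if p.strip()]
       seen = set()
       # seen.add(p) returns None, so p is flagged only when already present.
       dupes = [p for p in paths if p in seen or seen.add(p)]
       if dupes:
           raise ValueError(f"Duplicate read paths: {sorted(set(dupes))}")
       return paths
   ```
   
   A reader integration could call this before building the file index, turning silent record duplication into an explicit configuration error.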
   
   **Environment Description**
   
   * Hudi version : 0.8.0
   
   * Storage (HDFS/S3/GCS..) : s3
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.