[I] [Bug] [Connector] duplicate reading issue with LocalFile source when a folder has the same name as a file [seatunnel]

via GitHub Thu, 08 May 2025 06:04:13 -0700


davidzollo opened a new issue, #9294:
URL: https://github.com/apache/seatunnel/issues/9294


   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22)
 and found no similar issues.
   
   
   ### What happened
   
   
   When a file is used as the source and the configured path contains a folder 
with the same name as the file (for example, both `A.csv` and an `A` folder 
exist under `/data/`), the file `A.csv` is read twice.
   
   ### Problem Description
   
   When using the `LocalFile` connector as a data source, if the specified path 
contains both a file and a folder with the same name, the file will be read 
twice by the SeaTunnel engine, resulting in duplicated data.
   
   For example, if both a file `A.csv` and a directory `A` exist under the path 
`/data/`, then the file `A.csv` will be read twice, leading to duplicated data 
being written to the target.
   
   ### Steps to Reproduce
   
   1. Prepare the test environment: under the specified path (e.g., 
`/data/whale_ops/flie/`), create a CSV file named `BDC_ceshi1298.csv` and a 
directory named `BDC_ceshi1298`. 
   <img width="555" alt="Image" 
src="https://github.com/user-attachments/assets/17dfba5b-4d13-4dc3-a36f-54aabb96c1f0" />
   
[BDC_ceshi1298.csv](https://github.com/user-attachments/files/20103272/BDC_ceshi1298.csv)
   
   2. Run the SeaTunnel job with the following configuration:
   ```conf
   env {
        "job.name"="testfile"
        "job.mode"="BATCH"
   }
   source {
        LocalFile {
               file_filter_pattern = "BDC_ceshi1298.csv" 
               file_format_type = "CSV" 
               data_save_mode  = "CSV" 
               delimiter   = ","
               read_columns  = ["n1_b_ak", "n2_bak", "amm_bak", "remarks"] 
               schema {
                   columns=[
                       {
                           name="n1_b_ak"
                           type=string
                           "nullable"=false
                       },
                       {
                           name="n2_bak"
                           type=string
                           "nullable"=false
                       },
                       {
                           name="amm_bak"
                           type=string
                           "nullable"=false
                       },
                       {
                           name=remarks
                           type=string
                           "nullable"=false
                       }
                       ]
                   }
               path="/data/whale_ops/flie/"
               "skip_header_row_number"="1"
               encoding= "UTF-8" 
        }
   }
   sink { 
     Jdbc { 
         driver = "org.postgresql.Driver" 
         url = "jdbc:postgresql://xxx:5432/qa_sink" 
         user = "postgres" 
         password = "postgres" 
         generate_sink_sql = "true" 
         enable_upsert  = "true" 
         is_primary_key_updated  = "false" 
         schema_save_mode  = "CREATE_SCHEMA_WHEN_NOT_EXIST" 
         database  = "qa_sink" 
         table="public.testtable" 
         data_save_mode = "APPEND_DATA" 
     } 
   }
   ```
   
   3. Check the target table in the target database: the records from 
`BDC_ceshi1298.csv` have been written twice. The CSV has 36 records, but the 
`testtable` table contains 72.
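
For anyone without the attachment, the layout from step 1 can be recreated with a short script. This is only a sketch: the helper name `make_repro_layout` and the row values are placeholders made up for illustration, and only the file name, directory name, column names, and record count come from this report.

```python
import csv
import tempfile
from pathlib import Path

# Column names taken from read_columns in the job configuration above.
COLUMNS = ["n1_b_ak", "n2_bak", "amm_bak", "remarks"]


def make_repro_layout(base: Path, stem: str = "BDC_ceshi1298", rows: int = 36) -> Path:
    """Create <base>/<stem>.csv (header + `rows` records) and a sibling directory <base>/<stem>."""
    base.mkdir(parents=True, exist_ok=True)
    csv_path = base / f"{stem}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(COLUMNS)  # header row, skipped by skip_header_row_number = 1
        for i in range(rows):
            # Placeholder values; the real attachment's contents are not reproduced here.
            writer.writerow([f"a{i}", f"b{i}", f"c{i}", f"row {i}"])
    # The same-named directory that appears to trigger the duplicate read.
    (base / stem).mkdir(exist_ok=True)
    return csv_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = make_repro_layout(Path(tmp) / "flie")
        print(path, path.with_suffix("").is_dir())
```

Point `path` in the `LocalFile` block at the generated directory to run the job against this layout.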
   
   
   
   ### Expected Behavior
   
   Each file should be read only once.
   
   ### Actual Behavior
   
   The file is read twice, resulting in duplicate records in the target table. 
The issue is likely due to a flaw in the file path resolution logic of the 
`LocalFile` source, where both the file and the same-named directory cause the 
file to be matched and read multiple times.
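
To make the expected behavior concrete, here is a minimal sketch of a file selection that yields each matching file exactly once. It is not SeaTunnel's implementation. It assumes the filter pattern is a regular expression applied to the file name, and the function name `select_files` is hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Set


def select_files(root: Path, pattern: str) -> List[Path]:
    """Return each regular file under `root` whose name matches `pattern`, at most once."""
    regex = re.compile(pattern)
    seen: Set[Path] = set()
    selected: List[Path] = []
    for candidate in sorted(root.rglob("*")):
        if not candidate.is_file():
            continue  # directories, including the same-named folder, are never read as data
        if not regex.fullmatch(candidate.name):
            continue  # assumption: the pattern is matched against the file name only
        resolved = candidate.resolve()
        if resolved not in seen:  # guard against the same file being listed twice
            seen.add(resolved)
            selected.append(resolved)
    return selected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "BDC_ceshi1298.csv").write_text("header\nrow\n", encoding="utf-8")
        (root / "BDC_ceshi1298").mkdir()
        print(len(select_files(root, "BDC_ceshi1298.csv")))
```

Comparing the number of selected files against the number of splits the job actually reads may help narrow down where the second read comes from.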
   
   
   ### SeaTunnel Version
   
   2.3.10
   
   ### SeaTunnel Config
   
   ```conf
   env {
        "job.name"="testfile"
        "job.mode"="BATCH"
   }
   source {
        LocalFile {
               file_filter_pattern = "BDC_ceshi1298.csv" 
               file_format_type = "CSV" 
               data_save_mode  = "CSV" 
               delimiter   = ","
               read_columns  = ["n1_b_ak", "n2_bak", "amm_bak", "remarks"] 
               schema {
                   columns=[
                       {
                           name="n1_b_ak"
                           type=string
                           "nullable"=false
                       },
                       {
                           name="n2_bak"
                           type=string
                           "nullable"=false
                       },
                       {
                           name="amm_bak"
                           type=string
                           "nullable"=false
                       },
                       {
                           name=remarks
                           type=string
                           "nullable"=false
                       }
                       ]
                   }
               path="/data/whale_ops/flie/"
               "skip_header_row_number"="1"
               encoding= "UTF-8" 
        }
   }
   sink { 
     Jdbc { 
         driver = "org.postgresql.Driver" 
         url = "jdbc:postgresql://xxx:5432/qa_sink" 
         user = "postgres" 
         password = "postgres" 
         generate_sink_sql = "true" 
         enable_upsert  = "true" 
         is_primary_key_updated  = "false" 
         schema_save_mode  = "CREATE_SCHEMA_WHEN_NOT_EXIST" 
         database  = "qa_sink" 
         table="public.testtable" 
         data_save_mode = "APPEND_DATA" 
     } 
   }
   ```
   
   ### Running Command
   
   ```shell
   ./seatunnel.sh --config ../config/test-csv-to-pg.conf  -m local
   ```
   
   ### Error Exception
   
   ```log
   The csv has 36 records, but there are 72 records in the `testtable` table.
   ```
   
   ### Zeta or Flink or Spark Version
   
   zeta 2.3.10
   
   ### Java or Scala Version
   
   jdk8
   
   ### Screenshots
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


