davidzollo opened a new issue, #9294: URL: https://github.com/apache/seatunnel/issues/9294
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.

### What happened

When a file is used as the source and the source path also contains a folder with the same name as the file (for example, both `A.csv` and a folder `A` under `/data/`), the file `A.csv` is read twice.

### Problem Description

When using the `LocalFile` connector as a data source, if the specified path contains both a file and a folder with the same name, the SeaTunnel engine reads the file twice. For example, if both a file `A.csv` and a directory `A` exist under `/data/`, the contents of `A.csv` are written to the target twice.

### Steps to Reproduce

1. Prepare the test environment: under the specified path (e.g., `/data/whale_ops/flie/`), create a CSV file named `BDC_ceshi1298.csv` and a directory named `BDC_ceshi1298`.

   <img width="555" alt="Image" src="https://github.com/user-attachments/assets/17dfba5b-4d13-4dc3-a36f-54aabb96c1f0" />

   [BDC_ceshi1298.csv](https://github.com/user-attachments/files/20103272/BDC_ceshi1298.csv)

2. Run the SeaTunnel job with the following configuration:

   ```
   env {
     "job.name"="testfile"
     "job.mode"="BATCH"
   }

   source {
     LocalFile {
       file_filter_pattern = "BDC_ceshi1298.csv"
       file_format_type = "CSV"
       data_save_mode = "CSV"
       delimiter = ","
       read_columns = ["n1_b_ak", "n2_bak", "amm_bak", "remarks"]
       schema {
         columns=[
           {
             name="n1_b_ak"
             type=string
             "nullable"=false
           },
           {
             name="n2_bak"
             type=string
             "nullable"=false
           },
           {
             name="amm_bak"
             type=string
             "nullable"=false
           },
           {
             name=remarks
             type=string
             "nullable"=false
           }
         ]
       }
       path="/data/whale_ops/flie/"
       "skip_header_row_number"="1"
       encoding= "UTF-8"
     }
   }

   sink {
     Jdbc {
       driver = "org.postgresql.Driver"
       url = "jdbc:postgresql://xxx:5432/qa_sink"
       user = "postgres"
       password = "postgres"
       generate_sink_sql = "true"
       enable_upsert = "true"
       is_primary_key_updated = "false"
       schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
       database = "qa_sink"
       table="public.testtable"
       data_save_mode = "APPEND_DATA"
     }
   }
   ```

3. Check the target table in the target database. The records from `BDC_ceshi1298.csv` have been written twice: the CSV has 36 records, but the `testtable` table contains 72.

### Expected Behavior

Each file should be read only once.

### Actual Behavior

The file is read twice, resulting in duplicate records in the target table. The issue is likely due to a flaw in the file path resolution logic of the `LocalFile` source: when a file and a same-named directory both exist, the file is matched and read more than once.
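The file layout from step 1 of the reproduction can be scripted. The following is a minimal sketch, not part of the original report: it uses a scratch directory from `mktemp` in place of `/data/whale_ops/flie/`, and the CSV rows are made-up placeholder values (only the column names and the count of 36 data rows come from the report). The variable names `base`, `name`, `file_count`, and `row_count` are illustrative.

```shell
#!/bin/sh
# Scratch directory standing in for /data/whale_ops/flie/ (assumption: any
# local directory can be used as the LocalFile `path`).
base="$(mktemp -d)"
name="BDC_ceshi1298"

# Placeholder CSV: one header row plus 36 data rows, using the column names
# from the job config. The row values are invented, not the attached file.
{
  echo "n1_b_ak,n2_bak,amm_bak,remarks"
  i=1
  while [ "$i" -le 36 ]; do
    echo "n1_$i,n2_$i,amm_$i,row $i"
    i=$((i + 1))
  done
} > "$base/$name.csv"

# The directory whose name matches the file's base name.
mkdir "$base/$name"

# Sanity checks on the layout: number of regular files named like the CSV,
# and number of data rows with the header skipped.
file_count="$(find "$base" -type f -name "$name.csv" | wc -l | tr -d ' ')"
row_count="$(tail -n +2 "$base/$name.csv" | wc -l | tr -d ' ')"

echo "path=$base files=$file_count rows=$row_count"
```

With `path` in the job config pointed at the printed directory, a run that reads each file once would be expected to load 36 rows into the target table, compared with the 72 observed in this report.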
### SeaTunnel Version

2.3.10

### SeaTunnel Config

```conf
env {
  "job.name"="testfile"
  "job.mode"="BATCH"
}

source {
  LocalFile {
    file_filter_pattern = "BDC_ceshi1298.csv"
    file_format_type = "CSV"
    data_save_mode = "CSV"
    delimiter = ","
    read_columns = ["n1_b_ak", "n2_bak", "amm_bak", "remarks"]
    schema {
      columns=[
        {
          name="n1_b_ak"
          type=string
          "nullable"=false
        },
        {
          name="n2_bak"
          type=string
          "nullable"=false
        },
        {
          name="amm_bak"
          type=string
          "nullable"=false
        },
        {
          name=remarks
          type=string
          "nullable"=false
        }
      ]
    }
    path="/data/whale_ops/flie/"
    "skip_header_row_number"="1"
    encoding= "UTF-8"
  }
}

sink {
  Jdbc {
    driver = "org.postgresql.Driver"
    url = "jdbc:postgresql://xxx:5432/qa_sink"
    user = "postgres"
    password = "postgres"
    generate_sink_sql = "true"
    enable_upsert = "true"
    is_primary_key_updated = "false"
    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
    database = "qa_sink"
    table="public.testtable"
    data_save_mode = "APPEND_DATA"
  }
}
```

### Running Command

```shell
./seatunnel.sh --config ../config/test-csv-to-pg.conf -m local
```

### Error Exception

```log
The csv has 36 records, but there are 72 records in the `testtable` table.
```

### Zeta or Flink or Spark Version

Zeta 2.3.10

### Java or Scala Version

JDK 8

### Screenshots

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
