AdkinsHan opened a new issue, #6868:
URL: https://github.com/apache/seatunnel/issues/6868

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.
   
   
   ### What happened
   
   When I use Spark local mode to read a local csv file into a Hive table, every row is written 3 times; the same job produces the correct count only in yarn cluster mode. I used SeaTunnel 1.5 before and my migration jobs all ran in local mode, but when I tested version 2.3.5 the data was tripled.
   summary:
    --master **local**   --deploy-mode **client**  3 times
    --master **yarn**  --deploy-mode **client**  3 times
    --master **yarn** --deploy-mode **cluster**  correct
   My csv file has 2076 rows, but `select count(1) from xx` returns 3*2076.
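   To rule out the csv file itself, the expected row count can be checked the way the connector should count it: total lines minus the skipped header. A minimal sketch with a throwaway file (the real path from the config is machine-specific):

   ```shell
   # Sketch: count data rows as total lines minus the header row
   # (matching skip_header_row_number = 1 in the config). The file here
   # is a stand-in for the real csv.
   tmp=$(mktemp)
   printf 'sku,sku_group,pb,series,pn,mater_n\n' > "$tmp"   # header row
   printf 'a,b,c,d,e,f\n' >> "$tmp"
   printf 'g,h,i,j,k,l\n' >> "$tmp"
   printf 'm,n,o,p,q,r\n' >> "$tmp"
   rows=$(( $(wc -l < "$tmp") - 1 ))
   echo "expected rows: $rows"    # for the real file this should print 2076
   rm -f "$tmp"
   ```

   Whatever this prints for the real file is the count the Hive table should show; in my case it is 2076 but the table shows 3*2076.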
   
   ### SeaTunnel Version
   
   2.3.5
   
   ### SeaTunnel Config
   
   ```conf
   env {
     # seatunnel defined streaming batch duration in seconds
     execution.parallelism = 4
     job.mode = "BATCH"
     spark.executor.instances = 4
     spark.executor.cores = 4
     spark.executor.memory = "4g"
     spark.sql.catalogImplementation = "hive"
     spark.hadoop.hive.exec.dynamic.partition = "true"
     spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict"
   }
   
    source {
      LocalFile {
        schema {
          fields {
            sku = string
            sku_group = string
            pb = string
            series = string
            pn = string
            mater_n = string
          }
        }
        path = "/data/ghyworkbase/uploadfile/h019-ods_file_pjp_old_new_sku_yy.csv"
        file_format_type = "csv"
        skip_header_row_number = 1
        result_table_name = "ods_file_pjp_old_new_sku_yy_source"
      }
    }
   
    transform {
      Sql {
        source_table_name = "ods_file_pjp_old_new_sku_yy_source"
        query = "select sku, sku_group, pb, series, pn, mater_n, TO_CHAR(CURRENT_DATE(),'yyyy') as dt_year from ods_file_pjp_old_new_sku_yy_source"
        result_table_name = "ods_file_pjp_old_new_sku_yy"
      }
    }
   
   sink {
   
   #   Console {
   #      source_table_name = "ods_file_pjp_old_new_sku_yy"
   #    }
   
      Hive {
        source_table_name="ods_file_pjp_old_new_sku_yy"
        table_name = "ghydata.ods_file_pjp_old_new_sku_yy"
        metastore_uri = "thrift://"
      }
   
   }
   ```
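   Since the extra rows are exact triplicates of the originals, a possible stopgap (not a fix, and I have not verified it) would be to deduplicate in the transform, assuming the Sql transform accepts `DISTINCT`:

   ```conf
   transform {
     Sql {
       source_table_name = "ods_file_pjp_old_new_sku_yy_source"
       # DISTINCT collapses the exact triplicates; this only masks the bug
       # and would also drop rows that are legitimately duplicated in the csv.
       query = "select distinct sku, sku_group, pb, series, pn, mater_n, TO_CHAR(CURRENT_DATE(),'yyyy') as dt_year from ods_file_pjp_old_new_sku_yy_source"
       result_table_name = "ods_file_pjp_old_new_sku_yy"
     }
   }
   ```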
   
   
   ### Running Command
   
   ```shell
   sh /data/seatunnel/seatunnel-2.3.4/bin/start-seatunnel-spark-3-connector-v2.sh \
     --master local \
     --deploy-mode client \
     --queue ghydl \
     --executor-instances 4 \
     --executor-cores 4 \
     --executor-memory 4g \
     --name "h019-ods_file_pjp_old_new_sku_yy" \
     --config /2.3.5/h019-ods_file_pjp_old_new_sku_yy.conf
   ```
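   For comparison, the submission that produces the correct count per the summary above differs only in the `--master`/`--deploy-mode` flags (same config, all other values unchanged):

   ```shell
   sh /data/seatunnel/seatunnel-2.3.4/bin/start-seatunnel-spark-3-connector-v2.sh \
     --master yarn \
     --deploy-mode cluster \
     --queue ghydl \
     --executor-instances 4 \
     --executor-cores 4 \
     --executor-memory 4g \
     --name "h019-ods_file_pjp_old_new_sku_yy" \
     --config /2.3.5/h019-ods_file_pjp_old_new_sku_yy.conf
   ```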
   
   
   ### Error Exception
   
   ```log
   No error or exception is thrown; the only symptom is that every row appears 3 times in the Hive table.
   ```
   
   
   ### Zeta or Flink or Spark Version
   
   _No response_
   
   ### Java or Scala Version
   
   /usr/local/jdk/jdk1.8.0_341
   
   ### Screenshots
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   

