yyh2954360585 opened a new issue, #9471:
URL: https://github.com/apache/hudi/issues/9471
**Describe the problem you faced**

Q1:
Assume the source table `order` has a total data volume of 5 million rows, and it is synchronized with DeltaStreamer's JdbcSource using this config:

```
--hoodie-conf hoodie.deltastreamer.jdbc.incr.pull=true
--hoodie-conf hoodie.deltastreamer.jdbc.table.incr.column.name=update_date
--source-limit 100000
--continuous
```

When DeltaStreamer has synchronized 400,000 rows, the current lastCheckpoint is `2023-08-17 14:55:00`, so the SQL that the incrementalFetch method uses to query the source data is:

```sql
select * from (select * from order where update_date > "2023-08-17 14:55:00" order by update_date limit 100000) rdbms_table
```

Now assume my `order` table contains 200,000 rows whose `update_date` is all equal to the same later value, e.g. `2023-08-17 15:55:00`. Because sourceLimit=100000, only 100,000 of those rows are fetched and the checkpoint advances to that value; the next incremental query filters with `update_date > lastCheckpoint`, so the remaining 100,000 rows with the equal timestamp are skipped and lost.
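The scenario above can be reduced to a minimal sketch (hypothetical code, not Hudi's actual implementation) showing how a strictly-greater-than checkpoint filter combined with a row limit drops rows that share the same incremental-column value:

```python
def incremental_fetch(rows, checkpoint, limit):
    """Mimic an incremental pull of the form:
    SELECT ... WHERE incr_col > checkpoint ORDER BY incr_col LIMIT limit.
    Returns (fetched_batch, new_checkpoint)."""
    batch = sorted(r for r in rows if r > checkpoint)[:limit]
    # The checkpoint advances to the max value seen in this batch.
    new_checkpoint = batch[-1] if batch else checkpoint
    return batch, new_checkpoint

# 200,000 rows all carrying the same update_date, scaled down to 200 here,
# with a source limit of 100 (timestamps are illustrative).
rows = ["2023-08-17 15:55:00"] * 200
checkpoint = "2023-08-17 14:55:00"

batch1, checkpoint = incremental_fetch(rows, checkpoint, limit=100)
batch2, checkpoint = incremental_fetch(rows, checkpoint, limit=100)

print(len(batch1))  # 100 rows fetched; checkpoint is now 2023-08-17 15:55:00
print(len(batch2))  # 0 -- the remaining 100 rows with the equal timestamp are skipped
```

The second fetch returns nothing because the `>` comparison excludes every row equal to the new checkpoint, which is exactly the lost-data case described above.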
Q2:
Why are these two parameters designed to work this way?
**Environment Description**
* Hudi version: 0.13.1
* Spark version: 3.2.1
* Hive version: 3.1.3
* Hadoop version: 3.3.3
* Storage (HDFS/S3/GCS..): HDFS
* Running on Docker? (yes/no): no
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]