jtmzheng commented on issue #7829:
URL: https://github.com/apache/hudi/issues/7829#issuecomment-1439179802

   We have a two-stage pipeline:
   
   1. Snapshot of MySQL table (as parquet files)
   2. Convert to a Hudi table (i.e. read in parquet, write out as a Hudi table)
   
   # of rows: 154982072 
   - this is the total number of rows in the input
   
   # of duplicate rows with different record keys: 813263 
   - so in the example I gave:
   ```
   Row 1: id = 3, monotonically_increasing_id = 1
   Row 2: id = 2, monotonically_increasing_id = 2
   Row 3: id = 3, monotonically_increasing_id = 8589934593
   ```
   Row 1 should have `id = 1`, but its contents were overwritten by Row 3's, 
all except `monotonically_increasing_id`, which stayed at `1` instead of 
`8589934593` (i.e. Row 1 is a duplicate of Row 3 except that it keeps the 
record key of Row 1). 813263 is how many such "duplicates" we found in our 
input.
   
   It was easy to confirm this by comparing against the parquet input.
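
A minimal sketch of how such a count and comparison could be done, not necessarily the exact check that was run. It assumes `id` is the unique primary key of the MySQL table and `monotonically_increasing_id` is the record key; the helper names `count_duplicates` and `missing_ids` and the in-memory row dictionaries are hypothetical, standing in for the real parquet input and Hudi output.

```python
from collections import defaultdict

# Hypothetical rows mirroring the three-row example above.
rows = [
    {"id": 3, "monotonically_increasing_id": 1},
    {"id": 2, "monotonically_increasing_id": 2},
    {"id": 3, "monotonically_increasing_id": 8589934593},
]


def count_duplicates(rows, pk="id", record_key="monotonically_increasing_id"):
    """Count extra rows that share a primary key but have distinct record keys."""
    keys_by_pk = defaultdict(set)
    for row in rows:
        keys_by_pk[row[pk]].add(row[record_key])
    # Assumption: each primary key should map to exactly one record key, so
    # every additional record key counts as one duplicate.
    return sum(len(keys) - 1 for keys in keys_by_pk.values())


def missing_ids(expected_ids, rows, pk="id"):
    """Return expected primary keys that do not appear in the given rows."""
    return sorted(set(expected_ids) - {row[pk] for row in rows})


duplicates = count_duplicates(rows)
lost = missing_ids([1, 2, 3], rows)
```

On the example rows, `id = 3` appears under two record keys, so one duplicate is counted, and `id = 1` shows up as missing when checked against the expected ids `[1, 2, 3]`.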

