[
https://issues.apache.org/jira/browse/HUDI-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053732#comment-17053732
]
Prashant Wason commented on HUDI-667:
-------------------------------------
The fix is to maintain a minKey and maxKey within the HoodieTestDataGenerator
class.
# To find random records, we generate a random integer in the range [minKey, maxKey] and verify that this index actually exists in the HashMap
# To insert new records, we always insert at maxKey and increment maxKey
# We update minKey / maxKey during deletions (if required)
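A minimal sketch of this bookkeeping scheme (the class and method names here are illustrative stand-ins, not the actual HoodieTestDataGenerator code, and String values stand in for KeyPartition):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of the proposed minKey/maxKey fix.
public class MinMaxKeyTracker {
  private final Map<Integer, String> existingKeys = new HashMap<>();
  private final Random rand = new Random();
  private int minKey = 0;
  private int maxKey = 0; // index at which the next record is inserted

  public void insert(String keyPartition) {
    // Always insert at maxKey and increment it, so a deleted index is
    // never silently reused by a later insert.
    existingKeys.put(maxKey++, keyPartition);
  }

  public String getRandomExisting() {
    if (existingKeys.isEmpty()) {
      throw new IllegalStateException("no records to pick from");
    }
    // Pick random indexes in [minKey, maxKey) until one is still present.
    while (true) {
      int idx = minKey + rand.nextInt(maxKey - minKey);
      String kp = existingKeys.get(idx);
      if (kp != null) {
        return kp;
      }
    }
  }

  public void delete(int idx) {
    existingKeys.remove(idx);
    // Advance minKey past deleted slots; maxKey only ever grows.
    while (minKey < maxKey && !existingKeys.containsKey(minKey)) {
      minKey++;
    }
  }

  public int size() {
    return existingKeys.size();
  }
}
```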
> HoodieTestDataGenerator does not delete keys correctly
> ------------------------------------------------------
>
> Key: HUDI-667
> URL: https://issues.apache.org/jira/browse/HUDI-667
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Reporter: Prashant Wason
> Priority: Minor
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> HoodieTestDataGenerator is used to generate sample data for unit-tests. It
> allows generating HoodieRecords for insert/update/delete. It maintains the
> record keys in a HashMap.
> private final Map<Integer, KeyPartition> existingKeys;
> There are two issues in the implementation:
> # Delete from existingKeys passes the KeyPartition value where
> HashMap.remove() expects the Integer key, so the entry is never actually
> removed
> # Inserting records after deletes is not correctly handled
> The implementation uses the Integer key so that values can be looked up
> randomly. Assume three values were inserted, then the HashMap will hold:
> 0 -> KeyPartition1
> 1 -> KeyPartition2
> 2 -> KeyPartition3
> Now if we delete KeyPartition2 (generate a random record for deletion), the
> HashMap will be:
> 0 -> KeyPartition1
> 2 -> KeyPartition3
>
> Now if we issue an insertBatch(), the insert is
> existingKeys.put(existingKeys.size(), newKeyPartition). Since
> existingKeys.size() is now 2, this overwrites the KeyPartition3 already in
> the map rather than actually inserting a new entry.
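The overwrite described above can be reproduced with a plain HashMap (String values stand in for the real KeyPartition objects):

```java
import java.util.HashMap;
import java.util.Map;

public class InsertAfterDeleteBug {
  public static void main(String[] args) {
    // String values stand in for the real KeyPartition objects.
    Map<Integer, String> existingKeys = new HashMap<>();
    existingKeys.put(0, "KeyPartition1");
    existingKeys.put(1, "KeyPartition2");
    existingKeys.put(2, "KeyPartition3");

    // Delete the randomly chosen record at index 1.
    existingKeys.remove(1);

    // The insert keys the new record at existingKeys.size(), which is 2,
    // so it replaces KeyPartition3 instead of adding a fourth entry.
    existingKeys.put(existingKeys.size(), "KeyPartition4");

    System.out.println(existingKeys.size()); // 2: one record has been lost
  }
}
```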
--
This message was sent by Atlassian Jira
(v8.3.4#803005)