nsivabalan commented on issue #3418:
URL: https://github.com/apache/hudi/issues/3418#issuecomment-894861016


   Sorry, if you use NonPartitionedKeyGen, Hudi assigns an empty string ("") as
the partition path for all records. Hence you will find duplicates if you are
checking for the pair (record key, partition path).
   I would recommend using SimpleKeyGen (it is the default key generator, so
you don't need to set one).
   Let me know how it goes.
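
   As a rough sketch of what that setup could look like from PySpark, the
options below pair a record key field with a partition path field. The table
name and the `uuid` and `ts` columns are hypothetical placeholders, `datestr`
is taken from the discussion here, and the key generator entry is shown only
for clarity since SimpleKeyGen is the default.

```python
# Hudi write options as a plain dict; no Spark session is needed to build it.
hudi_options = {
    "hoodie.table.name": "my_table",                           # hypothetical table name
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "uuid",         # hypothetical key column
    "hoodie.datasource.write.partitionpath.field": "datestr",  # partition column
    "hoodie.datasource.write.precombine.field": "ts",          # hypothetical ordering column
    # Optional, since this is the default key generator:
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.SimpleKeyGenerator",
}

# Assuming a Spark DataFrame `df` and a target path `base_path` already exist:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```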
   
   Here are some thoughts on choosing a partition path.
   Many companies run a lot of queries that filter on datestr (select a, b, c
from table where datestr >= x and datestr <= y).
   When such a query hits Hudi and the table has 2000 partitions, but the
datestr range in the query covers only the past 7 days, Hudi looks only into
the partitions for those 7 days.
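
   To make the pruning effect concrete, here is a toy model in plain Python.
It is not Hudi code: it assumes one partition per day, named by an ISO date
string, and simply counts how many partitions fall inside a 7-day range. The
start date and helper names are made up for illustration.

```python
from datetime import date, timedelta


def build_partitions(start, days):
    """Return one partition name (ISO date string) per day."""
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


def prune(partitions, lo, hi):
    """Keep only partitions whose datestr lies in [lo, hi].

    ISO date strings sort in chronological order, so string comparison works.
    """
    return [p for p in partitions if lo <= p <= hi]


partitions = build_partitions(date(2016, 1, 1), 2000)  # hypothetical start date
lo, hi = partitions[-7], partitions[-1]                # the most recent 7 days
selected = prune(partitions, lo, hi)

print(len(partitions), len(selected))  # prints "2000 7"
```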
   Also, during upsert, Hudi performs an index lookup to determine whether a
record is an update or a new insert. With a partitioned dataset, the search
space for a record (record key, partition path) is bounded to its partition.
Without partitioning, Hudi has to search the entire dataset for every record
key, so you could see higher write latencies as well.
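
   The same idea for the upsert path, again as a toy model and not Hudi's
actual index implementation: the table is a dict from partition path to a set
of record keys, and the hypothetical `candidate_keys` helper counts how many
existing keys a single lookup would have to consider.

```python
def candidate_keys(table, partition_path=None):
    """Count existing record keys that one lookup must consider.

    With a partition path, only that partition is searched; without one,
    every partition in the table is searched.
    """
    if partition_path is not None:
        return len(table.get(partition_path, ()))
    return sum(len(keys) for keys in table.values())


# Seven daily partitions with 1000 made-up record keys each.
table = {
    "2021-08-0%d" % day: {"key-%d-%d" % (day, i) for i in range(1000)}
    for day in range(1, 8)
}

bounded = candidate_keys(table, "2021-08-07")  # search one partition
unbounded = candidate_keys(table)              # search the whole table

print(bounded, unbounded)  # prints "1000 7000"
```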
   
   Hope this gives you an idea of how partitioning helps keep both your write
and read latencies lower.



