codejoyan edited a comment on issue #2620:
URL: https://github.com/apache/hudi/issues/2620#issuecomment-800542037


   Apologies for the delay, @nsivabalan.
   Below are the answers to the questions you asked:
   - What constitutes your record key? - _The record key is random within a 
partition: a combination of store number (integer), trip number (string), and 
bill item number (short). However, 2 of the 3 columns are numeric and sortable. 
Would pre-sorting help for the use case described below?_
   - Does your ingestion batch contain just inserts or updates as well? - _It 
consists of 90% inserts and 10% updates._
   - If updates, does it touch latest partitions or spread across all 
partitions equally. - _It touches mostly the latest partition._
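
To make the pre-sorting question concrete, here is a toy sketch of min/max key-range pruning, the idea behind checking a file's key range before consulting its bloom filter. It is not Hudi code: the integer keys, file size, and helper names are assumptions made for illustration, and the real record key here is a composite that includes a string column, so the actual benefit may differ.

```python
import random


def build_file_ranges(keys, keys_per_file):
    """Split keys into files in the given order; record each file's (min, max) key."""
    ranges = []
    for start in range(0, len(keys), keys_per_file):
        chunk = keys[start:start + keys_per_file]
        ranges.append((min(chunk), max(chunk)))
    return ranges


def candidate_files(file_ranges, key):
    """Return indices of files whose key range could contain the lookup key."""
    return [i for i, (lo, hi) in enumerate(file_ranges) if lo <= key <= hi]


random.seed(0)
keys = list(range(10_000))  # hypothetical numeric record keys
shuffled = keys[:]
random.shuffle(shuffled)

# Keys written in sorted order give disjoint per-file ranges;
# keys written in random order give ranges that overlap heavily.
sorted_ranges = build_file_ranges(keys, 1_000)
random_ranges = build_file_ranges(shuffled, 1_000)

lookup = 4_242
sorted_hits = candidate_files(sorted_ranges, lookup)
random_hits = candidate_files(random_ranges, lookup)

print("files to check (sorted layout):", len(sorted_hits))
print("files to check (random layout):", len(random_hits))
```

In this toy setup the sorted layout narrows an incoming key to a single candidate file, while the random layout leaves most files as candidates, so each of them would still need a bloom filter check.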
   
   **A few additional questions:**
   **Use Case:**
   
   - The use-case is to track the user visits to different stores for making 
purchases. 
   - The dataset is partitioned by region and visit date. 
   - The record key is generated with ComplexKeyGenerator from a combination of 
store number (integer), trip number (string), and bill item number (short).
   - The pre-combine key is a timestamp column.
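
For reference, the setup above might translate into Hudi datasource write options roughly as follows. This is a sketch, not the actual job configuration: the column names (`store_nbr`, `trip_nbr`, `bill_item_nbr`, `region`, `visit_date`, `update_ts`) and the table name are hypothetical placeholders, and the `upsert` operation is assumed from the insert/update mix described above.

```python
# Hypothetical column names standing in for the real schema.
RECORD_KEY_COLUMNS = ["store_nbr", "trip_nbr", "bill_item_nbr"]
PARTITION_COLUMNS = ["region", "visit_date"]
PRECOMBINE_COLUMN = "update_ts"


def hudi_write_options(table_name):
    """Build Hudi datasource write options for a composite key and two-level partitioning."""
    return {
        "hoodie.table.name": table_name,
        # Assumed: upsert, since the batches mix inserts and updates.
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.ComplexKeyGenerator",
        # ComplexKeyGenerator takes comma-separated field lists.
        "hoodie.datasource.write.recordkey.field": ",".join(RECORD_KEY_COLUMNS),
        "hoodie.datasource.write.partitionpath.field": ",".join(PARTITION_COLUMNS),
        "hoodie.datasource.write.precombine.field": PRECOMBINE_COLUMN,
    }


options = hudi_write_options("store_visits")  # hypothetical table name
for key, value in options.items():
    print(f"{key}={value}")

# With a Spark DataFrame `df` and a target `base_path`, the write would look like:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```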
   
   Based on the above scenario, could you advise on the following:
   1. What other partitioning or record key strategy could be used to take 
advantage of the bloom filter?
   2. There are 2 jobs that take a long time. Are both related to index lookup, 
or is something else also contributing to the increased load time?
   
   

