codejoyan edited a comment on issue #2620: URL: https://github.com/apache/hudi/issues/2620#issuecomment-800542037
Apologies for the delay, @nsivabalan. Below are the answers to the questions you asked:

- What constitutes your record key?
  - _The record key is random within a partition. It combines store number (integer), trip number (string), and bill item number (short). However, 2 of the 3 columns are numeric and sortable. Will pre-sorting help for the use case described below?_
- Does your ingestion batch contain just inserts, or updates as well?
  - _It consists of 90% inserts and 10% updates._
- If updates, do they touch the latest partitions or spread across all partitions equally?
  - _They mostly touch the latest partition._

**A few additional questions:**

**Use case:**
- The use case is to track user visits to different stores for making purchases.
- The dataset is partitioned by region and visit date.
- The record key is generated with ComplexKeyGenerator from a combination of store number (integer), trip number (string), and bill item number (short).
- The pre-combine key is a timestamp column.

Based on the above scenario, what do you suggest?
1. What other partition strategy or record key strategy might be used to take advantage of the bloom filter?
2. There are 2 jobs that take time. Are both related to index lookup time, or is something else also contributing to the increased load time?
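For reference, the setup described above could be expressed roughly as the writer options below. This is only a sketch: the column names, the table name, and the constant names are hypothetical placeholders, and the `upsert` operation and `BLOOM` index type are assumptions inferred from the insert/update mix and the bloom filter question, not settings copied from the actual job. The option keys and the key generator class name are the standard Hudi ones.

```python
# Hypothetical column names standing in for the real schema.
RECORD_KEY_FIELDS = ["store_nbr", "trip_nbr", "bill_item_nbr"]  # integer, string, short
PARTITION_FIELDS = ["region", "visit_date"]
PRECOMBINE_FIELD = "update_ts"  # the timestamp column used as pre-combine key

hudi_options = {
    # Hypothetical table name.
    "hoodie.table.name": "store_visits",
    # Assumed: batches mix inserts (~90%) and updates (~10%), so upsert.
    "hoodie.datasource.write.operation": "upsert",
    # Composite record key built from several columns.
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.recordkey.field": ",".join(RECORD_KEY_FIELDS),
    "hoodie.datasource.write.partitionpath.field": ",".join(PARTITION_FIELDS),
    "hoodie.datasource.write.precombine.field": PRECOMBINE_FIELD,
    # Assumed: the bloom index is the one in use.
    "hoodie.index.type": "BLOOM",
}

# With a Spark DataFrame `df` and a target path `base_path` (both not defined
# here), these options would typically be passed as:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

Keeping the field lists in separate constants makes it easier to try a different record key or partition layout without touching the rest of the options.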
