Rap70r commented on issue #2586: URL: https://github.com/apache/hudi/issues/2586#issuecomment-785941366
Hi @nsivabalan, thank you for your reply.

* Incremental updates include both inserts and updates, mostly updates.
* We can try increasing the retained version count to a higher value to improve reader times.
* We would prefer to stick with COPY_ON_WRITE for now.

I was wondering whether we should look into table caching in Spark: https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-cache-cache-table.html This would cache the entire table to disk/memory, and readers would work against that. The only downside I can think of is space usage. Are there any other disadvantages to using cache and persist?

We're also looking into improving reader speed in combination with increasing the retained version count. When reading a Hudi dataset on S3, does the number of partitions affect reader speed? For example, if the table is partitioned into 200 folders versus 1000 folders (by choosing different partition columns), would that affect read speed when querying the table with a snapshot query: https://hudi.apache.org/docs/querying_data.html#spark-snap-query

Thank you

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
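For context on the caching question above, the linked Spark SQL syntax allows caching a query result over the table. A minimal sketch might look like the following (the table name `hudi_trips` is hypothetical, and the cache must be refreshed manually after upserts or it will serve stale data):

```sql
-- Cache the snapshot view so repeated queries read from
-- memory/disk instead of re-listing and re-reading S3 files.
CACHE TABLE hudi_trips_cached
OPTIONS ('storageLevel' 'MEMORY_AND_DISK')
AS SELECT * FROM hudi_trips;

-- Subsequent reads are served from the cache:
SELECT count(*) FROM hudi_trips_cached;

-- Drop the cache when the underlying Hudi table changes,
-- otherwise queries may return stale results.
UNCACHE TABLE hudi_trips_cached;
```

Note that this trades S3 read latency for executor memory/disk, which matches the space concern mentioned above.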
