Rap70r commented on issue #2586:
URL: https://github.com/apache/hudi/issues/2586#issuecomment-785941366


   Hi nsivabalan,
   
   Thank you for your reply.
   
   * Incremental updates include both inserts and updates. Mostly updates.
   * We can try increasing the retention version count to a higher value to improve reader times.
   * We would prefer sticking with COPY_ON_WRITE for now.
   
   I was wondering if we should look into table caching in Spark: 
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-cache-cache-table.html
   
   This would cache the entire table in memory/disk, and readers would work against the cache. The only downside I can think of is the extra space it takes. Are there any other disadvantages to using cache/persist?
   
   Also, we're looking into improving readers' speed in combination with increasing the retention version value. When reading a Hudi dataset stored on S3, does the number of partitions affect reader speed? For example, if the table is partitioned into 200 folders versus 1000 folders (by choosing different partition columns), would that affect the speed of reading the table with a snapshot query: https://hudi.apache.org/docs/querying_data.html#spark-snap-query
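   For reference, this is the kind of snapshot read I mean (again a sketch; the path, partition column, and filter value are hypothetical). My understanding is that without a filter on the partition column, the reader has to list every partition folder on S3, so a higher partition count would mean more listing overhead:

   ```scala
   // Sketch: snapshot query on a partitioned Hudi table (hypothetical path).
   // "snapshot" is the default query type; set explicitly here for clarity.
   val snapshotDF = spark.read
     .format("hudi")
     .option("hoodie.datasource.query.type", "snapshot")
     .load("s3://my-bucket/hudi/my_table")

   // A predicate on the partition column lets Spark prune folders,
   // so only the matching partitions are listed and read from S3.
   snapshotDF
     .filter(col("partition_col") === "2021-02-25")  // hypothetical column/value
     .show()
   ```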
   
   Thank you


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
