vinothchandar commented on issue #2586:
URL: https://github.com/apache/hudi/issues/2586#issuecomment-789154180


   >So that might not solve the issue entirely for long running queries. Is 
there a different approach we could look into? Like any caching mechanism?
   
   I wonder if this issue can be mitigated in your code by simply issuing a 
`df.cache()`? That way, recomputation of the dataframe is not triggered even 
if the cleaning policy on the writer side deletes some older files. I am 
fairly confident this would work, though of course it comes at the cost of 
additional memory and storage. 
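   To illustrate the reasoning (a minimal plain-Python sketch, not actual Spark code): an uncached dataframe is lazily recomputed from its source files on every action, so a cleaner deleting an older file mid-query breaks the recomputation; materializing the result once decouples the reader from those files.

   ```python
   class LazyFrame:
       """Toy stand-in for a DataFrame: recomputes from `files` on every read."""

       def __init__(self, files):
           self.files = files        # dict: filename -> rows (shared with the "cleaner")
           self._cached = None

       def cache(self):
           # Materialize once; later reads no longer touch the source files.
           self._cached = [row for rows in self.files.values() for row in rows]
           return self

       def collect(self):
           if self._cached is not None:
               return list(self._cached)
           # Uncached path: recompute from whatever files still exist.
           return [row for rows in self.files.values() for row in rows]


   files = {"f1": [1, 2], "f2": [3]}
   df = LazyFrame(files).cache()
   del files["f1"]                   # cleaner removes an older file version
   print(df.collect())               # cached result is still complete: [1, 2, 3]
   ```

   The trade-off is exactly the one mentioned above: the materialized copy costs memory/storage for the lifetime of the query.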
   
   > we're looking into improving reader's speed with combination of increasing 
retention version value.
   
   the metadata table we added in 0.7.0 should help alleviate concerns around 
listing larger partitions. That said, we have only added support for 
Hive/SparkSQL-on-Hive so far; support for the Spark datasource is in 
progress. 
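   For reference, enabling the metadata table is a writer-side config (the key below is from the 0.7.0 docs; please verify it against the configuration reference for your version):

   ```properties
   hoodie.metadata.enable=true
   ```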
   
   > if the table is partitioned into 200 folders or 1000 folders, by choosing 
different columns, 

   In general, the more folders, the smaller each file, so there will be some 
degradation (Hudi or not). W.r.t. partitions, I think it boils down to how S3 
rate-limits per prefix; more prefixes may actually help increase parallelism. 
   
   In all: you want to do fast incremental updates with a long retention 
window, at least a few hours (so long-running jobs can finish), but your 
problem is that query perf degrades if you, say, set cleaner retention to 
cover the last 10 hours? 
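   If it helps frame the tuning, the retention side of that trade-off is typically controlled by the writer's cleaner configs, roughly like the sketch below (key names from the Hudi configuration docs; the retained-commit count is an illustrative value, not a recommendation):

   ```properties
   hoodie.cleaner.policy=KEEP_LATEST_COMMITS
   # Keep enough commits to span the longest-running reader.
   hoodie.cleaner.commits.retained=20
   ```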
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

