vinothchandar commented on issue #2586: URL: https://github.com/apache/hudi/issues/2586#issuecomment-789154180
> So that might not solve the issue entirely for long running queries. Is there a different approach we could look into? Like any caching mechanism?

I wonder if this issue can be mitigated in your code by simply issuing a `df.cache()`? That way the recomputation of the DataFrame is not triggered, even if the cleaning policy on the writer side deletes some older files. I am fairly confident that it would work, but of course it comes at the cost of additional memory and storage.

> we're looking into improving reader's speed with combination of increasing retention version value.

The metadata table we added in 0.7.0 should help alleviate concerns around listing larger partitions. Although we have added support for Hive/SparkSQL-on-Hive only for now; we are working on support for the Spark datasource.

> if the table is partitioned into 200 folders or 1000 folders, by choosing different columns,

In general, the more folders, the smaller each file, so there will be some degradation (Hudi or not). With respect to partitions, I think it boils down to how S3 rate-limits per prefix; more prefixes may actually help increase parallelism.

In all, you want to do fast incremental updates with a long retention, at least a few hours (so long-running jobs can finish), but your problem is that query performance degrades if you, say, keep cleaner retention for the last 10 hours?
