bvaradar commented on issue #1913: URL: https://github.com/apache/hudi/issues/1913#issuecomment-670015540
@luffyd: I spent some time trying to understand your use case. To answer your question: Hudi needs to list partitions in order to figure out which valid files make up the latest snapshot. It looks like your use case writes to a lot of partitions, so Hudi has to list all of them to perform the write.

I checked the code and I don't think the leak is coming from Hudi. Can you check the parquet version being used in your runtime, as @Ares-W suggested?

On a different note, regarding the looping: are you writing the same data to Hudi again and again? If not, have you considered Spark Structured Streaming? I do see occasional compactions. With the latest master, we have added async compaction support for Structured Streaming.
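To illustrate why a write that touches many partitions ends up listing all of them, here is a toy Python model. It is only a sketch, not Hudi's actual code: the in-memory `listing` dict stands in for filesystem listings, the function name `latest_snapshot` is hypothetical, and the file names assume a simplified `<fileId>_<writeToken>_<instantTime>.parquet` layout.

```python
def latest_snapshot(listing):
    """Pick the newest file per file id in every partition.

    `listing` maps a partition path to the file names found in it
    (a hypothetical stand-in for one filesystem listing per partition).
    """
    snapshot = {}
    for partition, files in listing.items():  # one listing per partition touched
        newest = {}
        for name in files:
            # Assumed simplified layout: <fileId>_<writeToken>_<instantTime>.parquet
            file_id, _write_token, rest = name.split("_", 2)
            instant = rest.split(".", 1)[0]
            # Fixed-width timestamps compare correctly as strings.
            if file_id not in newest or instant > newest[file_id][0]:
                newest[file_id] = (instant, name)
        snapshot[partition] = sorted(name for _, name in newest.values())
    return snapshot


# Made-up example data: two partitions, one file id rewritten once.
listing = {
    "2020/08/01": [
        "f1_0-1-1_20200801100000.parquet",
        "f1_0-2-2_20200801110000.parquet",
        "f2_0-1-1_20200801100000.parquet",
    ],
    "2020/08/02": ["f3_0-3-3_20200802090000.parquet"],
}
snapshot = latest_snapshot(listing)
```

The outer loop runs once per partition, so in this model the work grows with the number of partitions a write touches, regardless of how few files change in each.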
