hudi-bot opened a new issue, #15515: URL: https://github.com/apache/hudi/issues/15515
Originally reported by a user: https://github.com/apache/hudi/issues/6137

The crux of the issue is that the Databricks Runtime (DBR) diverges from OSS Spark; in this case, the `FileStatusCache` API clearly differs between the two.

There are a few approaches we can take:

1. Avoid relying on Spark's `FileStatusCache` implementation altogether and use our own.
2. Take a staggered approach: first try to use Spark's `FileStatusCache` and, if it doesn't match the expected API, fall back to our own implementation.

Approach 1 would mean that we no longer share the cache implementation with Spark, which in turn means that in some cases we might keep two instances of the same cache. Approach 2 avoids that and lets us fall back only when the API is incompatible.

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-5092
- Type: Bug
- Affects version(s):
  - 0.12.0
- Fix version(s):
  - 1.1.0
- Attachment(s):
  - 24/Jan/23 20:02;guoyihua;image (1).png;https://issues.apache.org/jira/secure/attachment/13054793/image+%281%29.png
  - 24/Jan/23 20:02;guoyihua;image.png;https://issues.apache.org/jira/secure/attachment/13054792/image.png

---

## Comments

24/Jan/23 20:02;guoyihua;After HUDI-5104, with `hoodie.file.index.enable=false`, a Spark datasource read with the base path still does not work according to the user, but it works with glob paths (Hudi 0.12.2 and Databricks 11.3 (Spark 3.3)).
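
For illustration, the staggered approach (approach 2 in the description) could look roughly like the sketch below. It is only a sketch under stated assumptions: `ListingCache`, `OwnListingCache`, `hasExpectedMethod`, `selectCache` and the class name `example.engine.FileStatusCache` are all hypothetical names introduced here, not Hudi or Spark APIs. A real check would probe Spark's `FileStatusCache` for whichever signature Hudi expects, which this issue does not spell out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class FileStatusCacheFallback {

    /** Hypothetical minimal cache contract; not the actual Spark or Hudi interface. */
    interface ListingCache {
        Optional<List<String>> getLeafFiles(String dir);
        void putLeafFiles(String dir, List<String> files);
    }

    /** Our own fallback implementation: a plain in-memory map. */
    static final class OwnListingCache implements ListingCache {
        private final Map<String, List<String>> entries = new ConcurrentHashMap<>();

        @Override
        public Optional<List<String>> getLeafFiles(String dir) {
            return Optional.ofNullable(entries.get(dir));
        }

        @Override
        public void putLeafFiles(String dir, List<String> files) {
            // Store a defensive, read-only copy of the listing.
            entries.put(dir, Collections.unmodifiableList(new ArrayList<>(files)));
        }
    }

    /**
     * Returns true only if the class can be loaded and exposes a public method
     * with exactly the given name and parameter types.
     */
    static boolean hasExpectedMethod(String className, String methodName, Class<?>... paramTypes) {
        try {
            // initialize=false: probe the class without running its static initializers.
            Class<?> clazz = Class.forName(
                    className, false, FileStatusCacheFallback.class.getClassLoader());
            clazz.getMethod(methodName, paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    /** Uses the engine-backed cache when the API matches; otherwise falls back to our own. */
    static ListingCache selectCache(String engineClass, String method, Class<?>[] paramTypes,
                                    Supplier<ListingCache> engineAdapter) {
        return hasExpectedMethod(engineClass, method, paramTypes)
                ? engineAdapter.get()
                : new OwnListingCache();
    }

    public static void main(String[] args) {
        ListingCache cache = selectCache(
                "example.engine.FileStatusCache", // hypothetical class, absent from the classpath
                "putLeafFiles",
                new Class<?>[] {String.class, List.class},
                () -> { throw new IllegalStateException("engine adapter should not be built"); });

        cache.putLeafFiles("/tbl/p=1", Arrays.asList("f1.parquet", "f2.parquet"));
        System.out.println(cache.getClass().getSimpleName());
        System.out.println(cache.getLeafFiles("/tbl/p=1").orElse(Collections.emptyList()));
    }
}
```

The engine adapter is passed as a `Supplier` so that it is only constructed after the probe succeeds. That way, code which links against the possibly incompatible API is never touched on runtimes where the signature differs.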
