[
https://issues.apache.org/jira/browse/HUDI-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Raymond Xu updated HUDI-5609:
-----------------------------
Component/s: spark-sql
> Hudi table not queryable by SQL on Databricks Spark
> ---------------------------------------------------
>
> Key: HUDI-5609
> URL: https://issues.apache.org/jira/browse/HUDI-5609
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark-sql
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.13.1, 0.12.3
>
>
> Customer: I’ve tried this with 0.12.2 and still receive the same error. Does
> the table format version also need to be updated? That is, we’re writing with
> Hudi 0.11.1 on EMR but reading from Databricks with Hudi 0.12.2 and Spark
> 3.3.
>
> What has been tried so far on 0.12.2:
> 1.
> Spark SQL
> Just tried Spark SQL and it doesn’t work (a different issue):
> SET hoodie.file.index.enable=false;
> select count(*) from validated_sales;
> This returns a count of 0 but no errors.
> 2.
> When running via PySpark with a wildcard path:
> %python
> df = spark.read.format('hudi')\
> .load('s3://<bucket>/validated_sales/*/*/*')
> df.count()
> All is good with Hudi 0.12.2 and Databricks 11.3 (Spark 3.3).
> 3.
> Without the wildcard in PySpark:
> %python
> df = spark.read.format('hudi')\
> .load('s3://<bucket>/validated_sales')
> df.count()
> count = 0
> 4.
> Without the wildcard but with the recursive option set in PySpark:
> %python
> df = spark.read.format('hudi')\
> .option("recursiveFileLookup","true")\
> .load('s3://<bucket>/validated_sales')
> df.count()
> count = 250k
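
The three DataFrame read attempts above differ only in the load path and reader options, so they can be compared side by side. The following is a minimal sketch, not part of the original report: the names `ReadVariant`, `read_variants`, `count_all`, and `partition_depth` are hypothetical, and the partition depth of 3 is assumed from the `*/*/*` glob in the report. It uses only the standard `spark.read.format(...).option(...).load(...)` reader calls already shown above.

```python
from typing import Dict, List, NamedTuple


class ReadVariant(NamedTuple):
    """One way of pointing the Hudi DataFrame reader at a table (hypothetical helper)."""
    label: str
    path: str
    options: Dict[str, str]


def read_variants(base_path: str, partition_depth: int = 3) -> List[ReadVariant]:
    """Build the three read attempts described in the report.

    partition_depth is an assumption taken from the '*/*/*' glob used there.
    """
    base = base_path.rstrip("/")
    glob = base + "/*" * partition_depth
    return [
        ReadVariant("wildcard", glob, {}),
        ReadVariant("no wildcard", base, {}),
        ReadVariant(
            "no wildcard + recursiveFileLookup",
            base,
            {"recursiveFileLookup": "true"},
        ),
    ]


def count_all(spark, base_path: str) -> Dict[str, int]:
    """Run each variant against an existing SparkSession and collect row counts."""
    results: Dict[str, int] = {}
    for variant in read_variants(base_path):
        reader = spark.read.format("hudi")
        for key, value in variant.options.items():
            reader = reader.option(key, value)
        results[variant.label] = reader.load(variant.path).count()
    return results
```

In a notebook where `spark` is already defined, `count_all(spark, 's3://<bucket>/validated_sales')` would return one count per variant, which makes it easier to see which path form a given Hudi and Spark combination resolves.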
--
This message was sent by Atlassian Jira
(v8.20.10#820010)