kk17 opened a new issue, #5861:
URL: https://github.com/apache/hudi/issues/5861
**Describe the problem you faced**
After upgrading Hudi from 0.8 to 0.11, reading a Hudi table with `spark.table(fullTableName)` no longer works. The table has been synced to the Hive metastore and Spark is connected to that metastore. The error is:
```
org.sparkproject.guava.util.concurrent.UncheckedExecutionException: org.apache.hudi.exception.HoodieException: 'path' or 'Key: 'hoodie.datasource.read.paths' , default: null description: Comma separated list of file paths to read within a Hudi table. since version: version is not defined deprecated after: version is not defined)' or both must be specified.
  at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
  at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
  at org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.
  ...
Caused by: org.apache.hudi.exception.HoodieException: 'path' or 'Key: 'hoodie.datasource.read.paths' , default: null description: Comma separated list of file paths to read within a Hudi table. since version: version is not defined deprecated after: version is not defined)' or both must be specified.
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:78)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
  at org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:261)
  at org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
  at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
  at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
  at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
  at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
```
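For context, a minimal sketch of the failing read (assuming a Spark session with Hive support; the table name is taken from the example further below):

```scala
// Hypothetical minimal read in spark-shell. With Hudi 0.11, resolving the
// table fails as above, apparently because the metastore entry written by
// the 0.8 JDBC sync carries no 'path'/LOCATION for the Hudi datasource
// (DefaultSource.createRelation) to use.
val df = spark.table("ods.track_signup")
df.show()
```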
**To Reproduce**
Steps to reproduce the behavior:
1. Using Hudi 0.8, create a Hudi table and sync it to the Hive metastore using Hive JDBC sync mode.
2. Upgrade Hudi to 0.11.
3. Add a new column to the table and sync it to the Hive metastore using Hive JDBC sync mode.
4. Read the table using `spark.table` (see the sketch after this list).
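A rough sketch of steps 1, 3, and 4 as a Spark write with Hive sync enabled; all key fields and the JDBC URL are placeholders, and note that on 0.8 the JDBC flag was `hoodie.datasource.hive_sync.use_jdbc` rather than `hoodie.datasource.hive_sync.mode`:

```scala
import org.apache.spark.sql.SaveMode

// Step 1 (on Hudi 0.8) and step 3 (on Hudi 0.11, with one extra column in df):
// write the table and let the datasource sync it to the Hive metastore via JDBC.
df.write.format("hudi")
  .option("hoodie.table.name", "track_signup")
  .option("hoodie.datasource.write.recordkey.field", "id")   // placeholder key field
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.mode", "jdbc")        // on 0.8: use_jdbc=true
  .option("hoodie.datasource.hive_sync.database", "ods")
  .option("hoodie.datasource.hive_sync.table", "track_signup")
  .option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://<hive-server>:10000") // placeholder
  .option("hoodie.datasource.hive_sync.partition_fields", "dt")
  .mode(SaveMode.Append)
  .save("s3://xxxx/track_signup")

// Step 4: the read that then fails on 0.11.
spark.table("ods.track_signup").show()
```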
**Expected behavior**
Reading the table should succeed.
**Environment Description**
* Hudi version : 0.11
* Spark version : 3.1.2
* Hive version : 3.1.2
* Hadoop version : 3.1.2
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Additional context**
We use Hive JDBC sync mode to sync the Hudi table to the Hive metastore. Before we upgraded Hudi to 0.11, we would get an error from the `show create table` command. After upgrading to 0.11 we added one new column to the table, and the read error started after that column was added. Running `show create table` in spark-sql after the error succeeds, but the returned CREATE TABLE statement has no LOCATION. Through Hive SQL, both `show create table` and a SELECT work fine.
After I dropped the Hive table and reran Hive sync, the table reads fine again (a sketch of this workaround follows the second `show create table` output below).
`show create table` output before the Hive sync rerun:
```
spark-sql> show create table ods.track_signup;
CREATE TABLE `ods`.`track_signup` (
`_hoodie_commit_time` STRING,
`_hoodie_commit_seqno` STRING,
`_hoodie_record_key` STRING,
`_hoodie_partition_path` STRING,
`_hoodie_file_name` STRING,
`act` STRING,
`time` BIGINT,
`env` STRING,
`id` STRING,
`seer_time` STRING,
`hh` STRING,
`app_id` INT,
`ip` STRING,
`g` STRING,
`u` STRING,
`ga_id` STRING,
`app_version` STRING,
`platform` STRING,
`url` STRING,
`referer` STRING,
`medium` STRING,
`source` STRING,
`campaign` STRING,
`stage` STRING,
`content` STRING,
`term` STRING,
`lang` STRING,
`su` STRING,
`campaign_track_id` STRING,
`last_component_id` STRING,
`regSourceId` STRING,
`dt` STRING)
USING hudi
PARTITIONED BY (dt)
TBLPROPERTIES (
'bucketing_version' = '2',
'last_modified_time' = '1655107146',
'last_modified_by' = 'hive',
'last_commit_time_sync' = '20220613152622014')
```
`show create table` output after the Hive sync rerun:
```
spark-sql> show create table ods.track_signup;
CREATE TABLE `ods`.`track_signup` (
`_hoodie_commit_time` STRING,
`_hoodie_commit_seqno` STRING,
`_hoodie_record_key` STRING,
`_hoodie_partition_path` STRING,
`_hoodie_file_name` STRING,
`act` STRING COMMENT 'xxx',
`time` BIGINT COMMENT 'xxx',
`env` STRING COMMENT 'xxx',
`id` STRING COMMENT 'xxx',
`seer_time` STRING COMMENT 'xxx',
`hh` STRING,
`app_id` INT COMMENT 'xxx',
`ip` STRING COMMENT 'xxx',
`g` STRING COMMENT 'xxx',
`u` STRING COMMENT 'xxx',
`ga_id` STRING COMMENT 'xxx',
`app_version` STRING COMMENT 'xxx',
`platform` STRING COMMENT 'xxx',
`url` STRING COMMENT 'xxx',
`referer` STRING COMMENT 'xxx',
`medium` STRING COMMENT 'xxx',
`source` STRING COMMENT 'xxx',
`campaign` STRING COMMENT 'xxx',
`stage` STRING COMMENT 'xxx',
`content` STRING COMMENT 'xxx',
`term` STRING COMMENT 'xxx',
`lang` STRING COMMENT 'xxx',
`su` STRING COMMENT 'xxx',
`campaign_track_id` STRING COMMENT 'xxx',
`last_component_id` STRING COMMENT 'xxx',
`regSourceId` STRING,
`dt` STRING)
USING hudi
OPTIONS (
`hoodie.query.as.ro.table` 'false')
PARTITIONED BY (dt)
LOCATION 's3://xxxx/track_signup'
TBLPROPERTIES (
'bucketing_version' = '2',
'last_modified_time' = '1655134599',
'last_modified_by' = 'hive',
'last_commit_time_sync' = '20220613153932664')
```
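For completeness, a sketch of the workaround described above (drop the stale Hive entry, then rerun Hive sync); the table name comes from the example, everything else is a placeholder:

```scala
// Drop the stale metastore entry that lacks a LOCATION.
spark.sql("DROP TABLE IF EXISTS ods.track_signup")

// Rerun Hive sync, e.g. by issuing the next Hudi write with the
// hoodie.datasource.hive_sync.* options from the sketch above; this
// recreates the table with the LOCATION and the hoodie.query.as.ro.table
// option shown in the second `show create table` output.

// After the re-sync, the read works again.
spark.table("ods.track_signup").show()
```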