wosow edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-755894869


   > It is indeed a MOR table. Can you check your driver logs? You might find some exceptions around registering the _rt table. Look for logs around the message
   > 
   > "Trying to sync hoodie table "
   
   The error is as follows. There is no SQL for creating the _rt table, only the _ro table:
   ```
   
----------------------------------------------------------------------------------------------------------------------------------------------
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 371
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 337
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 404
   21/01/07 13:23:05 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 
bigdatadev03:18850 in memory (size: 73.0 KB, free: 2.5 GB)
   21/01/07 13:23:05 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 
bigdatadev02:6815 in memory (size: 73.0 KB, free: 3.5 GB)
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 333
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 418
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 385
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 410
   21/01/07 13:23:08 INFO Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
   21/01/07 13:23:08 INFO Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
   21/01/07 13:23:10 INFO Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
   21/01/07 13:23:10 INFO Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
   21/01/07 13:23:10 INFO Query: Reading in results for query 
"org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
closing
   21/01/07 13:23:10 INFO MetaStoreDirectSql: Using direct SQL, underlying DB 
is MYSQL
   21/01/07 13:23:10 INFO ObjectStore: Initialized ObjectStore
   21/01/07 13:23:11 INFO HiveMetaStore: Added admin role in metastore
   21/01/07 13:23:11 INFO HiveMetaStore: Added public role in metastore
   21/01/07 13:23:11 INFO HiveMetaStore: No user is added in admin role, since 
config is empty
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_all_databases
   21/01/07 13:23:11 INFO audit: ugi=root       ip=unknown-ip-addr      
cmd=get_all_databases   
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_functions: db=default pat=*
   21/01/07 13:23:11 INFO audit: ugi=root       ip=unknown-ip-addr      
cmd=get_functions: db=default pat=*     
   21/01/07 13:23:11 INFO Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as 
"embedded-only" so does not have its own datastore table.
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_functions: db=dw pat=*
   21/01/07 13:23:11 INFO audit: ugi=root       ip=unknown-ip-addr      
cmd=get_functions: db=dw pat=*  
   21/01/07 13:23:11 INFO HiveSyncTool: Trying to sync hoodie table 
api_trade_ro with base path /data/stream/mor/api_trade of type MERGE_ON_READ
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_table : db=ads tbl=api_trade_ro
   21/01/07 13:23:11 INFO audit: ugi=root       ip=unknown-ip-addr      
cmd=get_table : db=ads tbl=api_trade_ro 
   21/01/07 13:23:11 INFO HoodieHiveClient: Found the last compaction commit as 
Option{val=null}
   21/01/07 13:23:11 INFO HoodieHiveClient: Found the last delta commit 
Option{val=[20210107132154__deltacommit__COMPLETED]}
   21/01/07 13:23:12 INFO HoodieHiveClient: Reading schema from 
/data/stream/mor/api_trade/dt=2021-01/350a9a01-538c-4a7e-8c17-09d2cdc85073-0_0-20-85_20210107132154.parquet
   21/01/07 13:23:12 INFO HiveSyncTool: Hive table api_trade_ro is not found. 
Creating it
   21/01/07 13:23:12 INFO HoodieHiveClient: Creating table with CREATE EXTERNAL 
TABLE  IF NOT EXISTS `ads`.`api_trade_ro`( `_hoodie_commit_time` string, 
`_hoodie_commit_seqno` string, `_hoodie_record_key` string, 
`_hoodie_partition_path` string, `_hoodie_file_name` string, `topic` string, 
`kafka_partition` string, `kafka_timestamp` string, `kafka_offset` string, 
`current_time` string, `kafka_key` string, `kafka_value` string,  `modified` 
string, `created` string, `batch_time` string) PARTITIONED BY (`dt` string) ROW 
FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
LOCATION '/data/stream/mor/api_trade'
   21/01/07 13:23:12 INFO HoodieHiveClient: Executing SQL CREATE EXTERNAL TABLE 
 IF NOT EXISTS `ads`.`api_trade_ro`( `_hoodie_commit_time` string, 
`_hoodie_commit_seqno` string, `_hoodie_record_key` string, 
`_hoodie_partition_path` string, `_hoodie_file_name` string, `topic` string, 
`kafka_partition` string, `kafka_timestamp` string, `kafka_offset` string, 
`current_time` string, `kafka_key` string, `kafka_value` string,  `modified` 
string, `created` string, `batch_time` string) PARTITIONED BY (`dt` string) ROW 
FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
LOCATION '/data/stream/mor/api_trade'
   21/01/07 13:24:09 INFO HiveSyncTool: Schema sync complete. Syncing 
partitions for api_trade_ro
   21/01/07 13:24:09 INFO HiveSyncTool: Last commit time synced was found to be 
null
   21/01/07 13:24:09 INFO HoodieHiveClient: Last commit time synced is not 
known, listing all partitions in /data/stream/mor/api_trade,FS 
:DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-398057260_1, ugi=root 
(auth:SIMPLE)]]
   21/01/07 13:24:09 INFO HiveSyncTool: Storage partitions scan complete. Found 
1
   21/01/07 13:24:09 INFO HiveMetaStore: 0: get_partitions : db=ads 
tbl=api_trade_ro
   21/01/07 13:24:09 INFO audit: ugi=root       ip=unknown-ip-addr      
cmd=get_partitions : db=ads tbl=api_trade_ro    
   21/01/07 13:24:10 INFO HiveSyncTool: New Partitions [dt=2021-01]
   21/01/07 13:24:10 INFO HoodieHiveClient: Adding partitions 1 to table 
api_trade_ro
   21/01/07 13:24:10 INFO HoodieHiveClient: Executing SQL ALTER TABLE 
`ads`.`api_trade_ro` ADD IF NOT EXISTS   PARTITION (`dt`='2021-01') LOCATION 
'/data/stream/mor/api_trade/dt=2021-01' 
   21/01/07 13:24:33 INFO HiveSyncTool: Changed Partitions []
   21/01/07 13:24:33 INFO HoodieHiveClient: No partitions to change for 
api_trade_ro
   21/01/07 13:24:33 INFO HiveMetaStore: 0: get_table : db=ads tbl=api_trade_ro
   21/01/07 13:24:33 INFO audit: ugi=root       ip=unknown-ip-addr      
cmd=get_table : db=ads tbl=api_trade_ro 
   21/01/07 13:24:33 ERROR HiveSyncTool: Got runtime exception when hive syncing
   org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last 
commit time synced to 20210107132154
        at 
org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:658)
        at 
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:128)
        at 
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:91)
        at 
org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:229)
        at 
org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:279)
        at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:184)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
        at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
        at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
        at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
        at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
        at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
        at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
        at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
        at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
        at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
        at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
        at 
com.chin.dmp.stream.mor.ApiTradeStream$$anonfun$1.apply(ApiTradeStream.scala:196)
        at 
com.chin.dmp.stream.mor.ApiTradeStream$$anonfun$1.apply(ApiTradeStream.scala:163)
        at 
org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
        at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
        at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
        at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
        at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
        at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
        at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
        at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
        at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
        at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
        at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
        at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
        at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
        at 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
        at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
        at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
        at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
   Caused by: NoSuchObjectException(message:ads.api_trade_ro table not found)
        at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table_core(HiveMetaStore.java:1808)
        at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1778)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
        at com.sun.proxy.$Proxy41.get_table(Unknown Source)
        at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1208)
        at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:131)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
        at com.sun.proxy.$Proxy42.getTable(Unknown Source)
        at 
org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:654)
        ... 48 more
   21/01/07 13:24:33 INFO HiveMetaStore: 0: Shutting down the object store...
   21/01/07 13:24:33 INFO audit: ugi=root       ip=unknown-ip-addr      
cmd=Shutting down the object store...   
   21/01/07 13:24:33 INFO HiveMetaStore: 0: Metastore shutdown complete.
   21/01/07 13:24:33 INFO audit: ugi=root       ip=unknown-ip-addr      
cmd=Metastore shutdown complete.        
   21/01/07 13:24:33 INFO DefaultSource: Constructing hoodie (as parquet) data 
source with options :Map(hoodie.datasource.write.insert.drop.duplicates -> 
false, hoodie.datasource.hive_sync.database -> ads, 
hoodie.insert.shuffle.parallelism -> 10, path -> /data/stream/mor/api_trade, 
hoodie.datasource.write.precombine.field -> modified, 
hoodie.datasource.hive_sync.partition_fields -> dt, 
hoodie.datasource.write.payload.class -> 
org.apache.hudi.common.model.OverwriteWithLatestAvroPayload, 
hoodie.datasource.hive_sync.partition_extractor_class -> 
org.apache.hudi.hive.MultiPartKeysValueExtractor, 
hoodie.datasource.write.streaming.retry.interval.ms -> 2000, 
hoodie.datasource.hive_sync.table -> api_trade, hoodie.index.type -> 
GLOBAL_BLOOM, hoodie.datasource.write.streaming.ignore.failed.batch -> true, 
hoodie.datasource.write.operation -> upsert, hoodie.datasource.hive_sync.enable 
-> true, hoodie.datasource.write.recordkey.field -> id, hoodie.table.name -> 
api_trade, hoodie.datasource.hive_sync.jdbcurl -> jdbc:hive2://0.0.0.0:10000, hoodie.datasource.write.table.type 
-> MERGE_ON_READ, hoodie.datasource.write.hive_style_partitioning -> true, 
hoodie.datasource.query.type -> snapshot, 
hoodie.bloom.index.update.partition.path -
   
----------------------------------------------------------------------------------------------------------------------------------------------
   
   ```
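For context on what a successful sync should have produced: for a MERGE_ON_READ table, HiveSyncTool registers two Hive tables derived from `hoodie.datasource.hive_sync.table` — a read-optimized `<table>_ro` and a real-time `<table>_rt` — whereas COPY_ON_WRITE gets a single table. A minimal sketch (plain Python, the helper name is illustrative, not a Hudi API) of the expected table names:

```python
# Expected Hive table names registered by Hudi's hive sync, given the
# value of hoodie.datasource.hive_sync.table and the table type.
def expected_hive_tables(base_table: str, table_type: str) -> list:
    if table_type == "MERGE_ON_READ":
        # MOR: read-optimized (_ro) and real-time/snapshot (_rt) views.
        return [base_table + "_ro", base_table + "_rt"]
    # COW: a single table under the base name.
    return [base_table]

# With hoodie.datasource.hive_sync.table -> api_trade (as in the log above),
# both ads.api_trade_ro and ads.api_trade_rt should exist after a full sync.
print(expected_hive_tables("api_trade", "MERGE_ON_READ"))
```

In the log above, the sync creates `api_trade_ro` but then fails in `updateLastCommitTimeSynced` before ever reaching the `_rt` registration, which matches the observation that no `_rt` CREATE TABLE statement appears.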


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

