[
https://issues.apache.org/jira/browse/HUDI-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101034#comment-17101034
]
Yixue (Andrew) Zhu edited comment on HUDI-603 at 5/6/20, 5:33 PM:
------------------------------------------------------------------
I just started working on this and came up with a seemingly reasonable
approach:
If a DeltaStreamer configuration is enabled (continuous mode plus a new config,
"providerSchemaChangeSupported"), then when the provider schema changes, restart
the program so that it uses the new schema.
This option is reasonable for a couple of reasons:
# Spark serialization of Avro records and schemas is optimized when the schemas
are registered before the program is executed, i.e. before executors are spawned
by the driver. If we refresh the schema w/o recreating SparkConf, which Spark
does not support without restarting the program, the serialization optimization
would be defeated.
# Table schema updates are infrequent.
By throwing an exception in DeltaSync::syncOnce(), the following Spark
configuration would restart the program:
--conf spark.yarn.maxAppAttempts
--conf spark.yarn.am.attemptFailuresValidityInterval
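A minimal sketch of the fail-fast idea described above, in plain Java with no Hudi or Spark dependency. Every name here (SchemaChangeGuard, ProviderSchemaChangedException, checkBeforeSync) is hypothetical and only illustrates the proposal; it is not the actual DeltaSync code. The schema is modelled as a String, and the supplier stands in for whatever the SchemaProvider would return.

```java
import java.util.Objects;
import java.util.function.Supplier;

public class SchemaChangeRestartSketch {

    /** Hypothetical exception: letting it escape ends the current application attempt. */
    static class ProviderSchemaChangedException extends RuntimeException {
        ProviderSchemaChangedException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for a check that syncOnce() could run before each sync round. */
    static class SchemaChangeGuard {
        // Assumed to mirror the proposed "providerSchemaChangeSupported" config.
        private final boolean providerSchemaChangeSupported;
        // Assumed to return the provider's latest schema (as a String in this sketch).
        private final Supplier<String> latestSchema;
        // Schema captured once at startup, i.e. the one the program was launched with.
        private final String schemaAtStartup;

        SchemaChangeGuard(boolean providerSchemaChangeSupported, Supplier<String> latestSchema) {
            this.providerSchemaChangeSupported = providerSchemaChangeSupported;
            this.latestSchema = latestSchema;
            this.schemaAtStartup = latestSchema.get();
        }

        /** Throws if the feature is enabled and the provider schema differs from the startup schema. */
        void checkBeforeSync() {
            if (!providerSchemaChangeSupported) {
                return; // feature disabled: keep using the startup schema
            }
            String current = latestSchema.get();
            if (!Objects.equals(schemaAtStartup, current)) {
                throw new ProviderSchemaChangedException(
                    "Provider schema changed; failing this attempt so the program restarts");
            }
        }
    }

    public static void main(String[] args) {
        String[] schema = {"{\"type\":\"record\",\"name\":\"v1\",\"fields\":[]}"};
        SchemaChangeGuard guard = new SchemaChangeGuard(true, () -> schema[0]);

        guard.checkBeforeSync(); // schema unchanged: no exception

        schema[0] = "{\"type\":\"record\",\"name\":\"v2\",\"fields\":[]}";
        try {
            guard.checkBeforeSync();
        } catch (ProviderSchemaChangedException e) {
            System.out.println("Would restart: " + e.getMessage());
        }
    }
}
```

In a real job the exception would not be caught; it would propagate out of the driver so that the YARN retry settings listed above start a new attempt, which then registers the new schema before executors are spawned.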
was (Author: [email protected]):
I just started working on this and came up with a seemingly reasonable
approach:
If a DeltaStreamer configuration is enabled, then when the provider schema
changes, restart the program so that it uses the new schema.
This option is reasonable for a couple of reasons:
# Spark serialization of Avro records and schemas is optimized when the schemas
are registered before the program is executed, i.e. before executors are spawned
by the driver. If we refresh the schema w/o recreating SparkConf, which Spark
does not support without restarting the program, the serialization optimization
would be defeated.
# Table schema updates are infrequent.
By throwing an exception in DeltaSync::syncOnce(), the following Spark
configuration would restart the program:
--conf spark.yarn.maxAppAttempts
--conf spark.yarn.am.attemptFailuresValidityInterval
> HoodieDeltaStreamer should periodically fetch table schema update
> -----------------------------------------------------------------
>
> Key: HUDI-603
> URL: https://issues.apache.org/jira/browse/HUDI-603
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: DeltaStreamer
> Reporter: Yixue Zhu
> Assignee: Pratyaksh Sharma
> Priority: Major
> Labels: evolution, pull-request-available, schema
>
> HoodieDeltaStreamer creates a SchemaProvider instance and delegates to DeltaSync
> for periodic sync. However, the default implementation of SchemaProvider does
> not refresh the schema, which can change due to schema evolution. DeltaSync
> snapshots the schema when it creates the writeClient, using the SchemaProvider
> instance or picking it up from the source, and the schema for the writeClient is
> not refreshed during the sync loop.
> I think this needs to be addressed to support schema evolution fully.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)