[ 
https://issues.apache.org/jira/browse/GOBBLIN-1248?focusedWorklogId=474527&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-474527
 ]

ASF GitHub Bot logged work on GOBBLIN-1248:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 25/Aug/20 21:19
            Start Date: 25/Aug/20 21:19
    Worklog Time Spent: 10m 
      Work Description: ZihanLi58 commented on a change in pull request #3091:
URL: https://github.com/apache/incubator-gobblin/pull/3091#discussion_r476744238



##########
File path: gobblin-hive-registration/src/main/java/org/apache/gobblin/hive/metastore/HiveMetaStoreBasedRegister.java
##########
@@ -192,16 +193,30 @@ protected void registerPath(HiveSpec spec) throws IOException {
       throw new IOException(e);
     }
   }
-  //TODO: We need to find a better to get the latest schema
-  private void updateSchema(HiveSpec spec, Table table) throws IOException{
+  private void updateSchema(HiveSpec spec, Table table, HiveTable existingTable) throws IOException{
 
     if (this.schemaRegistry.isPresent()) {
       try (Timer.Context context = this.metricContext.timer(GET_AND_SET_LATEST_SCHEMA).time()) {
-        String latestSchema = this.schemaRegistry.get().getLatestSchema(topicName).toString();
-        spec.getTable().getSerDeProps().setProp(AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName(), latestSchema);
+        Schema latestSchema = (Schema) this.schemaRegistry.get().getLatestSchemaByTopic(topicName);

Review comment:
       According to the Kafka team, the schema registry allows "out of order 
registration" of schemas: think of it as sorting schemas by compatibility 
instead of by timestamp. This means the chronologically latest schema is NOT 
necessarily what the registry considers latest. I have also added this 
information to the code comments to avoid confusion.
   In addition, I updated the PR to first compare the schema creation time 
with that of the existing schema. Only if they differ do we fetch the latest 
schema from the registry to get its creation time. This avoids making too 
many calls to the schema registry.
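
The "compare first, fetch only on a difference" idea described above can be
sketched roughly as follows. This is a minimal illustration and not the code
in the PR: the class name, the method name, the use of strings for creation
times, and the Supplier standing in for a schema registry client are all
assumptions made for the example. What the caller does with the fetched
creation time is left out, since the comment does not spell it out.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names are taken from the Gobblin code base.
class SchemaRefreshSketch {

  /**
   * Compares the two locally known creation times first and consults the
   * registry only when they differ.
   *
   * @return the creation time of the registry's latest schema, or empty when
   *         the local values match and no registry call was made
   */
  static Optional<String> latestCreationTimeIfChanged(
      String incomingCreationTime,
      String existingCreationTime,
      Supplier<String> registryLatestCreationTime) {
    if (Objects.equals(incomingCreationTime, existingCreationTime)) {
      return Optional.empty(); // cheap local check, registry untouched
    }
    // The only place where the (assumed) remote lookup happens.
    return Optional.ofNullable(registryLatestCreationTime.get());
  }

  public static void main(String[] args) {
    AtomicInteger registryCalls = new AtomicInteger();
    // Stand-in for a schema registry client; the returned value is made up.
    Supplier<String> registry = () -> {
      registryCalls.incrementAndGet();
      return "1598390340000";
    };

    // Same creation time on both sides: the registry is skipped.
    latestCreationTimeIfChanged("1598390000000", "1598390000000", registry);
    // Different creation times: the registry is consulted once.
    latestCreationTimeIfChanged("1598390340000", "1598390000000", registry);
    System.out.println("registry calls: " + registryCalls.get());
  }
}
```

Passing the registry lookup as a Supplier keeps the remote call lazy, so the
common case where nothing has changed costs only a local comparison.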






Issue Time Tracking
-------------------

    Worklog Id:     (was: 474527)
    Time Spent: 50m  (was: 40m)

> Fix discrepancy between table schema and file schema
> ----------------------------------------------------
>
>                 Key: GOBBLIN-1248
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1248
>             Project: Apache Gobblin
>          Issue Type: Task
>            Reporter: Zihan Li
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Previously, in the streaming pipeline, we always fetched the latest schema 
> from the Kafka SchemaRegistry during Hive registration to avoid a race 
> condition on the metadata schema. Because the Gobblin converter may change 
> the schema, this introduces a discrepancy between the Hive table schema and 
> the actual file schema, so we need a better way to solve this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
