[
https://issues.apache.org/jira/browse/GOBBLIN-1485?focusedWorklogId=616295&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-616295
]
ASF GitHub Bot logged work on GOBBLIN-1485:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 29/Jun/21 13:41
Start Date: 29/Jun/21 13:41
Worklog Time Spent: 10m
Work Description: ZihanLi58 commented on a change in pull request #3324:
URL: https://github.com/apache/gobblin/pull/3324#discussion_r660155720
##########
File path:
gobblin-hive-registration/src/main/java/org/apache/gobblin/hive/HiveRegistrationUnitComparator.java
##########
@@ -142,12 +145,24 @@ public T compareIsStoredAsSubDirs() {
return (T) this;
}
+ private State extractSchemaVersion(State state) {
+ State newState = new State(state);
+ String schemaFromState = state.getProp(AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName());
+ if (schemaFromState != null && !schemaFromState.isEmpty()) {
+ String schemaVersion = AvroUtils.getSchemaCreationTime(new Schema.Parser().parse(schemaFromState));
+ if (schemaVersion != null && !schemaVersion.isEmpty()) {
+ newState.removeProp(AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL.getPropName());
Review comment:
The whole schema may contain special characters, so we remove this prop
when comparing to decide whether the table needs to be updated. The removal
happens on a new State object, so the prop is not removed from the hive unit.
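
To illustrate the point about the copy, here is a minimal sketch using plain java.util.Properties instead of Gobblin's State. The class name, the helper name and the use of "avro.schema.literal" as the key are assumptions made for illustration; this is not the code in the PR.

```java
import java.util.Properties;

// Hypothetical stand-in for the comparator logic; not Gobblin code.
final class ComparisonCopySketch {
    // Assumed key for the Avro schema-literal table property.
    static final String SCHEMA_LITERAL_KEY = "avro.schema.literal";

    // Returns a copy of the properties without the schema literal.
    // The caller's object is never modified.
    static Properties withoutSchemaLiteral(Properties original) {
        Properties copy = new Properties();
        copy.putAll(original);
        copy.remove(SCHEMA_LITERAL_KEY);
        return copy;
    }

    public static void main(String[] args) {
        Properties tableProps = new Properties();
        tableProps.setProperty(SCHEMA_LITERAL_KEY, "{\"type\":\"string\"}");
        tableProps.setProperty("other.prop", "x");

        Properties forComparison = withoutSchemaLiteral(tableProps);

        // The copy used for comparison has no schema literal,
        // while the original still carries it.
        System.out.println(forComparison.containsKey(SCHEMA_LITERAL_KEY));
        System.out.println(tableProps.containsKey(SCHEMA_LITERAL_KEY));
    }
}
```

Only the object used for the comparison loses the prop, so whatever is registered from the original properties is unchanged.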
##########
File path:
gobblin-hive-registration/src/main/java/org/apache/gobblin/hive/orc/HiveOrcSerDeManager.java
##########
@@ -75,12 +78,17 @@
public static final String DEFAULT_SERDE_TYPE = "ORC";
public static final String INPUT_FORMAT_CLASS_KEY = "hiveOrcSerdeManager.inputFormatClass";
public static final String DEFAULT_INPUT_FORMAT_CLASS = OrcInputFormat.class.getName();
+ public static final String WRITER_LATEST_SCHEMA = "writer.latest.schema";
public static final String OUTPUT_FORMAT_CLASS_KEY = "hiveOrcSerdeManager.outputFormatClass";
public static final String DEFAULT_OUTPUT_FORMAT_CLASS = OrcOutputFormat.class.getName();
public static final String HIVE_SPEC_SCHEMA_READING_TIMER = "hiveOrcSerdeManager.schemaReadTimer";
+ public static final String HIVE_SPEC_SCHEMA_FROM_WRITER = "hiveOrcSerdeManager.getSchemaFromWriterSchema";
Review comment:
There are some use cases where the writer schema is set but it is not the
schema for this topic. For example, in the metadata pipeline the writer schema
is the schema for GMCE. If we enable this feature for all, we'll hit issues in
this case.
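
A rough sketch of the opt-in behaviour described here, written against java.util.Properties rather than Gobblin's State. The two keys come from the diff above; the class name, the helper name and the default of "false" are assumptions, since the value of DEFAULT_HIVE_SPEC_SCHEMA_FROM_WRITER is not shown in this hunk.

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper; not the HiveOrcSerDeManager implementation.
final class WriterSchemaFlagSketch {
    static final String HIVE_SPEC_SCHEMA_FROM_WRITER =
        "hiveOrcSerdeManager.getSchemaFromWriterSchema";
    static final String WRITER_LATEST_SCHEMA = "writer.latest.schema";

    // Returns the writer schema only when the job explicitly opts in.
    // An empty result means the caller should fall back to reading the
    // schema from the latest file.
    static Optional<String> writerSchemaIfEnabled(Properties props) {
        // Assumed default: disabled unless the flag is set to "true".
        boolean enabled =
            Boolean.parseBoolean(props.getProperty(HIVE_SPEC_SCHEMA_FROM_WRITER, "false"));
        if (!enabled) {
            return Optional.empty();
        }
        return Optional.ofNullable(props.getProperty(WRITER_LATEST_SCHEMA));
    }

    public static void main(String[] args) {
        Properties metadataPipeline = new Properties();
        // A writer schema is present, but it belongs to a different record type.
        metadataPipeline.setProperty(WRITER_LATEST_SCHEMA, "{\"type\":\"string\"}");

        // Without the flag, the writer schema is ignored.
        System.out.println(writerSchemaIfEnabled(metadataPipeline).isPresent());
    }
}
```

With a flag like this, a pipeline whose writer schema does not describe the registered topic simply leaves the flag unset and keeps the existing file-based path.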
##########
File path:
gobblin-hive-registration/src/main/java/org/apache/gobblin/hive/orc/HiveOrcSerDeManager.java
##########
@@ -264,7 +272,18 @@ private void addSchemaProperties(Path path, HiveRegistrationUnit hiveUnit)
*
*/
protected void addSchemaPropertiesHelper(Path path, HiveRegistrationUnit hiveUnit) throws IOException {
- TypeInfo schema = getSchemaFromLatestFile(path, this.fs);
+ TypeInfo schema;
+ if(props.getPropAsBoolean(HIVE_SPEC_SCHEMA_FROM_WRITER, DEFAULT_HIVE_SPEC_SCHEMA_FROM_WRITER)) {
+ try {
+ Preconditions.checkArgument(props.contains(WRITER_LATEST_SCHEMA));
+ Schema avroSchema = new Schema.Parser().parse(props.getProp(WRITER_LATEST_SCHEMA));
+ schema = TypeInfoUtils.getTypeInfoFromObjectInspector(new AvroObjectInspectorGenerator(avroSchema).getObjectInspector());
Review comment:
Yeah, I was trying to do that. Several reasons here:
1. AvroOrcSchemaConverter is currently defined in the gobblin-orc module, and I
don't think it makes sense for us to introduce a new dependency for the hive
registration module.
2. It's doable to convert a TypeDescription to a TypeInfo, but it works the
same way: we would need to use OrcUtils to create an objectInspector and get
the typeInfo from it. As we are using the writer schema to get orcSchema, I
think the two results should be the same? I verified one table and it looks
good to me.
What do you think?
Issue Time Tracking
-------------------
Worklog Id: (was: 616295)
Time Spent: 3.5h (was: 3h 20m)
> Enable feature to get schema from writer schema when do hive registration
> -------------------------------------------------------------------------
>
> Key: GOBBLIN-1485
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1485
> Project: Apache Gobblin
> Issue Type: New Feature
> Reporter: Zihan Li
> Priority: Major
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> Enable a feature to get the schema from the writer schema when doing Hive
> registration, so that we can avoid list operations to get the latest schema