[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=268126&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-268126 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 27/Jun/19 00:41
Start Date: 27/Jun/19 00:41
Worklog Time Spent: 10m
Work Description: jhsenjaliya commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r297918336

## File path: gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java ##
@@ -39,32 +43,38 @@
  * check if the schema matches the expected schema. If not it will abort the job.
  */
-public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor{
+public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor {

Review comment:
ok, sure. Thanks

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 268126)
Time Spent: 2h 50m (was: 2h 40m)

> Implement Schema Comparison Strategy during Disctp
> --------------------------------------------------
>
> Key: GOBBLIN-772
> URL: https://issues.apache.org/jira/browse/GOBBLIN-772
> Project: Apache Gobblin
> Issue Type: Task
> Reporter: Zihan Li
> Priority: Major
> Fix For: 0.15.0
>
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> We need a schema comparison strategy to make sure the real schema and the
> expected schema have matching field names and types.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=249626&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-249626 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 28/May/19 20:37
Start Date: 28/May/19 20:37
Worklog Time Spent: 10m
Work Description: asfgit commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637

Issue Time Tracking
-------------------

Worklog Id: (was: 249626)
Time Spent: 2h 40m (was: 2.5h)
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=249625&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-249625 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 28/May/19 20:36
Start Date: 28/May/19 20:36
Worklog Time Spent: 10m
Work Description: ibuenros commented on issue #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#issuecomment-496679381

+1

Issue Time Tracking
-------------------

Worklog Id: (was: 249625)
Time Spent: 2.5h (was: 2h 20m)
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=248144&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-248144 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 24/May/19 16:57
Start Date: 24/May/19 16:57
Worklog Time Spent: 10m
Work Description: autumnust commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r287440685

## File path: gobblin-restli/gobblin-throttling-service/gobblin-throttling-service-api/src/main/snapshot/org.apache.gobblin.restli.throttling.permits.snapshot.json ##
@@ -17,6 +17,18 @@
 "type" : "long",
 "doc" : "Client should not try to acquire permits before this delay has passed.",
 "optional" : true
+}, {

Review comment:
Got the point. Just mention in the description why this change (which is unrelated to the PR itself) shows up as part of the code change.

Issue Time Tracking
-------------------

Worklog Id: (was: 248144)
Time Spent: 2h 10m (was: 2h)
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=248143&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-248143 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 24/May/19 16:57
Start Date: 24/May/19 16:57
Worklog Time Spent: 10m
Work Description: autumnust commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r287436534

## File path: gobblin-data-management/src/main/java/org/apache/gobblin/util/schema_check/AvroSchemaCheckStrategy.java ##
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.util.schema_check;
+
+import org.apache.avro.Schema;
+import org.apache.gobblin.configuration.ConfigurationKeys;
+import org.apache.gobblin.configuration.WorkUnitState;
+
+
+/**
+ * The strategy to compare Avro schema.
+ */
+public interface AvroSchemaCheckStrategy {
+  /**
+   * A factory to initiate the Strategy
+   */
+  class AvroSchemaCheckStrategyFactory {
+    /**
+     * Use the configuration to create a schema check strategy. If it's not found, return null.
+     * @param state
+     * @return
+     */
+    public static AvroSchemaCheckStrategy create(WorkUnitState state)
+    {
+      try {
+        return (AvroSchemaCheckStrategy) Class.forName(state.getProp(ConfigurationKeys.AVRO_SCHEMA_CHECK_STRATEGY, ConfigurationKeys.AVRO_SCHEMA_CHECK_STRATEGY_DEFAULT)).newInstance();
+      } catch (Exception e)
+      {

Review comment:
nitpick: we usually put `{` on the same line as `catch`

Issue Time Tracking
-------------------

Worklog Id: (was: 248143)
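The factory in the diff above loads a strategy class by name from configuration and returns null when that fails. A minimal, self-contained sketch of the same pattern follows. It is not Gobblin's code: `SchemaCheckStrategy`, `ExactMatchStrategy`, the `schema.check.strategy` key, and the use of `java.util.Properties` in place of `WorkUnitState` are all assumptions made for illustration. It uses `getDeclaredConstructor().newInstance()` because `Class.newInstance()` has been deprecated since Java 9, and it keeps `{` on the same line as `catch`, as the review asks.

```java
import java.util.Properties;

/** Hypothetical stand-in for the strategy interface; not Gobblin's actual API. */
interface SchemaCheckStrategy {
  boolean compare(String expected, String toValidate);
}

/** Hypothetical default strategy: plain string equality. */
class ExactMatchStrategy implements SchemaCheckStrategy {
  @Override
  public boolean compare(String expected, String toValidate) {
    return expected.equals(toValidate);
  }
}

final class SchemaCheckStrategyFactory {
  // Assumed configuration key and default, chosen for this sketch only.
  static final String STRATEGY_KEY = "schema.check.strategy";
  static final String STRATEGY_DEFAULT = ExactMatchStrategy.class.getName();

  private SchemaCheckStrategyFactory() {
  }

  /** Returns the configured strategy, or null if the class cannot be loaded or has the wrong type. */
  static SchemaCheckStrategy create(Properties props) {
    String className = props.getProperty(STRATEGY_KEY, STRATEGY_DEFAULT);
    try {
      return (SchemaCheckStrategy) Class.forName(className).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      return null;
    }
  }
}
```

Returning null mirrors the Javadoc in the diff; a caller has to check for it before using the strategy.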
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244813&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244813 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 20/May/19 05:23
Start Date: 20/May/19 05:23
Worklog Time Spent: 10m
Work Description: jhsenjaliya commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285431587

## File path: gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java ##
@@ -39,32 +43,38 @@
  * check if the schema matches the expected schema. If not it will abort the job.
  */
-public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor{
+public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor {

Review comment:
Just a comment on this class: I understand that this class already exists and that this PR adds a proper field-level schema check to it. But could the schema check be part of `FileAwareInputStreamExtractor` itself, instead of living in a separate class? The check is the only difference between the two classes, and it could probably be controlled by a schemaCheck flag. Maybe that is a separate PR.

Issue Time Tracking
-------------------

Worklog Id: (was: 244813)
Time Spent: 1h 50m (was: 1h 40m)
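The suggestion above, one extractor class with the check switched on by a flag, could look roughly like the sketch below. Everything in it is hypothetical: `FlaggedExtractor`, its constructor arguments, and the use of strings for the schema and the payload are simplifications, and `IOException` stands in for the `DataRecordException` thrown in the PR.

```java
import java.io.IOException;

/** Hypothetical, simplified extractor; the real class reads files through a Hadoop FileSystem. */
class FlaggedExtractor {
  private final boolean schemaCheckEnabled;
  private final String expectedSchema;

  FlaggedExtractor(boolean schemaCheckEnabled, String expectedSchema) {
    this.schemaCheckEnabled = schemaCheckEnabled;
    this.expectedSchema = expectedSchema;
  }

  /** Returns the payload; validates the schema first only when the flag is set. */
  String buildStream(String actualSchema, String payload) throws IOException {
    if (schemaCheckEnabled && !expectedSchema.equals(actualSchema)) {
      throw new IOException("Schema does not match the expected schema");
    }
    return payload;
  }
}
```

With this shape the subclass disappears and the default behaviour (flag off) stays what `FileAwareInputStreamExtractor` does today; whether that trade-off suits Gobblin is the reviewer's open question, not something this sketch settles.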
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244177&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244177 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 17/May/19 18:19
Start Date: 17/May/19 18:19
Worklog Time Spent: 10m
Work Description: autumnust commented on issue #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#issuecomment-493550408

@ZihanLi58 Travis is failing, can you check if it is related to your changes?

Issue Time Tracking
-------------------

Worklog Id: (was: 244177)
Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244176&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244176 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 17/May/19 18:18
Start Date: 17/May/19 18:18
Worklog Time Spent: 10m
Work Description: autumnust commented on issue #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#issuecomment-493550408

@ZihanLi58 Travis is failing

Issue Time Tracking
-------------------

Worklog Id: (was: 244176)
Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244136&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244136 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 17/May/19 16:54
Start Date: 17/May/19 16:54
Worklog Time Spent: 10m
Work Description: ZihanLi58 commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285206877

## File path: gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java ##
@@ -39,32 +42,109 @@
  * check if the schema matches the expected schema. If not it will abort the job.
  */
-public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor{
+public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor {
 
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state)
-  {
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state) {
     super(fs, file, state);
   }
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file)
-  {
+
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file) {
     this(fs, file, null);
   }
 
   @Override
-  protected FileAwareInputStream buildStream(FileSystem fsFromFile)
-      throws DataRecordException, IOException{
-    if(!schemaChecking(fsFromFile))
-    {
+  protected FileAwareInputStream buildStream(FileSystem fsFromFile) throws DataRecordException, IOException {
+    if (!schemaChecking(fsFromFile)) {
       throw new DataRecordException("Schema does not match the expected schema");
     }
     return super.buildStream(fsFromFile);
   }
 
-  protected boolean schemaChecking(FileSystem fsFromFile)
-      throws IOException {
+  protected boolean schemaChecking(FileSystem fsFromFile) throws IOException {
     DatumReader datumReader = new GenericDatumReader<>();
-    DataFileReader dataFileReader = new DataFileReader(new FsInput(this.file.getFileStatus().getPath(),fsFromFile), datumReader);
+    DataFileReader dataFileReader =
+        new DataFileReader(new FsInput(this.file.getFileStatus().getPath(), fsFromFile), datumReader);
     Schema schema = dataFileReader.getSchema();
-    return schema.toString().equals(this.state.getProp(ConfigurationKeys.COPY_EXPECTED_SCHEMA));
+    Schema expectedSchema = new Schema.Parser().parse(this.state.getProp(ConfigurationKeys.COPY_EXPECTED_SCHEMA));
+
+    return compare(schema, expectedSchema);
+  }
+
+  private boolean compare(Schema toValidate, Schema expected) {
+    if (toValidate.getType() != expected.getType() || !toValidate.getName().equals(expected.getName())) {return false;}
+    else {
+      switch (toValidate.getType()) {
+        case NULL:
+        case BOOLEAN:
+        case INT:
+        case LONG:
+        case FLOAT:
+        case DOUBLE:
+        case BYTES:
+        case STRING: {
+          return true;
+        }
+        case ARRAY: {
+          return compare(toValidate.getElementType(), expected.getElementType());
+        }
+        case MAP: {
+          return compare(toValidate.getValueType(), expected.getValueType());
+        }
+        case FIXED: {
+          // fixed size and name must match:
+          if (toValidate.getFixedSize() != expected.getFixedSize()) {
+            return false;
+          }
+        }
+        case ENUM: {
+          // expected symbols must contain all toValidate symbols:
+          final Set expectedSymbols = new HashSet(expected.getEnumSymbols());
+          final Set toValidateSymbols = new HashSet(toValidate.getEnumSymbols());
+          if (expectedSymbols.size() != toValidateSymbols.size()) {
+            return false;
+          }
+          if (!expectedSymbols.containsAll(toValidateSymbols)) {
+            return false;
+          }
+        }
+
+        case RECORD: {
+          // Check that each field of toValidate schema is in expected schema
+          if(toValidate.getFields().size() != expected.getFields().size()) {return false;}
+          for (final Schema.Field expectedFiled : expected.getFields()) {
+            final Schema.Field toValidateField = toValidate.getField(expectedFiled.name());
+            if (toValidateField == null) {
+              // expected field does not correspond to any field in the toValidate record schema
+              return false;
+            } else {
+              if (!compare(toValidateField.schema(), expectedFiled.schema())) {
+                return false;
+              }
+            }
+          }
+          return true;
+        }
+        case UNION:
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244133&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244133 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 17/May/19 16:48
Start Date: 17/May/19 16:48
Worklog Time Spent: 10m
Work Description: ZihanLi58 commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285204856

## File path: gobblin-restli/gobblin-throttling-service/gobblin-throttling-service-api/src/main/snapshot/org.apache.gobblin.restli.throttling.permits.snapshot.json ##
@@ -17,6 +17,18 @@
 "type" : "long",
 "doc" : "Client should not try to acquire permits before this delay has passed.",
 "optional" : true
+}, {

Review comment:
Every time I build the project, this change is made automatically.

Issue Time Tracking
-------------------

Worklog Id: (was: 244133)
Time Spent: 1h 10m (was: 1h)
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244125&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244125 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 17/May/19 16:30
Start Date: 17/May/19 16:30
Worklog Time Spent: 10m
Work Description: autumnust commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285198718

## File path: gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java ##
@@ -39,32 +42,109 @@
  * check if the schema matches the expected schema. If not it will abort the job.
  */
-public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor{
+public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor {
 
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state)
-  {
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state) {
     super(fs, file, state);
   }
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file)
-  {
+
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file) {
     this(fs, file, null);
   }
 
   @Override
-  protected FileAwareInputStream buildStream(FileSystem fsFromFile)
-      throws DataRecordException, IOException{
-    if(!schemaChecking(fsFromFile))
-    {
+  protected FileAwareInputStream buildStream(FileSystem fsFromFile) throws DataRecordException, IOException {
+    if (!schemaChecking(fsFromFile)) {
       throw new DataRecordException("Schema does not match the expected schema");
     }
     return super.buildStream(fsFromFile);
   }
 
-  protected boolean schemaChecking(FileSystem fsFromFile)
-      throws IOException {
+  protected boolean schemaChecking(FileSystem fsFromFile) throws IOException {
     DatumReader datumReader = new GenericDatumReader<>();
-    DataFileReader dataFileReader = new DataFileReader(new FsInput(this.file.getFileStatus().getPath(),fsFromFile), datumReader);
+    DataFileReader dataFileReader =
+        new DataFileReader(new FsInput(this.file.getFileStatus().getPath(), fsFromFile), datumReader);
     Schema schema = dataFileReader.getSchema();
-    return schema.toString().equals(this.state.getProp(ConfigurationKeys.COPY_EXPECTED_SCHEMA));
+    Schema expectedSchema = new Schema.Parser().parse(this.state.getProp(ConfigurationKeys.COPY_EXPECTED_SCHEMA));
+
+    return compare(schema, expectedSchema);
+  }
+
+  private boolean compare(Schema toValidate, Schema expected) {
+    if (toValidate.getType() != expected.getType() || !toValidate.getName().equals(expected.getName())) {return false;}
+    else {
+      switch (toValidate.getType()) {
+        case NULL:
+        case BOOLEAN:
+        case INT:
+        case LONG:
+        case FLOAT:
+        case DOUBLE:
+        case BYTES:
+        case STRING: {
+          return true;
+        }
+        case ARRAY: {
+          return compare(toValidate.getElementType(), expected.getElementType());
+        }
+        case MAP: {
+          return compare(toValidate.getValueType(), expected.getValueType());
+        }
+        case FIXED: {
+          // fixed size and name must match:
+          if (toValidate.getFixedSize() != expected.getFixedSize()) {
+            return false;
+          }
+        }
+        case ENUM: {
+          // expected symbols must contain all toValidate symbols:
+          final Set expectedSymbols = new HashSet(expected.getEnumSymbols());
+          final Set toValidateSymbols = new HashSet(toValidate.getEnumSymbols());
+          if (expectedSymbols.size() != toValidateSymbols.size()) {
+            return false;
+          }
+          if (!expectedSymbols.containsAll(toValidateSymbols)) {
+            return false;
+          }
+        }
+
+        case RECORD: {
+          // Check that each field of toValidate schema is in expected schema
+          if(toValidate.getFields().size() != expected.getFields().size()) {return false;}
+          for (final Schema.Field expectedFiled : expected.getFields()) {
+            final Schema.Field toValidateField = toValidate.getField(expectedFiled.name());
+            if (toValidateField == null) {
+              // expected field does not correspond to any field in the toValidate record schema
+              return false;
+            } else {
+              if (!compare(toValidateField.schema(), expectedFiled.schema())) {
+                return false;
+              }
+            }
+          }
+          return true;
+        }
+        case UNION:
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244124&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244124 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 17/May/19 16:30
Start Date: 17/May/19 16:30
Worklog Time Spent: 10m
Work Description: autumnust commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285199219

## File path: gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java ##
@@ -39,32 +42,109 @@
  * check if the schema matches the expected schema. If not it will abort the job.
  */
-public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor{
+public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor {
 
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state)
-  {
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state) {
     super(fs, file, state);
   }
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file)
-  {
+
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file) {
     this(fs, file, null);
   }
 
   @Override
-  protected FileAwareInputStream buildStream(FileSystem fsFromFile)
-      throws DataRecordException, IOException{
-    if(!schemaChecking(fsFromFile))
-    {
+  protected FileAwareInputStream buildStream(FileSystem fsFromFile) throws DataRecordException, IOException {
+    if (!schemaChecking(fsFromFile)) {
       throw new DataRecordException("Schema does not match the expected schema");
     }
     return super.buildStream(fsFromFile);
   }
 
-  protected boolean schemaChecking(FileSystem fsFromFile)
-      throws IOException {
+  protected boolean schemaChecking(FileSystem fsFromFile) throws IOException {
     DatumReader datumReader = new GenericDatumReader<>();
-    DataFileReader dataFileReader = new DataFileReader(new FsInput(this.file.getFileStatus().getPath(),fsFromFile), datumReader);
+    DataFileReader dataFileReader =
+        new DataFileReader(new FsInput(this.file.getFileStatus().getPath(), fsFromFile), datumReader);
     Schema schema = dataFileReader.getSchema();
-    return schema.toString().equals(this.state.getProp(ConfigurationKeys.COPY_EXPECTED_SCHEMA));
+    Schema expectedSchema = new Schema.Parser().parse(this.state.getProp(ConfigurationKeys.COPY_EXPECTED_SCHEMA));
+
+    return compare(schema, expectedSchema);
+  }
+
+  private boolean compare(Schema toValidate, Schema expected) {

Review comment:
Depending on the scope of your schema checking, this method could be reused for other purposes. Maybe make it a static method or move it into a utility class?

Issue Time Tracking
-------------------

Worklog Id: (was: 244124)
Time Spent: 0.5h (was: 20m)
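One way to read the suggestion above is a stateless utility with a documented scope. The sketch below shows that shape under loud assumptions: `SchemaCompareSketch` and its `Node` type are hypothetical stand-ins for Avro's `Schema`, so the example runs without the Avro library, and only primitive, array, enum and record kinds are modelled. Each `case` returns explicitly; in the diff quoted in this thread, the FIXED and ENUM blocks have no `return` at the end, so control falls through into the next case.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a reusable, static schema comparison. Node is a hypothetical stand-in for Avro's Schema. */
final class SchemaCompareSketch {
  enum Kind { PRIMITIVE, ARRAY, ENUM, RECORD }

  static final class Node {
    final Kind kind;
    final String name;
    final String doc;                // carried along but never compared
    final Node element;              // ARRAY only
    final List<String> symbols;      // ENUM only
    final Map<String, Node> fields;  // RECORD only

    private Node(Kind kind, String name, String doc, Node element,
        List<String> symbols, Map<String, Node> fields) {
      this.kind = kind;
      this.name = name;
      this.doc = doc;
      this.element = element;
      this.symbols = symbols;
      this.fields = fields;
    }
  }

  private SchemaCompareSketch() {
  }

  static Node primitive(String name) {
    return new Node(Kind.PRIMITIVE, name, null, null, Collections.emptyList(), Collections.emptyMap());
  }

  static Node array(Node element) {
    return new Node(Kind.ARRAY, "array", null, element, Collections.emptyList(), Collections.emptyMap());
  }

  static Node enumOf(String name, String... symbols) {
    return new Node(Kind.ENUM, name, null, null, Arrays.asList(symbols), Collections.emptyMap());
  }

  static Node record(String name, String doc, Map<String, Node> fields) {
    return new Node(Kind.RECORD, name, doc, null, Collections.emptyList(), new LinkedHashMap<>(fields));
  }

  /**
   * Returns true when both schemas have the same kind and name at every level,
   * the same set of enum symbols, and the same set of record fields with
   * matching field schemas. Docs and field order are ignored.
   */
  static boolean compare(Node expected, Node toValidate) {
    if (expected.kind != toValidate.kind || !expected.name.equals(toValidate.name)) {
      return false;
    }
    switch (expected.kind) {
      case PRIMITIVE:
        return true;
      case ARRAY:
        return compare(expected.element, toValidate.element);
      case ENUM:
        return new HashSet<>(expected.symbols).equals(new HashSet<>(toValidate.symbols));
      case RECORD:
        if (expected.fields.size() != toValidate.fields.size()) {
          return false;
        }
        for (Map.Entry<String, Node> field : expected.fields.entrySet()) {
          Node other = toValidate.fields.get(field.getKey());
          if (other == null || !compare(field.getValue(), other)) {
            return false;
          }
        }
        return true;
      default:
        return false;
    }
  }
}
```

Because `compare` touches no instance state, it can be called from the extractor or from any other component, and the Javadoc states the scope of the check that the review asks about elsewhere in this thread.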
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244127&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244127 ]

ASF GitHub Bot logged work on GOBBLIN-772:
------------------------------------------

Author: ASF GitHub Bot
Created on: 17/May/19 16:30
Start Date: 17/May/19 16:30
Worklog Time Spent: 10m
Work Description: autumnust commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285190404

## File path: gobblin-restli/gobblin-throttling-service/gobblin-throttling-service-api/src/main/snapshot/org.apache.gobblin.restli.throttling.permits.snapshot.json ##
@@ -17,6 +17,18 @@
 "type" : "long",
 "doc" : "Client should not try to acquire permits before this delay has passed.",
 "optional" : true
+}, {

Review comment:
Is this change from another PR? Can you rebase onto the latest changes so that it won't be included?

Issue Time Tracking
-------------------

Worklog Id: (was: 244127)
Time Spent: 1h (was: 50m)
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244126=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244126 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--
Author: ASF GitHub Bot
Created on: 17/May/19 16:30
Start Date: 17/May/19 16:30
Worklog Time Spent: 10m
Work Description: autumnust commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285198515

## File path: gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java
## @@ -39,32 +42,109 @@
  * check if the schema matches the expected schema. If not it will abort the job.
  */
-public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor{
+public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor {
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state)
-  {
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state) {
     super(fs, file, state);
   }
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file)
-  {
+
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file) {
     this(fs, file, null);
   }
   @Override
-  protected FileAwareInputStream buildStream(FileSystem fsFromFile)
-      throws DataRecordException, IOException{
-    if(!schemaChecking(fsFromFile))
-    {
+  protected FileAwareInputStream buildStream(FileSystem fsFromFile) throws DataRecordException, IOException {
+    if (!schemaChecking(fsFromFile)) {
       throw new DataRecordException("Schema does not match the expected schema");
     }
     return super.buildStream(fsFromFile);
   }
-  protected boolean schemaChecking(FileSystem fsFromFile)
-      throws IOException {
+  protected boolean schemaChecking(FileSystem fsFromFile) throws IOException {

Review comment:
What is the scope of this schema checking? Does it verify that all fields inside `toValidateSchema` appear in the expected schema? Please add Javadoc for this.

Issue Time Tracking
---
Worklog Id: (was: 244126)
Time Spent: 50m (was: 40m)

> Implement Schema Comparison Strategy during Disctp
> --
>
> Key: GOBBLIN-772
> URL: https://issues.apache.org/jira/browse/GOBBLIN-772
> Project: Apache Gobblin
> Issue Type: Task
> Reporter: Zihan Li
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> We need a schema comparison strategy to make sure the real schema and the
> expected schema have matching field names and types.
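The review comment above asks whether the check is one-directional (every field of `toValidateSchema` appears in the expected schema) or an exact match. The thread does not settle this here, so the following is only a minimal sketch of the two readings. It assumes a hypothetical stand-in for a schema (a map from field name to type name) and hypothetical method names; it is not the PR's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaScopeSketch {

  // Assumption: a schema is modeled as field name -> type name.
  // This map is an illustration only, not the schema type used by the extractor.

  /** Containment: every field of toValidate appears in expected with the same type. */
  public static boolean isContainedIn(Map<String, String> toValidate, Map<String, String> expected) {
    for (Map.Entry<String, String> field : toValidate.entrySet()) {
      String expectedType = expected.get(field.getKey());
      if (expectedType == null || !expectedType.equals(field.getValue())) {
        return false;
      }
    }
    return true;
  }

  /** Exact match: containment must hold in both directions. */
  public static boolean matchesExactly(Map<String, String> toValidate, Map<String, String> expected) {
    return isContainedIn(toValidate, expected) && isContainedIn(expected, toValidate);
  }

  public static void main(String[] args) {
    Map<String, String> expected = new LinkedHashMap<>();
    expected.put("id", "long");
    expected.put("name", "string");

    Map<String, String> toValidate = new LinkedHashMap<>();
    toValidate.put("id", "long");

    // The schema under validation is a strict subset of the expected schema.
    System.out.println(isContainedIn(toValidate, expected));  // prints "true"
    System.out.println(matchesExactly(toValidate, expected)); // prints "false"
  }
}
```

Whichever reading is intended, stating it in the Javadoc of `schemaChecking` would answer the reviewer's question directly.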
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=243713=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-243713 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--
Author: ASF GitHub Bot
Created on: 16/May/19 22:59
Start Date: 16/May/19 22:59
Worklog Time Spent: 10m
Work Description: ZihanLi58 commented on pull request #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

### JIRA
- [ ] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
    - https://issues.apache.org/jira/browse/GOBBLIN-772

### Description
- [ ] Here are some details about my PR, including screenshots (if applicable):
We need a schema comparison strategy to make sure the real schema and the expected schema have matching field names and types.

### Tests
- [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
Give the real schema and the expected schema a different field name or type and make sure the method returns false. Then give them matching names and types but different docs and make sure the method returns true.

### Commits
- [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Issue Time Tracking
---
Worklog Id: (was: 243713)
Time Spent: 10m
Remaining Estimate: 0h

> Implement Schema Comparison Strategy during Disctp
> --
>
> Key: GOBBLIN-772
> URL: https://issues.apache.org/jira/browse/GOBBLIN-772
> Project: Apache Gobblin
> Issue Type: Task
> Reporter: Zihan Li
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We need a schema comparison strategy to make sure the real schema and the
> expected schema have matching field names and types.
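The test plan in the PR description above (a differing name or type must yield false; matching names and types with differing docs must yield true) can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the `Field` class, the `namesAndTypesMatch` method, and the choice to compare fields in order are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class SchemaCompareSketch {

  /** Hypothetical minimal field model: a name, a type name, and free-text documentation. */
  public static final class Field {
    final String name;
    final String type;
    final String doc;

    public Field(String name, String type, String doc) {
      this.name = name;
      this.type = type;
      this.doc = doc;
    }
  }

  /**
   * Returns true when both schemas list the same fields by name and type.
   * The doc string is deliberately ignored. Comparing position by position
   * is an assumption of this sketch.
   */
  public static boolean namesAndTypesMatch(List<Field> actual, List<Field> expected) {
    if (actual.size() != expected.size()) {
      return false;
    }
    for (int i = 0; i < actual.size(); i++) {
      Field a = actual.get(i);
      Field e = expected.get(i);
      if (!a.name.equals(e.name) || !a.type.equals(e.type)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<Field> expected = Arrays.asList(new Field("id", "long", "primary key"));

    List<Field> differentType = Arrays.asList(new Field("id", "int", "primary key"));
    List<Field> differentName = Arrays.asList(new Field("key", "long", "primary key"));
    List<Field> differentDoc = Arrays.asList(new Field("id", "long", "row identifier"));

    System.out.println(namesAndTypesMatch(differentType, expected)); // prints "false"
    System.out.println(namesAndTypesMatch(differentName, expected)); // prints "false"
    System.out.println(namesAndTypesMatch(differentDoc, expected));  // prints "true"
  }
}
```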
[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp
[ https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=243716=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-243716 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--
Author: ASF GitHub Bot
Created on: 16/May/19 23:02
Start Date: 16/May/19 23:02
Worklog Time Spent: 10m
Work Description: ZihanLi58 commented on issue #2637: [GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#issuecomment-493259924

@ibuenros @autumnust Can you take a look at this code?

Issue Time Tracking
---
Worklog Id: (was: 243716)
Time Spent: 20m (was: 10m)

> Implement Schema Comparison Strategy during Disctp
> --
>
> Key: GOBBLIN-772
> URL: https://issues.apache.org/jira/browse/GOBBLIN-772
> Project: Apache Gobblin
> Issue Type: Task
> Reporter: Zihan Li
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> We need a schema comparison strategy to make sure the real schema and the
> expected schema have matching field names and types.