[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-06-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=268126&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-268126
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 27/Jun/19 00:41
Start Date: 27/Jun/19 00:41
Worklog Time Spent: 10m 
  Work Description: jhsenjaliya commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r297918336
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java
 ##
 @@ -39,32 +43,38 @@
  * check if the schema matches the expected schema. If not it will abort the job.
  */
 
-public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor{
+public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor {
 
 Review comment:
   ok, sure. Thanks
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 268126)
Time Spent: 2h 50m  (was: 2h 40m)

> Implement Schema Comparison Strategy during Disctp
> --
>
> Key: GOBBLIN-772
> URL: https://issues.apache.org/jira/browse/GOBBLIN-772
> Project: Apache Gobblin
>  Issue Type: Task
>Reporter: Zihan Li
>Priority: Major
> Fix For: 0.15.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> We need a schema comparison strategy to make sure the real schema and the 
> expected schema have matching field names and types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-28 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=249626&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-249626
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 28/May/19 20:37
Start Date: 28/May/19 20:37
Worklog Time Spent: 10m 
  Work Description: asfgit commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637
 
 
   
 



Issue Time Tracking
---

Worklog Id: (was: 249626)
Time Spent: 2h 40m  (was: 2.5h)



[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-28 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=249625&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-249625
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 28/May/19 20:36
Start Date: 28/May/19 20:36
Worklog Time Spent: 10m 
  Work Description: ibuenros commented on issue #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: 
https://github.com/apache/incubator-gobblin/pull/2637#issuecomment-496679381
 
 
   +1
 



Issue Time Tracking
---

Worklog Id: (was: 249625)
Time Spent: 2.5h  (was: 2h 20m)



[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-24 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=248144&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-248144
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 24/May/19 16:57
Start Date: 24/May/19 16:57
Worklog Time Spent: 10m 
  Work Description: autumnust commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r287440685
 
 

 ##
 File path: 
gobblin-restli/gobblin-throttling-service/gobblin-throttling-service-api/src/main/snapshot/org.apache.gobblin.restli.throttling.permits.snapshot.json
 ##
 @@ -17,6 +17,18 @@
   "type" : "long",
   "doc" : "Client should not try to acquire permits before this delay has passed.",
   "optional" : true
+}, {
 
 Review comment:
   Got the point. Just mention in the description why this change (which is unrelated to the PR itself) shows up as part of the code change.
 



Issue Time Tracking
---

Worklog Id: (was: 248144)
Time Spent: 2h 10m  (was: 2h)



[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-24 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=248143&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-248143
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 24/May/19 16:57
Start Date: 24/May/19 16:57
Worklog Time Spent: 10m 
  Work Description: autumnust commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r287436534
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/util/schema_check/AvroSchemaCheckStrategy.java
 ##
 @@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.util.schema_check;
+
+import org.apache.avro.Schema;
+import org.apache.gobblin.configuration.ConfigurationKeys;
+import org.apache.gobblin.configuration.WorkUnitState;
+
+
+/**
+ * The strategy to compare Avro schema.
+ */
+public interface AvroSchemaCheckStrategy {
+  /**
+   * A factory to initiate the Strategy
+   */
+  class AvroSchemaCheckStrategyFactory {
+    /**
+     * Use the configuration to create a schema check strategy. If it's not found, return null.
+     * @param state
+     * @return
+     */
+    public static AvroSchemaCheckStrategy create(WorkUnitState state)
+    {
+      try {
+        return (AvroSchemaCheckStrategy) Class.forName(state.getProp(ConfigurationKeys.AVRO_SCHEMA_CHECK_STRATEGY, ConfigurationKeys.AVRO_SCHEMA_CHECK_STRATEGY_DEFAULT)).newInstance();
+      } catch (Exception e)
+      {
 
 Review comment:
   nitpick: we usually put `{` on the same line as `catch`
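
   For readers following this nit, a minimal sketch of a reflective factory with `{` kept on the same line as `catch` might look like the following. It is not the PR's code: `CheckStrategy`, `ExactNameStrategy`, `StrategyFactory` and the property key `sketch.schema.check.strategy` are hypothetical stand-ins, and plain `java.util.Properties` replaces Gobblin's `WorkUnitState` so that the example needs only the JDK.

```java
import java.util.Properties;

/** Hypothetical strategy interface; a stand-in for AvroSchemaCheckStrategy. */
interface CheckStrategy {
  boolean compare(String expected, String toValidate);
}

/** Hypothetical default strategy: the two schema strings must be equal. */
class ExactNameStrategy implements CheckStrategy {
  @Override
  public boolean compare(String expected, String toValidate) {
    return expected.equals(toValidate);
  }
}

/** Creates a strategy from configuration, returning null when it cannot be loaded. */
class StrategyFactory {
  // Assumed property key, for illustration only.
  static final String STRATEGY_KEY = "sketch.schema.check.strategy";

  static CheckStrategy create(Properties props) {
    String className = props.getProperty(STRATEGY_KEY, ExactNameStrategy.class.getName());
    try {
      return (CheckStrategy) Class.forName(className).getDeclaredConstructor().newInstance();
    } catch (Exception e) {
      // Opening brace sits on the same line as catch, as the review asks.
      return null;
    }
  }
}
```

   The sketch goes through `getDeclaredConstructor().newInstance()` because `Class.newInstance()` has been deprecated since Java 9; whether the project wants that change is a separate question.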
   
 



Issue Time Tracking
---

Worklog Id: (was: 248143)

> Implement Schema Comparison Strategy during Disctp
> --
>
> Key: GOBBLIN-772
> URL: https://issues.apache.org/jira/browse/GOBBLIN-772
> Project: Apache Gobblin
>  Issue Type: Task
>Reporter: Zihan Li
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We need a schema comparison strategy to make sure the real schema and the 
> expected schema have matching field names and types.





[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-19 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244813&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244813
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 20/May/19 05:23
Start Date: 20/May/19 05:23
Worklog Time Spent: 10m 
  Work Description: jhsenjaliya commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285431587
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java
 ##
 @@ -39,32 +43,38 @@
  * check if the schema matches the expected schema. If not it will abort the job.
  */
 
-public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor{
+public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor {
 
 Review comment:
   Just a comment on this class: I understand that it already exists and that this PR adds a proper field-level schema check. But could schemaCheck be part of `FileAwareInputStreamExtractor` itself, rather than living in a separate class? The check is the only difference between the two classes, so it could probably be controlled by a schemaCheck flag. Maybe that belongs in a separate PR.
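
   The suggestion above could be sketched roughly as follows: a single extractor class whose check is switched on by a flag. Every name here (`FlaggedExtractor`, the two `sketch.*` property keys, the string-valued schemas) is hypothetical and only illustrates the control flow; the real `FileAwareInputStreamExtractor` reads Avro files and is not reproduced.

```java
import java.util.Properties;

/** Hypothetical single extractor whose schema check is toggled by configuration. */
class FlaggedExtractor {
  // Assumed property keys, for illustration only.
  static final String CHECK_ENABLED_KEY = "sketch.schema.check.enabled";
  static final String EXPECTED_SCHEMA_KEY = "sketch.expected.schema";

  private final Properties props;
  private final String fileSchema; // stands in for the schema read from the file

  FlaggedExtractor(Properties props, String fileSchema) {
    this.props = props;
    this.fileSchema = fileSchema;
  }

  /** Returns the "stream" (here just the schema string) or aborts on a mismatch. */
  String buildStream() {
    boolean checkEnabled = Boolean.parseBoolean(props.getProperty(CHECK_ENABLED_KEY, "false"));
    if (checkEnabled && !fileSchema.equals(props.getProperty(EXPECTED_SCHEMA_KEY))) {
      throw new IllegalStateException("Schema does not match the expected schema");
    }
    return fileSchema;
  }
}
```

   With the flag defaulting to `false`, existing jobs would keep their current behavior, which is the usual argument for a flag over a subclass.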
 



Issue Time Tracking
---

Worklog Id: (was: 244813)
Time Spent: 1h 50m  (was: 1h 40m)



[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244177&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244177
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 17/May/19 18:19
Start Date: 17/May/19 18:19
Worklog Time Spent: 10m 
  Work Description: autumnust commented on issue #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: 
https://github.com/apache/incubator-gobblin/pull/2637#issuecomment-493550408
 
 
   @ZihanLi58 Travis is failing, can you check if it is related to your 
changes? 
 



Issue Time Tracking
---

Worklog Id: (was: 244177)
Time Spent: 1h 40m  (was: 1.5h)



[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244176&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244176
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 17/May/19 18:18
Start Date: 17/May/19 18:18
Worklog Time Spent: 10m 
  Work Description: autumnust commented on issue #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: 
https://github.com/apache/incubator-gobblin/pull/2637#issuecomment-493550408
 
 
   @ZihanLi58 Travis is failing
 



Issue Time Tracking
---

Worklog Id: (was: 244176)
Time Spent: 1.5h  (was: 1h 20m)



[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244136&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244136
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 17/May/19 16:54
Start Date: 17/May/19 16:54
Worklog Time Spent: 10m 
  Work Description: ZihanLi58 commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285206877
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java
 ##
 @@ -39,32 +42,109 @@
  * check if the schema matches the expected schema. If not it will abort the job.
  */
 
-public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor{
+public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor {
 
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state)
-  {
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state) {
     super(fs, file, state);
   }
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file)
-  {
+
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file) {
     this(fs, file, null);
   }
 
   @Override
-  protected FileAwareInputStream buildStream(FileSystem fsFromFile)
-      throws DataRecordException, IOException{
-    if(!schemaChecking(fsFromFile))
-    {
+  protected FileAwareInputStream buildStream(FileSystem fsFromFile) throws DataRecordException, IOException {
+    if (!schemaChecking(fsFromFile)) {
       throw new DataRecordException("Schema does not match the expected schema");
     }
     return super.buildStream(fsFromFile);
   }
 
-  protected boolean schemaChecking(FileSystem fsFromFile)
-      throws IOException {
+  protected boolean schemaChecking(FileSystem fsFromFile) throws IOException {
     DatumReader datumReader = new GenericDatumReader<>();
-    DataFileReader dataFileReader = new DataFileReader(new FsInput(this.file.getFileStatus().getPath(),fsFromFile), datumReader);
+    DataFileReader dataFileReader =
+        new DataFileReader(new FsInput(this.file.getFileStatus().getPath(), fsFromFile), datumReader);
     Schema schema = dataFileReader.getSchema();
-    return schema.toString().equals(this.state.getProp(ConfigurationKeys.COPY_EXPECTED_SCHEMA));
+    Schema expectedSchema = new Schema.Parser().parse(this.state.getProp(ConfigurationKeys.COPY_EXPECTED_SCHEMA));
+
+    return compare(schema, expectedSchema);
+  }
+
+  private boolean compare(Schema toValidate, Schema expected) {
+    if (toValidate.getType() != expected.getType() || !toValidate.getName().equals(expected.getName())) {return false;}
+    else {
+      switch (toValidate.getType()) {
+        case NULL:
+        case BOOLEAN:
+        case INT:
+        case LONG:
+        case FLOAT:
+        case DOUBLE:
+        case BYTES:
+        case STRING: {
+          return true;
+        }
+        case ARRAY: {
+          return compare(toValidate.getElementType(), expected.getElementType());
+        }
+        case MAP: {
+          return compare(toValidate.getValueType(), expected.getValueType());
+        }
+        case FIXED: {
+          // fixed size and name must match:
+          if (toValidate.getFixedSize() != expected.getFixedSize()) {
+            return false;
+          }
+        }
+        case ENUM: {
+          // expected symbols must contain all toValidate symbols:
+          final Set expectedSymbols = new HashSet(expected.getEnumSymbols());
+          final Set toValidateSymbols = new HashSet(toValidate.getEnumSymbols());
+          if (expectedSymbols.size() != toValidateSymbols.size()) {
+            return false;
+          }
+          if (!expectedSymbols.containsAll(toValidateSymbols)) {
+            return false;
+          }
+        }
+
+        case RECORD: {
+          // Check that each field of toValidate schema is in expected schema
+          if(toValidate.getFields().size() != expected.getFields().size()) {return false;}
+          for (final Schema.Field expectedFiled : expected.getFields()) {
+            final Schema.Field toValidateField = toValidate.getField(expectedFiled.name());
+            if (toValidateField == null) {
+              // expected field does not correspond to any field in the toValidate record schema
+              return false;
+            } else {
+              if (!compare(toValidateField.schema(), expectedFiled.schema())) {
+                return false;
+              }
+            }
+          }
+          return true;
+        }
+        case UNION: 

[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244133&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244133
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 17/May/19 16:48
Start Date: 17/May/19 16:48
Worklog Time Spent: 10m 
  Work Description: ZihanLi58 commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285204856
 
 

 ##
 File path: 
gobblin-restli/gobblin-throttling-service/gobblin-throttling-service-api/src/main/snapshot/org.apache.gobblin.restli.throttling.permits.snapshot.json
 ##
 @@ -17,6 +17,18 @@
   "type" : "long",
   "doc" : "Client should not try to acquire permits before this delay has passed.",
   "optional" : true
+}, {
 
 Review comment:
   Every time I build the project, this change is made automatically.
 



Issue Time Tracking
---

Worklog Id: (was: 244133)
Time Spent: 1h 10m  (was: 1h)



[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244125&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244125
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 17/May/19 16:30
Start Date: 17/May/19 16:30
Worklog Time Spent: 10m 
  Work Description: autumnust commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285198718
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java
 ##

[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244124&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244124
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 17/May/19 16:30
Start Date: 17/May/19 16:30
Worklog Time Spent: 10m 
  Work Description: autumnust commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285199219
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java
 ##
 @@ -39,32 +42,109 @@
  * check if the schema matches the expected schema. If not it will abort the job.
  */
 
-public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor{
+public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor {
 
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state)
-  {
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state) {
     super(fs, file, state);
   }
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file)
-  {
+
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file) {
     this(fs, file, null);
   }
 
   @Override
-  protected FileAwareInputStream buildStream(FileSystem fsFromFile)
-      throws DataRecordException, IOException{
-    if(!schemaChecking(fsFromFile))
-    {
+  protected FileAwareInputStream buildStream(FileSystem fsFromFile) throws DataRecordException, IOException {
+    if (!schemaChecking(fsFromFile)) {
       throw new DataRecordException("Schema does not match the expected schema");
     }
     return super.buildStream(fsFromFile);
   }
 
-  protected boolean schemaChecking(FileSystem fsFromFile)
-      throws IOException {
+  protected boolean schemaChecking(FileSystem fsFromFile) throws IOException {
     DatumReader datumReader = new GenericDatumReader<>();
-    DataFileReader dataFileReader = new DataFileReader(new FsInput(this.file.getFileStatus().getPath(),fsFromFile), datumReader);
+    DataFileReader dataFileReader =
+        new DataFileReader(new FsInput(this.file.getFileStatus().getPath(), fsFromFile), datumReader);
     Schema schema = dataFileReader.getSchema();
-    return schema.toString().equals(this.state.getProp(ConfigurationKeys.COPY_EXPECTED_SCHEMA));
+    Schema expectedSchema = new Schema.Parser().parse(this.state.getProp(ConfigurationKeys.COPY_EXPECTED_SCHEMA));
+
+    return compare(schema, expectedSchema);
+  }
+
+  private boolean compare(Schema toValidate, Schema expected) {
 
 Review comment:
   Depending on the scope of your schema checking, this method could be reused for other purposes. Maybe make it a static method or move it into a utility class?
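
   One way to read this suggestion is a stateless static helper, sketched below. To stay runnable with only the JDK it works on a hypothetical `SchemaNode` type rather than Avro's `Schema`, and `SchemaCompareUtils` is an invented name. Each case returns explicitly; in the `compare` diff quoted in this thread, the `FIXED` and `ENUM` cases end without a `return`, so a matching schema would fall through into the next case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical, simplified schema node; a stand-in for Avro's Schema. */
final class SchemaNode {
  enum Type { INT, STRING, ARRAY, ENUM, RECORD }

  final Type type;
  final String name;                                      // ENUM and RECORD only
  SchemaNode element;                                     // ARRAY only
  Set<String> symbols;                                    // ENUM only
  final Map<String, SchemaNode> fields = new LinkedHashMap<>(); // RECORD only

  private SchemaNode(Type type, String name) {
    this.type = type;
    this.name = name;
  }

  static SchemaNode primitive(Type type) {
    return new SchemaNode(type, null);
  }

  static SchemaNode array(SchemaNode element) {
    SchemaNode n = new SchemaNode(Type.ARRAY, null);
    n.element = element;
    return n;
  }

  static SchemaNode enumOf(String name, String... symbols) {
    SchemaNode n = new SchemaNode(Type.ENUM, name);
    n.symbols = new HashSet<>(Arrays.asList(symbols));
    return n;
  }

  static SchemaNode record(String name) {
    return new SchemaNode(Type.RECORD, name);
  }

  /** Adds a field to a record node and returns the node for chaining. */
  SchemaNode field(String fieldName, SchemaNode schema) {
    fields.put(fieldName, schema);
    return this;
  }
}

/** Stateless helper, so any caller can reuse the comparison. */
final class SchemaCompareUtils {
  private SchemaCompareUtils() { }

  static boolean compare(SchemaNode toValidate, SchemaNode expected) {
    if (toValidate.type != expected.type || !Objects.equals(toValidate.name, expected.name)) {
      return false;
    }
    switch (toValidate.type) {
      case INT:
      case STRING:
        return true;
      case ARRAY:
        return compare(toValidate.element, expected.element);
      case ENUM:
        // Both schemas must declare the same set of symbols.
        return toValidate.symbols.equals(expected.symbols);
      case RECORD:
        if (toValidate.fields.size() != expected.fields.size()) {
          return false;
        }
        for (Map.Entry<String, SchemaNode> e : expected.fields.entrySet()) {
          SchemaNode actual = toValidate.fields.get(e.getKey());
          if (actual == null || !compare(actual, e.getValue())) {
            return false;
          }
        }
        return true;
      default:
        return false;
    }
  }
}
```

   Fields are looked up by name, so field order does not affect the result; that mirrors the `getField` lookup in the quoted diff.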
 



Issue Time Tracking
---

Worklog Id: (was: 244124)
Time Spent: 0.5h  (was: 20m)



[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244127&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244127
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 17/May/19 16:30
Start Date: 17/May/19 16:30
Worklog Time Spent: 10m 
  Work Description: autumnust commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285190404
 
 

 ##
 File path: 
gobblin-restli/gobblin-throttling-service/gobblin-throttling-service-api/src/main/snapshot/org.apache.gobblin.restli.throttling.permits.snapshot.json
 ##
 @@ -17,6 +17,18 @@
   "type" : "long",
   "doc" : "Client should not try to acquire permits before this delay has passed.",
   "optional" : true
+}, {
 
 Review comment:
   Is this change from another PR? Can you rebase onto the latest changes so that it won't be included?
 



Issue Time Tracking
---

Worklog Id: (was: 244127)
Time Spent: 1h  (was: 50m)

> Implement Schema Comparison Strategy during Disctp
> --
>
> Key: GOBBLIN-772
> URL: https://issues.apache.org/jira/browse/GOBBLIN-772
> Project: Apache Gobblin
>  Issue Type: Task
>Reporter: Zihan Li
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We need a schema comparison strategy to make sure the real schema and the 
> expected schema have matching field names and types.





[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=244126&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-244126
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 17/May/19 16:30
Start Date: 17/May/19 16:30
Worklog Time Spent: 10m 
  Work Description: autumnust commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637#discussion_r285198515
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/extractor/FileAwareInputStreamExtractorWithCheckSchema.java
 ##
 @@ -39,32 +42,109 @@
  * check if the schema matches the expected schema. If not it will abort the job.
  */
 
-public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor{
+public class FileAwareInputStreamExtractorWithCheckSchema extends FileAwareInputStreamExtractor {
 
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state)
-  {
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file, WorkUnitState state) {
     super(fs, file, state);
   }
-  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file)
-  {
+
+  public FileAwareInputStreamExtractorWithCheckSchema(FileSystem fs, CopyableFile file) {
     this(fs, file, null);
   }
 
   @Override
-  protected FileAwareInputStream buildStream(FileSystem fsFromFile)
-      throws DataRecordException, IOException{
-    if(!schemaChecking(fsFromFile))
-    {
+  protected FileAwareInputStream buildStream(FileSystem fsFromFile) throws DataRecordException, IOException {
+    if (!schemaChecking(fsFromFile)) {
       throw new DataRecordException("Schema does not match the expected schema");
     }
     return super.buildStream(fsFromFile);
   }
 
-  protected boolean schemaChecking(FileSystem fsFromFile)
-      throws IOException {
+  protected boolean schemaChecking(FileSystem fsFromFile) throws IOException {
 
 Review comment:
   What is the scope of this schema check? Does it verify that every field in 
`toValidateSchema` also appears in the expected schema? Please add Javadoc 
for this. 
 



Issue Time Tracking
---

Worklog Id: (was: 244126)
Time Spent: 50m  (was: 40m)

> Implement Schema Comparison Strategy during Disctp
> --
>
> Key: GOBBLIN-772
> URL: https://issues.apache.org/jira/browse/GOBBLIN-772
> Project: Apache Gobblin
>  Issue Type: Task
>Reporter: Zihan Li
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We need a schema comparison strategy to make sure the real schema and the 
> expected schema have matching field names and types.





[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-16 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=243713&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-243713
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 16/May/19 22:59
Start Date: 16/May/19 22:59
Worklog Time Spent: 10m 
  Work Description: ZihanLi58 commented on pull request #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: https://github.com/apache/incubator-gobblin/pull/2637
 
 
   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I 
have checked off all the steps below!
   
   
   ### JIRA
   - [ ] My PR addresses the following [Gobblin 
JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references 
them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
   - https://issues.apache.org/jira/browse/GOBBLIN-772
   
   
   ### Description
   - [ ] Here are some details about my PR, including screenshots (if 
applicable):
   We need a schema comparison strategy to make sure the real schema and the 
expected schema have matching field names and types.
   
   ### Tests
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   Make the real schema and the expected schema differ in a field name or type 
and make sure the method returns false. Then make them match in name and type 
but differ in doc, and make sure the method returns true.
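
A minimal sketch of the behaviour these tests describe, assuming a simplified schema model rather than the Avro classes the PR works with. `SchemaCompareSketch`, `Field`, and `matches` are hypothetical names used only for illustration, and the sketch additionally assumes that field order does not matter, which the description above does not state.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; this is not the class added in the PR. */
public class SchemaCompareSketch {

  /** Simplified stand-in for a schema field: a name, a type, and a doc string. */
  public static final class Field {
    final String name;
    final String type;
    final String doc;

    public Field(String name, String type, String doc) {
      this.name = name;
      this.type = type;
      this.doc = doc;
    }
  }

  /**
   * Returns true when both schemas contain the same field names and each
   * name maps to the same type. The doc string is deliberately ignored.
   */
  public static boolean matches(Iterable<Field> expected, Iterable<Field> actual) {
    // Map equality compares entries, so field order is ignored (an assumption).
    return nameToType(expected).equals(nameToType(actual));
  }

  /** Collects field name -> type; doc is dropped here on purpose. */
  private static Map<String, String> nameToType(Iterable<Field> fields) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Field field : fields) {
      result.put(field.name, field.type);
    }
    return result;
  }
}
```

With this model, two schemas that differ only in `doc` compare as matching, while a renamed field or a changed type makes `matches` return false.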
   
   ### Commits
   - [ ] My commits all reference JIRA issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
   1. Subject is separated from body by a blank line
   2. Subject is limited to 50 characters
   3. Subject does not end with a period
   4. Subject uses the imperative mood ("add", not "adding")
   5. Body wraps at 72 characters
   6. Body explains "what" and "why", not "how"
   
   
 



Issue Time Tracking
---

Worklog Id: (was: 243713)
Time Spent: 10m
Remaining Estimate: 0h

> Implement Schema Comparison Strategy during Disctp
> --
>
> Key: GOBBLIN-772
> URL: https://issues.apache.org/jira/browse/GOBBLIN-772
> Project: Apache Gobblin
>  Issue Type: Task
>Reporter: Zihan Li
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We need a schema comparison strategy to make sure the real schema and the 
> expected schema have matching field names and types.





[jira] [Work logged] (GOBBLIN-772) Implement Schema Comparison Strategy during Disctp

2019-05-16 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-772?focusedWorklogId=243716&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-243716
 ]

ASF GitHub Bot logged work on GOBBLIN-772:
--

Author: ASF GitHub Bot
Created on: 16/May/19 23:02
Start Date: 16/May/19 23:02
Worklog Time Spent: 10m 
  Work Description: ZihanLi58 commented on issue #2637: 
[GOBBLIN-772]Implement Schema Comparison Strategy during Disctp
URL: 
https://github.com/apache/incubator-gobblin/pull/2637#issuecomment-493259924
 
 
   @ibuenros @autumnust Can you take a look at this code?
 



Issue Time Tracking
---

Worklog Id: (was: 243716)
Time Spent: 20m  (was: 10m)

> Implement Schema Comparison Strategy during Disctp
> --
>
> Key: GOBBLIN-772
> URL: https://issues.apache.org/jira/browse/GOBBLIN-772
> Project: Apache Gobblin
>  Issue Type: Task
>Reporter: Zihan Li
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We need a schema comparison strategy to make sure the real schema and the 
> expected schema have matching field names and types.


