[jira] [Created] (GOBBLIN-514) AvroUtils#parseSchemaFromFile fails when characters are written with Modified UTF-8 encoding
Aditya Sharma created GOBBLIN-514:
-------------------------------------

Summary: AvroUtils#parseSchemaFromFile fails when characters are written with Modified UTF-8 encoding
Key: GOBBLIN-514
URL: https://issues.apache.org/jira/browse/GOBBLIN-514
Project: Apache Gobblin
Issue Type: Bug
Reporter: Aditya Sharma
Assignee: Aditya Sharma

Schema.Parser#parse(InputStream) reads the bytes with the UTF-8 character encoding and fails when the data is encoded with Modified UTF-8. Reading the schema file into a String and then converting that String to a Schema should solve this problem.

As part of [https://github.com/apache/incubator-gobblin/pull/2355], the schema is created from Hive columns and written to disk using Modified UTF-8. When such a file is read using Schema.Parser#parse(InputStream), it fails.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
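The mismatch described above can be reproduced with the JDK alone, since DataOutputStream.writeUTF is exactly the Modified UTF-8 writer in question. The sketch below (not Gobblin code; class and method names are illustrative) shows why a strict UTF-8 reader rejects such bytes, and why decoding back to characters first sidesteps the problem:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    /** Bytes as DataOutputStream.writeUTF emits them: 2-byte length prefix + Modified UTF-8. */
    static byte[] writeUtf(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The Modified UTF-8 payload with the length prefix stripped off. */
    static byte[] modifiedUtf8(String s) {
        byte[] buf = writeUtf(s);
        return Arrays.copyOfRange(buf, 2, buf.length);
    }

    /** Round trip: decoding with readUTF recovers the original characters. */
    static String readBack(String s) {
        try {
            return new DataInputStream(new ByteArrayInputStream(writeUtf(s))).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A NUL character plus a supplementary character (an emoji).
        String s = "a\u0000b\uD83D\uDE00";
        // Modified UTF-8 encodes U+0000 as 0xC0 0x80 and supplementary
        // characters as two 3-byte surrogates, so the result is not valid
        // standard UTF-8 and a strict UTF-8 decoder rejects it.
        System.out.println(Arrays.equals(modifiedUtf8(s), s.getBytes(StandardCharsets.UTF_8))); // false
        // Reading the file back as characters first yields a String that can
        // be handed safely to the schema parser.
        System.out.println(readBack(s).equals(s)); // true
    }
}
```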
[jira] [Commented] (GOBBLIN-478) Lineage events are not getting emitted during Avro2ORC conversion
[ https://issues.apache.org/jira/browse/GOBBLIN-478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454488#comment-16454488 ]

Aditya Sharma commented on GOBBLIN-478:
---------------------------------------

For this lineage event, the source information is set by HiveAvroToOrcSource and the destination information is set by HiveConvertPublisher. HiveConvertPublisher gets a collection of WorkUnitStates. On invoking the getWorkUnit() method on a WorkUnitState, the publisher expected a HiveWorkUnit, but the returned work unit was an ImmutableWorkUnit.

> Lineage events are not getting emitted during Avro2ORC conversion
> -----------------------------------------------------------------
>
> Key: GOBBLIN-478
> URL: https://issues.apache.org/jira/browse/GOBBLIN-478
> Project: Apache Gobblin
> Issue Type: Bug
> Reporter: Aditya Sharma
> Assignee: Aditya Sharma
> Priority: Major
>
> Job execution logs have the following line:
> Submitted 0 lineage events for dataset
[jira] [Created] (GOBBLIN-478) Lineage events are not getting emitted during Avro2ORC conversion
Aditya Sharma created GOBBLIN-478:
-------------------------------------

Summary: Lineage events are not getting emitted during Avro2ORC conversion
Key: GOBBLIN-478
URL: https://issues.apache.org/jira/browse/GOBBLIN-478
Project: Apache Gobblin
Issue Type: Bug
Reporter: Aditya Sharma
Assignee: Aditya Sharma
[jira] [Created] (GOBBLIN-463) Change lineage event for Avro2Orc conversion to have underlying FileSystem as platform
Aditya Sharma created GOBBLIN-463:
-------------------------------------

Summary: Change lineage event for Avro2Orc conversion to have underlying FileSystem as platform
Key: GOBBLIN-463
URL: https://issues.apache.org/jira/browse/GOBBLIN-463
Project: Apache Gobblin
Issue Type: Improvement
Reporter: Aditya Sharma
Assignee: Aditya Sharma

Currently the lineage event for Avro2Orc conversion is from platform hive to hive. The platform should be changed to the underlying FileSystem. Lineage for a Hive table can be derived from the HDFS location of the table, if lineage for that HDFS location exists, but the reverse is not possible.
[jira] [Created] (GOBBLIN-450) Improve logging and error messages for Avro2Orc conversion
Aditya Sharma created GOBBLIN-450:
-------------------------------------

Summary: Improve logging and error messages for Avro2Orc conversion
Key: GOBBLIN-450
URL: https://issues.apache.org/jira/browse/GOBBLIN-450
Project: Apache Gobblin
Issue Type: Improvement
Components: misc
Reporter: Aditya Sharma
Assignee: Aditya Sharma

In case of failures during Avro2Orc conversion, it is difficult to find the Hive table/partition for which the exception occurred. For example:

{noformat}
Atleast one destination format should be specified at hive.conversion.avro.destinationFormats. If you do not intend to convert this dataset set hive.conversion.avro.is.blacklisted to true
{noformat}
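One way to make such errors traceable is to prefix the failing dataset's identity. The following is a sketch (not the actual Gobblin code; the method and class names are illustrative), built around the example message above:

```java
public class ConversionErrorMessage {

    // Sketch: include the table identity in the message so a failure can be
    // traced to a concrete Hive table/partition from the job logs.
    static String missingDestinationFormats(String dbName, String tableName) {
        return String.format(
            "Table %s.%s: at least one destination format should be specified at "
                + "hive.conversion.avro.destinationFormats. If you do not intend to convert "
                + "this dataset, set hive.conversion.avro.is.blacklisted to true.",
            dbName, tableName);
    }

    public static void main(String[] args) {
        System.out.println(missingDestinationFormats("tracking", "page_views"));
    }
}
```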
[jira] [Created] (GOBBLIN-414) Add lineage event for convertible hive datasets
Aditya Sharma created GOBBLIN-414:
-------------------------------------

Summary: Add lineage event for convertible hive datasets
Key: GOBBLIN-414
URL: https://issues.apache.org/jira/browse/GOBBLIN-414
Project: Apache Gobblin
Issue Type: Improvement
Reporter: Aditya Sharma
Assignee: Aditya Sharma

Convertible Hive datasets are used to convert a dataset from one format to multiple formats. Emitting lineage events will help keep track of the source and destination datasets.
[jira] [Created] (GOBBLIN-399) Refactor HiveSource#shouldCreateWorkunit() to accept table as parameter
Aditya Sharma created GOBBLIN-399:
-------------------------------------

Summary: Refactor HiveSource#shouldCreateWorkunit() to accept table as parameter
Key: GOBBLIN-399
URL: https://issues.apache.org/jira/browse/GOBBLIN-399
Project: Apache Gobblin
Issue Type: Improvement
Components: misc
Reporter: Aditya Sharma
Assignee: Aditya Sharma

HiveSource does not contain a shouldCreateWorkunit method that accepts a Table as the parameter. Classes extending HiveSource might need to decide whether to create work units based on the table; overriding such a method would let them achieve that functionality.
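The proposed hook could look like the following minimal sketch. This is not the real HiveSource; HiveTable stands in for the actual Hive Table class, and the subclass name is invented for illustration:

```java
// Hypothetical stand-in for the Hive metastore Table class; only the
// table name is needed for this sketch.
class HiveTable {
    final String name;
    HiveTable(String name) { this.name = name; }
}

class HiveSourceSketch {
    // Proposed table-level hook: subclasses override it to decide, per
    // table, whether a work unit should be created at all.
    protected boolean shouldCreateWorkunit(HiveTable table) {
        return true; // default behaviour: create a work unit for every table
    }
}

// A subclass can now filter on table properties without copying the rest
// of HiveSource's work-unit creation logic:
class PrefixFilteringHiveSource extends HiveSourceSketch {
    @Override
    protected boolean shouldCreateWorkunit(HiveTable table) {
        return table.name.startsWith("tracking_");
    }

    public static void main(String[] args) {
        HiveSourceSketch source = new PrefixFilteringHiveSource();
        System.out.println(source.shouldCreateWorkunit(new HiveTable("tracking_clicks"))); // true
        System.out.println(source.shouldCreateWorkunit(new HiveTable("dim_users")));       // false
    }
}
```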
[jira] [Updated] (GOBBLIN-376) Kafka Schema Registration for RecordWithMetadata and BytesToRecordWithMetadataConverter
[ https://issues.apache.org/jira/browse/GOBBLIN-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Sharma updated GOBBLIN-376:
----------------------------------

Description:
RecordWithMetadataSchemaRegistrationConverter: An Avro schema might need to be registered with Kafka. Using RecordWithMetadata, the registered schema id can be added to the record metadata. Hence, a converter which takes a RecordWithMetadata, registers the Avro schema with the Kafka schema registry, adds the schema id to the metadata and returns a RecordWithMetadata will be useful.

BytesToRecordWithMetadata converter: To wrap bytes along with metadata.

Use case: Needed a converter which takes a serialized Avro record (byte array) and a schema as a string, and returns the record along with the registered schemaId. A RecordWithMetadata object can be used to carry the schemaId along with the record (in this case a byte array). Hence the above two converters, applied in the order BytesToRecordWithMetadata then RecordWithMetadataSchemaRegistrationConverter, will do the work.

was:
An Avro Schema might need to be registered to Kafka. Using RecordWithMetadata registered schema id can be added to record metadata. Hence adding a converter with takes RecordWithMetadata, registers Avro schema with Kafka schema registry, adds the schema id to metadata and returns RecordWithMetadata will be useful. BytestoRecordWithMetadata converter: To wrap bytes along with metadata Use case: Needed a converter which takes serialized avro record (byte array) and schema as string, and returns the record along with the registered schemaId. RecordWithMetadata object can be used to carry the schemaId along with the record(in this case byte array).
> Kafka Schema Registration for RecordWithMetadata and
> BytesToRecordWithMetadataConverter
> ------------------------------------------------------
>
> Key: GOBBLIN-376
> URL: https://issues.apache.org/jira/browse/GOBBLIN-376
> Project: Apache Gobblin
> Issue Type: New Feature
> Components: misc
> Reporter: Aditya Sharma
> Assignee: Aditya Sharma
> Priority: Major
>
> RecordWithMetadataSchemaRegistrationConverter: An Avro schema might need to
> be registered with Kafka. Using RecordWithMetadata, the registered schema id
> can be added to the record metadata. Hence, a converter which takes a
> RecordWithMetadata, registers the Avro schema with the Kafka schema
> registry, adds the schema id to the metadata and returns a
> RecordWithMetadata will be useful.
> BytesToRecordWithMetadata converter: To wrap bytes along with metadata.
> Use case: Needed a converter which takes a serialized Avro record (byte
> array) and a schema as a string, and returns the record along with the
> registered schemaId.
> A RecordWithMetadata object can be used to carry the schemaId along with the
> record (in this case a byte array). Hence the above two converters, applied
> in the order BytesToRecordWithMetadata then
> RecordWithMetadataSchemaRegistrationConverter, will do the work.
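The two-step chain described above could be sketched as follows. This is not Gobblin's converter API; RecordWithMetadata is reduced to a hypothetical minimal stand-in, and the registry call is stubbed with a placeholder id:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal stand-in for Gobblin's RecordWithMetadata:
// a payload plus a mutable metadata map.
class RecordWithMetadata<T> {
    final T record;
    final Map<String, Object> metadata;

    RecordWithMetadata(T record, Map<String, Object> metadata) {
        this.record = record;
        this.metadata = metadata;
    }
}

class ConverterChainSketch {
    // Step 1 (BytesToRecordWithMetadata): wrap the serialized Avro record.
    static RecordWithMetadata<byte[]> wrapBytes(byte[] serialized) {
        return new RecordWithMetadata<>(serialized, new HashMap<>());
    }

    // Step 2 (RecordWithMetadataSchemaRegistrationConverter): register the
    // schema (stubbed here; a real implementation would call the Kafka
    // schema registry) and stash the returned id in the metadata.
    static RecordWithMetadata<byte[]> registerSchema(RecordWithMetadata<byte[]> in,
                                                     String avroSchemaJson) {
        String schemaId = "id-" + Integer.toHexString(avroSchemaJson.hashCode());
        in.metadata.put("schemaId", schemaId);
        return in;
    }

    public static void main(String[] args) {
        RecordWithMetadata<byte[]> out =
            registerSchema(wrapBytes(new byte[]{1, 2, 3}), "{\"type\":\"bytes\"}");
        System.out.println(out.metadata.get("schemaId") != null); // true
    }
}
```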
[jira] [Created] (GOBBLIN-313) Option to explicitly set group name for staging and final destination directories for Avro-To-Orc conversion
Aditya Sharma created GOBBLIN-313:
-------------------------------------

Summary: Option to explicitly set group name for staging and final destination directories for Avro-To-Orc conversion
Key: GOBBLIN-313
URL: https://issues.apache.org/jira/browse/GOBBLIN-313
Project: Apache Gobblin
Issue Type: Improvement
Reporter: Aditya Sharma
Assignee: Aditya Sharma

Currently the Avro-To-Orc conversion job tries to preserve the group name during conversion; that is, the group name of the destination directory will be the same as that of the source directory. There should be an option to explicitly define the group name for the top-level destination directory and its immediate child directories (staging/final directory).
[jira] [Created] (GOBBLIN-297) Changing access modifier to Protected for HiveSource and Watermarker classes
Aditya Sharma created GOBBLIN-297:
-------------------------------------

Summary: Changing access modifier to Protected for HiveSource and Watermarker classes
Key: GOBBLIN-297
URL: https://issues.apache.org/jira/browse/GOBBLIN-297
Project: Apache Gobblin
Issue Type: Improvement
Reporter: Aditya Sharma
Assignee: Aditya Sharma

The access modifier is private for most of the variables/methods in HiveSource, PartitionLevelWatermarker and TableLevelWatermarker. Extending these classes results in duplication of code. Hence, the access modifier should be changed to protected in the required places.
[jira] [Created] (GOBBLIN-267) HiveSource creates workunit even when update time is before maxLookBackDays
Aditya Sharma created GOBBLIN-267:
-------------------------------------

Summary: HiveSource creates workunit even when update time is before maxLookBackDays
Key: GOBBLIN-267
URL: https://issues.apache.org/jira/browse/GOBBLIN-267
Project: Apache Gobblin
Issue Type: Bug
Components: misc
Reporter: Aditya Sharma
Assignee: Aditya Sharma

org.apache.gobblin.data.management.conversion.hive.source.HiveSource creates a workunit if:
1) the create time is after maxLookBackDays
2) the update time is greater than the watermark

Since there are multiple policies to decide the update time, it can happen that the create time is greater than the update time, in which case the maxLookBackDays check becomes redundant. HiveSource should check maxLookBackDays against the update time.
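The proposed check can be stated compactly. The sketch below is not HiveSource itself (the class and method are illustrative, and times are simplified to java.time.Instant), but it captures the intended logic of testing the lookback window against the update time:

```java
import java.time.Duration;
import java.time.Instant;

public class LookbackCheck {

    /**
     * Proposed check (sketch): create a work unit only when the *update* time
     * is both inside the lookback window and past the watermark. Checking the
     * create time alone makes maxLookBackDays ineffective whenever a policy
     * yields an update time earlier than the create time.
     */
    static boolean shouldCreateWorkunit(Instant updateTime, Instant watermark,
                                        Instant now, int maxLookBackDays) {
        Instant cutoff = now.minus(Duration.ofDays(maxLookBackDays));
        return updateTime.isAfter(cutoff) && updateTime.isAfter(watermark);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2018-04-01T00:00:00Z");
        // An update older than the 3-day lookback window is skipped, even if
        // the table itself was *created* recently.
        System.out.println(shouldCreateWorkunit(
            Instant.parse("2018-03-01T00:00:00Z"),
            Instant.parse("2018-02-01T00:00:00Z"), now, 3)); // false
    }
}
```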
[jira] [Created] (GOBBLIN-247) avro-to-orc conversion validation job should fail only on data mismatch
Aditya Sharma created GOBBLIN-247:
-------------------------------------

Summary: avro-to-orc conversion validation job should fail only on data mismatch
Key: GOBBLIN-247
URL: https://issues.apache.org/jira/browse/GOBBLIN-247
Project: Apache Gobblin
Issue Type: Bug
Reporter: Aditya Sharma

The Avro-to-ORC validation job fails if there is a failure in the validation query. It should fail only when there is a data mismatch.
[jira] [Created] (GOBBLIN-223) CsvToJsonConverter should throw DataConversionException
Aditya Sharma created GOBBLIN-223:
-------------------------------------

Summary: CsvToJsonConverter should throw DataConversionException
Key: GOBBLIN-223
URL: https://issues.apache.org/jira/browse/GOBBLIN-223
Project: Apache Gobblin
Issue Type: Bug
Reporter: Aditya Sharma
Assignee: Aditya Sharma

CsvToJsonConverter should throw a DataConversionException when it fails to convert a record. This can happen due to a schema mismatch; one example is while processing a footer. If it throws a DataConversionException, the failure can be handled when the property "task.skip.err.records" is set.
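The behaviour being asked for might look like this sketch. It is not the real CsvToJsonConverter (the exception class is a stand-in for Gobblin's DataConversionException, and the conversion is deliberately naive), but it shows a footer row surfacing as a typed, skippable exception rather than an unhandled failure:

```java
// Hypothetical stand-in for gobblin's DataConversionException.
class DataConversionException extends Exception {
    DataConversionException(String msg) { super(msg); }
}

class CsvToJsonSketch {

    // Converts one CSV line against an expected column list. A row with the
    // wrong field count (e.g. a footer) triggers DataConversionException,
    // which the task runner can then skip under an error-record policy.
    static String convert(String csvLine, String[] columns) throws DataConversionException {
        String[] fields = csvLine.split(",", -1);
        if (fields.length != columns.length) {
            throw new DataConversionException(
                "CSV record has " + fields.length + " fields, schema expects " + columns.length);
        }
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) json.append(",");
            json.append("\"").append(columns[i]).append("\":\"").append(fields[i]).append("\"");
        }
        return json.append("}").toString();
    }

    public static void main(String[] args) throws DataConversionException {
        System.out.println(convert("1,foo", new String[]{"id", "name"})); // {"id":"1","name":"foo"}
    }
}
```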
[jira] [Created] (GOBBLIN-177) Allow error limit to skip records which are not convertible
Aditya Sharma created GOBBLIN-177:
-------------------------------------

Summary: Allow error limit to skip records which are not convertible
Key: GOBBLIN-177
URL: https://issues.apache.org/jira/browse/GOBBLIN-177
Project: Apache Gobblin
Issue Type: New Feature
Reporter: Aditya Sharma
Priority: Minor

Gobblin currently fails if a record is not convertible by the converter class. Allowing an error limit will help users significantly.

Use case: If a CSV file contains a footer which needs to be skipped, then by setting an error limit of 1, users will be able to ETL data from the CSV file.
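In job-configuration terms, the footer use case might look like the fragment below. This is a sketch only: the property name is the one referenced in GOBBLIN-223 above, and its exact semantics as an error *limit* are an assumption, not a confirmed API.

```properties
# Sketch: tolerate one non-convertible record per task so that a CSV
# footer row does not fail the whole job.
task.skip.err.records=1
```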
[jira] [Created] (GOBBLIN-176) Gobblin build is failing with missing dependency jetty-http
Aditya Sharma created GOBBLIN-176:
-------------------------------------

Summary: Gobblin build is failing with missing dependency jetty-http
Key: GOBBLIN-176
URL: https://issues.apache.org/jira/browse/GOBBLIN-176
Project: Apache Gobblin
Issue Type: Bug
Reporter: Aditya Sharma

The gobblin-restli module fails to build on Travis as well as locally. The class FlowConfigTest.java imports "org.eclipse.jetty.http.HttpStatus", which is present in the jar org.eclipse.jetty:jetty-http.

Error summary:
/gobblin-github/gobblin-restli/gobblin-flow-config-service/gobblin-flow-config-service-client/src/test/java/gobblin/service/FlowConfigTest.java:26: error: package org.eclipse.jetty.http does not exist
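A likely fix is to declare the test dependency explicitly in the affected module's build.gradle. The fragment below is a sketch: the configuration name and the Jetty version shown are assumptions and should be aligned with the Jetty version used elsewhere in the build.

```groovy
// gobblin-flow-config-service-client/build.gradle (sketch)
dependencies {
    // FlowConfigTest imports org.eclipse.jetty.http.HttpStatus, so the
    // jetty-http jar must be on the test compile classpath.
    testCompile 'org.eclipse.jetty:jetty-http:9.2.14.v20151106'
}
```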
[jira] [Assigned] (GOBBLIN-176) Gobblin build is failing with missing dependency jetty-http
[ https://issues.apache.org/jira/browse/GOBBLIN-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Sharma reassigned GOBBLIN-176:
-------------------------------------

Assignee: Aditya Sharma

> Gobblin build is failing with missing dependency jetty-http
> -----------------------------------------------------------
>
> Key: GOBBLIN-176
> URL: https://issues.apache.org/jira/browse/GOBBLIN-176
> Project: Apache Gobblin
> Issue Type: Bug
> Reporter: Aditya Sharma
> Assignee: Aditya Sharma
>
> The gobblin-restli module fails to build on Travis as well as locally. The
> class FlowConfigTest.java imports "org.eclipse.jetty.http.HttpStatus", which
> is present in the jar org.eclipse.jetty:jetty-http.
> Error summary:
> /gobblin-github/gobblin-restli/gobblin-flow-config-service/gobblin-flow-config-service-client/src/test/java/gobblin/service/FlowConfigTest.java:26:
> error: package org.eclipse.jetty.http does not exist