[jira] [Created] (GOBBLIN-514) AvroUtils#parseSchemaFromFile fails when characters are written with Modified UTF-8 encoding

2018-06-12 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-514:
-

 Summary: AvroUtils#parseSchemaFromFile fails when characters are 
written with Modified UTF-8 encoding
 Key: GOBBLIN-514
 URL: https://issues.apache.org/jira/browse/GOBBLIN-514
 Project: Apache Gobblin
  Issue Type: Bug
Reporter: Aditya Sharma
Assignee: Aditya Sharma


Schema.Parser()#parse(InputStream) reads the bytes using the UTF-8 character 
encoding and fails when the data is encoded with modified UTF-8. Reading the 
schema file with UTF-8 and then converting it to a schema should solve this 
problem.

As part of [https://github.com/apache/incubator-gobblin/pull/2355], the schema 
is created from Hive columns and written to disk using modified UTF-8. When 
such a file is read using Schema.Parser()#parse(InputStream), parsing fails.
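
To illustrate the encoding difference, here is a minimal sketch that uses only 
the JDK. It assumes the file was written with DataOutputStream#writeUTF, which 
the report does not confirm; the class and method names are made up for this 
example and are not part of Gobblin or Avro.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Gobblin or Avro.
class ModifiedUtf8Demo {

    // Encodes the way DataOutputStream#writeUTF does: a 2-byte length prefix
    // followed by the string in modified UTF-8.
    static byte[] writeModifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decodes bytes produced by writeUTF back into the original string.
    static String readModifiedUtf8(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes the same bytes as standard UTF-8, which is what a plain
    // UTF-8 reader would do with the raw file content.
    static String decodeAsStandardUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A schema-like JSON string whose doc field holds a supplementary character.
        String schemaJson = "{\"type\":\"string\",\"doc\":\"\uD83D\uDE00\"}";
        byte[] onDisk = writeModifiedUtf8(schemaJson);

        System.out.println("readUTF round trip matches: "
            + readModifiedUtf8(onDisk).equals(schemaJson));
        System.out.println("standard UTF-8 decode matches: "
            + decodeAsStandardUtf8(onDisk).equals(schemaJson));
    }
}
```

In this sketch the standard UTF-8 decode cannot match the original, because 
writeUTF adds a length prefix and encodes supplementary characters as 
surrogate pairs. The string recovered by readUTF could then be handed to 
Schema.Parser#parse(String).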



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (GOBBLIN-478) Lineage events are not getting emitted during Avro2ORC conversion

2018-04-26 Thread Aditya Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/GOBBLIN-478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454488#comment-16454488
 ] 

Aditya Sharma commented on GOBBLIN-478:
---

For this lineage event, the source information is set by HiveAvroToOrcSource 
and the destination information is set by HiveConvertPublisher. 
HiveConvertPublisher receives a collection of WorkUnitStates. When it invoked 
getWorkUnit() on a WorkUnitState, the publisher expected a HiveWorkUnit, but 
the returned work unit was an ImmutableWorkUnit.
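
The mismatch can be modelled with a small self-contained sketch. All classes 
below are simplified, hypothetical stand-ins that only borrow the names used in 
this comment; they are not Gobblin's real classes, and the re-wrapping shown in 
tableNameOf is one possible workaround, not necessarily the fix that was 
applied.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; they only mimic the shape of the problem.
class WorkUnitCastDemo {

    // Stand-in for a work unit: a bag of string properties.
    static class WorkUnit {
        protected final Map<String, String> props = new HashMap<>();

        WorkUnit() {}

        // Copy constructor: takes over the properties of another work unit.
        WorkUnit(WorkUnit other) {
            props.putAll(other.props);
        }

        String getProp(String key) {
            return props.get(key);
        }

        void setProp(String key, String value) {
            props.put(key, value);
        }
    }

    // Stand-in for the Hive-specific work unit the publisher expected.
    static class HiveWorkUnit extends WorkUnit {
        HiveWorkUnit() {}

        HiveWorkUnit(WorkUnit other) {
            super(other);
        }

        String getTableName() {
            return getProp("hive.table"); // hypothetical property key
        }
    }

    // Stand-in for the read-only wrapper that was actually returned.
    static class ImmutableWorkUnit extends WorkUnit {
        ImmutableWorkUnit(WorkUnit other) {
            super(other);
        }

        @Override
        void setProp(String key, String value) {
            throw new UnsupportedOperationException("work unit is immutable");
        }
    }

    // Stand-in for the state object: it hands out a read-only copy.
    static class WorkUnitState {
        private final WorkUnit workUnit;

        WorkUnitState(WorkUnit workUnit) {
            this.workUnit = workUnit;
        }

        WorkUnit getWorkUnit() {
            return new ImmutableWorkUnit(workUnit);
        }
    }

    // The check the publisher effectively relied on; false for the wrapper.
    static boolean isHiveWorkUnit(WorkUnitState state) {
        return state.getWorkUnit() instanceof HiveWorkUnit;
    }

    // Possible workaround: rebuild a HiveWorkUnit from the returned properties.
    static String tableNameOf(WorkUnitState state) {
        return new HiveWorkUnit(state.getWorkUnit()).getTableName();
    }
}
```

In this model a type check on the returned object fails even though the state 
was built from a HiveWorkUnit, while re-wrapping by properties still recovers 
the table name.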

> Lineage events are not getting emitted during Avro2ORC conversion
> -
>
> Key: GOBBLIN-478
> URL: https://issues.apache.org/jira/browse/GOBBLIN-478
> Project: Apache Gobblin
>  Issue Type: Bug
>Reporter: Aditya Sharma
>Assignee: Aditya Sharma
>Priority: Major
>
> Job execution logs have the following line:
> Submitted 0 lineage events for dataset 





[jira] [Created] (GOBBLIN-478) Lineage events are not getting emitted during Avro2ORC conversion

2018-04-26 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-478:
-

 Summary: Lineage events are not getting emitted during Avro2ORC 
conversion
 Key: GOBBLIN-478
 URL: https://issues.apache.org/jira/browse/GOBBLIN-478
 Project: Apache Gobblin
  Issue Type: Bug
Reporter: Aditya Sharma
Assignee: Aditya Sharma








[jira] [Created] (GOBBLIN-463) Change lineage event for Avro2Orc conversion to have underlying FileSystem as platform

2018-04-10 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-463:
-

 Summary: Change lineage event for Avro2Orc conversion to have 
underlying FileSystem as platform
 Key: GOBBLIN-463
 URL: https://issues.apache.org/jira/browse/GOBBLIN-463
 Project: Apache Gobblin
  Issue Type: Improvement
Reporter: Aditya Sharma
Assignee: Aditya Sharma


Currently the lineage event for Avro2Orc conversion goes from platform hive to 
platform hive. The platform should be changed to the underlying FileSystem.

Lineage for a Hive table can be derived from the HDFS location of the table 
if lineage for that HDFS location exists, but the reverse is not possible.





[jira] [Created] (GOBBLIN-450) Improve logging and error messages for Avro2Orc conversion

2018-03-28 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-450:
-

 Summary: Improve logging and error messages for Avro2Orc conversion
 Key: GOBBLIN-450
 URL: https://issues.apache.org/jira/browse/GOBBLIN-450
 Project: Apache Gobblin
  Issue Type: Improvement
  Components: misc
Reporter: Aditya Sharma
Assignee: Aditya Sharma


When an avro2orc conversion fails, it is difficult to find the Hive 
table/partition for which the exception occurred.

For example, the following message does not name the table or partition:

{noformat}

Atleast one destination format should be specified at 
hive.conversion.avro.destinationFormats. If you do not intend to convert this 
dataset set hive.conversion.avro.is.blacklisted to true

{noformat}





[jira] [Created] (GOBBLIN-414) Add lineage event for convertible hive datasets

2018-02-19 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-414:
-

 Summary: Add lineage event for convertible hive datasets
 Key: GOBBLIN-414
 URL: https://issues.apache.org/jira/browse/GOBBLIN-414
 Project: Apache Gobblin
  Issue Type: Improvement
Reporter: Aditya Sharma
Assignee: Aditya Sharma


Convertible Hive Datasets are used to convert a dataset from one format to 
multiple formats. Emitting lineage events will help to keep track of source and 
destination datasets.





[jira] [Created] (GOBBLIN-399) Refactor HiveSource#shouldCreateWorkunit() to accept table as parameter

2018-01-31 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-399:
-

 Summary: Refactor HiveSource#shouldCreateWorkunit() to accept 
table as parameter
 Key: GOBBLIN-399
 URL: https://issues.apache.org/jira/browse/GOBBLIN-399
 Project: Apache Gobblin
  Issue Type: Improvement
  Components: misc
Reporter: Aditya Sharma
Assignee: Aditya Sharma


HiveSource has no shouldCreateWorkunit method that accepts a Table as its 
parameter. Classes extending HiveSource might need to decide whether to create 
work units based on the table. Adding such a method lets them do so by 
overriding it.
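
A minimal sketch of the proposed hook follows. The classes are simplified, 
hypothetical stand-ins that reuse the names from this issue; the real 
HiveSource and Table have different signatures, and the database name used in 
the subclass is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the classes named in the issue.
class TableFilterDemo {

    // Stand-in for a Hive table: just a database name and a table name.
    static class Table {
        final String dbName;
        final String tableName;

        Table(String dbName, String tableName) {
            this.dbName = dbName;
            this.tableName = tableName;
        }
    }

    // Stand-in for the source: asks the hook once per table.
    static class HiveSource {

        // Overridable hook; the default accepts every table.
        protected boolean shouldCreateWorkunit(Table table) {
            return true;
        }

        // Returns one work unit name per accepted table.
        List<String> getWorkunits(List<Table> tables) {
            List<String> workunits = new ArrayList<>();
            for (Table table : tables) {
                if (shouldCreateWorkunit(table)) {
                    workunits.add(table.dbName + "." + table.tableName);
                }
            }
            return workunits;
        }
    }

    // A subclass decides per table without duplicating the loop above.
    static class SingleDbHiveSource extends HiveSource {
        @Override
        protected boolean shouldCreateWorkunit(Table table) {
            return "tracking".equals(table.dbName); // invented database name
        }
    }
}
```

With this shape a subclass only overrides the decision and inherits the rest 
of the work unit creation logic.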





[jira] [Updated] (GOBBLIN-376) Kafka Schema Registration for RecordWithMetadata and BytesToRecordWithMetadataConverter

2018-01-18 Thread Aditya Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/GOBBLIN-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Sharma updated GOBBLIN-376:
--
Description: 
RecordWithMetadataSchemaRegistrationConverter: An Avro schema might need to be 
registered with Kafka. Using RecordWithMetadata, the registered schema id can 
be added to the record metadata. Hence a converter that takes a 
RecordWithMetadata, registers the Avro schema with the Kafka schema registry, 
adds the schema id to the metadata and returns the RecordWithMetadata will be 
useful.

BytesToRecordWithMetadataConverter: Wraps bytes along with metadata.

Use case: We needed a converter that takes a serialized Avro record (byte 
array) and its schema as a string, and returns the record along with the 
registered schemaId.

A RecordWithMetadata object can carry the schemaId along with the record (in 
this case a byte array). Hence chaining the two converters above, 
BytesToRecordWithMetadataConverter followed by 
RecordWithMetadataSchemaRegistrationConverter, will do the work.

  was:
An Avro Schema might need to be registered to Kafka. Using RecordWithMetadata 
registered schema id can be added to record metadata.

Hence adding a converter with takes RecordWithMetadata, registers Avro schema 
with Kafka schema registry, adds the schema id to metadata and returns 
RecordWithMetadata will be useful.

BytestoRecordWithMetadata converter: To wrap bytes along with metadata

Use case:

Needed a converter which takes serialized avro record (byte array) and schema 
as string, and returns the record along with the registered schemaId.

RecordWithMetadata object can be used to carry the schemaId along with the 
record(in this case byte array).
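
The two-step chain described in this update can be sketched as follows. 
Everything here is a hypothetical stand-in: the RecordWithMetadata class is a 
simplified model, the in-memory registry replaces a real Kafka schema registry 
client, and the "schemaId" metadata key is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the converter chain; not Gobblin's real API.
class ConverterChainDemo {

    // Stand-in for a record carried together with a metadata map.
    static class RecordWithMetadata<D> {
        final D record;
        final Map<String, Object> metadata = new HashMap<>();

        RecordWithMetadata(D record) {
            this.record = record;
        }
    }

    // Stand-in registry: hands out one id per distinct schema string.
    static class InMemorySchemaRegistry {
        private final Map<String, Integer> ids = new HashMap<>();

        int register(String schema) {
            Integer id = ids.get(schema);
            if (id == null) {
                id = ids.size() + 1;
                ids.put(schema, id);
            }
            return id;
        }
    }

    // Step 1: wrap the serialized Avro bytes so metadata can travel with them.
    static RecordWithMetadata<byte[]> wrapBytes(byte[] avroBytes) {
        return new RecordWithMetadata<>(avroBytes);
    }

    // Step 2: register the schema and store the returned id in the metadata.
    static RecordWithMetadata<byte[]> registerSchema(
            RecordWithMetadata<byte[]> record, String schema, InMemorySchemaRegistry registry) {
        record.metadata.put("schemaId", registry.register(schema)); // assumed key
        return record;
    }
}
```

A caller would apply wrapBytes first and registerSchema second, so the byte 
array passes through unchanged while the schema id rides along in the 
metadata.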


> Kafka Schema Registration for RecordWithMetadata and 
> BytesToRecordWithMetadataConverter
> ---
>
> Key: GOBBLIN-376
> URL: https://issues.apache.org/jira/browse/GOBBLIN-376
> Project: Apache Gobblin
>  Issue Type: New Feature
>  Components: misc
>Reporter: Aditya Sharma
>Assignee: Aditya Sharma
>Priority: Major





[jira] [Created] (GOBBLIN-313) Option to explicitly set group name for staging and final destination directories for Avro-To-Orc conversion

2017-11-09 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-313:
-

 Summary: Option to explicitly set group name for staging and final 
destination directories for Avro-To-Orc conversion
 Key: GOBBLIN-313
 URL: https://issues.apache.org/jira/browse/GOBBLIN-313
 Project: Apache Gobblin
  Issue Type: Improvement
Reporter: Aditya Sharma
Assignee: Aditya Sharma


Currently the Avro-To-Orc conversion job tries to preserve the group name 
during conversion. That is, the group name of the destination directory will 
be the same as that of the source directory.

There should be an option to explicitly define the group name for the 
top-level destination directory and its immediate child directories 
(staging/final directories).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (GOBBLIN-297) Changing access modifier to Protected for HiveSource and Watermarker classes

2017-10-25 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-297:
-

 Summary: Changing access modifier to Protected for HiveSource and 
Watermarker classes
 Key: GOBBLIN-297
 URL: https://issues.apache.org/jira/browse/GOBBLIN-297
 Project: Apache Gobblin
  Issue Type: Improvement
Reporter: Aditya Sharma
Assignee: Aditya Sharma


The access modifier is private for most of the variables/methods in 
HiveSource, PartitionLevelWatermarker and TableLevelWatermarker. Extending 
these classes results in duplication of code. Hence the access modifier 
should be changed to protected where required.





[jira] [Created] (GOBBLIN-267) HiveSource creates workunit even when update time is before maxLookBackDays

2017-09-26 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-267:
-

 Summary: HiveSource creates workunit even when update time is 
before maxLookBackDays
 Key: GOBBLIN-267
 URL: https://issues.apache.org/jira/browse/GOBBLIN-267
 Project: Apache Gobblin
  Issue Type: Bug
  Components: misc
Reporter: Aditya Sharma
Assignee: Aditya Sharma


org.apache.gobblin.data.management.conversion.hive.source.HiveSource creates 
a workunit if:
1) Create time is after the maxLookBackDays cutoff
2) Update time is greater than the watermark

Since there are multiple policies for deciding the update time, the create 
time can be greater than the update time, in which case the maxLookBackDays 
check is redundant.

HiveSource should apply the maxLookBackDays check to the update time.
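
The difference between the two checks can be sketched as below. This is one 
reading of the issue, not HiveSource's actual code: the method names and 
parameters are invented, and whether the create-time check should also remain 
is not stated in the report.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the two lookback checks; not HiveSource's code.
class LookbackDemo {

    // Reported behaviour: the lookback window is applied to the create time.
    static boolean createTimeCheck(Instant createTime, Instant updateTime,
                                   Instant watermark, Instant now, int maxLookBackDays) {
        Instant cutoff = now.minus(Duration.ofDays(maxLookBackDays));
        return createTime.isAfter(cutoff) && updateTime.isAfter(watermark);
    }

    // Proposed behaviour: the lookback window is applied to the update time.
    static boolean updateTimeCheck(Instant updateTime, Instant watermark,
                                   Instant now, int maxLookBackDays) {
        Instant cutoff = now.minus(Duration.ofDays(maxLookBackDays));
        return updateTime.isAfter(cutoff) && updateTime.isAfter(watermark);
    }
}
```

For a partition created one day ago whose update time, under some policy, is 
six days old, a three-day lookback accepts it under the first check and 
rejects it under the second.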





[jira] [Created] (GOBBLIN-247) avro-to-orc conversion validation job should fail only on data mismatch

2017-09-10 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-247:
-

 Summary: avro-to-orc conversion validation job should fail only on 
data mismatch 
 Key: GOBBLIN-247
 URL: https://issues.apache.org/jira/browse/GOBBLIN-247
 Project: Apache Gobblin
  Issue Type: Bug
Reporter: Aditya Sharma


The Avro-to-ORC validation job fails if there is a failure in the validation 
query. It should fail only when there is a data mismatch.





[jira] [Created] (GOBBLIN-223) CsvToJsonConverter should throw DataConversionException

2017-08-22 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-223:
-

 Summary: CsvToJsonConverter should throw DataConversionException
 Key: GOBBLIN-223
 URL: https://issues.apache.org/jira/browse/GOBBLIN-223
 Project: Apache Gobblin
  Issue Type: Bug
Reporter: Aditya Sharma
Assignee: Aditya Sharma


CsvToJsonConverter should throw DataConversionException when it fails to 
convert a record. This can happen due to a schema mismatch, for example while 
processing a footer.
If it throws DataConversionException, the failure can be handled when the user 
sets the property "task.skip.err.records".






[jira] [Created] (GOBBLIN-177) Allow error limit to skip records which are not convertible

2017-07-29 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-177:
-

 Summary: Allow error limit to skip records which are not 
convertible
 Key: GOBBLIN-177
 URL: https://issues.apache.org/jira/browse/GOBBLIN-177
 Project: Apache Gobblin
  Issue Type: New Feature
Reporter: Aditya Sharma
Priority: Minor


Gobblin currently fails if a record is not convertible by the converter class. 
Allowing an error limit will help users significantly.
Use case:
If a CSV file contains a footer, which needs to be skipped, by setting an error 
limit of 1, users will be able to ETL data from a CSV file.
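
The requested behaviour could look roughly like the sketch below. It is a 
hypothetical illustration, not Gobblin's task code: the exception class is a 
local stand-in that only shares its name with Gobblin's 
DataConversionException, and parseRow stands in for a real converter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of an error limit around a converter.
class ErrorLimitDemo {

    // Local stand-in for a conversion failure.
    static class DataConversionException extends RuntimeException {
        DataConversionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Converts every input, skipping failed records until the number of
    // failures exceeds errorLimit; then the failure is rethrown.
    static <I, O> List<O> convertAll(List<I> inputs, Function<I, O> converter, int errorLimit) {
        List<O> converted = new ArrayList<>();
        int errors = 0;
        for (I input : inputs) {
            try {
                converted.add(converter.apply(input));
            } catch (DataConversionException e) {
                errors++;
                if (errors > errorLimit) {
                    throw e;
                }
            }
        }
        return converted;
    }

    // Toy converter: a row is convertible only if it is an integer.
    static Integer parseRow(String row) {
        try {
            return Integer.valueOf(row.trim());
        } catch (NumberFormatException e) {
            throw new DataConversionException("Cannot convert row: " + row, e);
        }
    }
}
```

With an error limit of 1, a file whose last row is a footer converts 
successfully and the footer is dropped; with a limit of 0 the same file fails.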





[jira] [Created] (GOBBLIN-176) Gobblin build is failing with missing dependency jetty-http

2017-07-29 Thread Aditya Sharma (JIRA)
Aditya Sharma created GOBBLIN-176:
-

 Summary: Gobblin build is failing with missing dependency 
jetty-http
 Key: GOBBLIN-176
 URL: https://issues.apache.org/jira/browse/GOBBLIN-176
 Project: Apache Gobblin
  Issue Type: Bug
Reporter: Aditya Sharma


The gobblin-restli module fails to build on Travis as well as locally. The 
class FlowConfigTest.java imports "org.eclipse.jetty.http.HttpStatus", which is 
present in the jar org.eclipse.jetty:jetty-http.

Error Summary:
/gobblin-github/gobblin-restli/gobblin-flow-config-service/gobblin-flow-config-service-client/src/test/java/gobblin/service/FlowConfigTest.java:26: error: package org.eclipse.jetty.http does not exist





[jira] [Assigned] (GOBBLIN-176) Gobblin build is failing with missing dependency jetty-http

2017-07-29 Thread Aditya Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/GOBBLIN-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Sharma reassigned GOBBLIN-176:
-

Assignee: Aditya Sharma

> Gobblin build is failing with missing dependency jetty-http
> ---
>
> Key: GOBBLIN-176
> URL: https://issues.apache.org/jira/browse/GOBBLIN-176
> Project: Apache Gobblin
>  Issue Type: Bug
>Reporter: Aditya Sharma
>Assignee: Aditya Sharma


