[ 
https://issues.apache.org/jira/browse/GOBBLIN-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilak Patidar updated GOBBLIN-249:
----------------------------------
    Description: 
Various converters use the source.schema value to convert source records 
into their respective data formats, with support for both primitive and 
complex data types. It seems we should write down a specification for 
defining a source.schema. The specification should cover:
* Converters and their use cases <Source, Target>.
* Converters and the data types supported by them.
* The list of data types and their properties.
* Examples of writing both simple and nested schemas.
* The configuration values used by converters.
* The options available for defining a field's schema (size, nullability, 
etc.).
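
As an illustration of the kind of schema the specification could describe 
(field names and layout here are hypothetical, loosely modeled on the 
JSON-array schemas existing converters consume):

```json
[
  {"columnName": "id",   "dataType": {"type": "int"},    "isNullable": false},
  {"columnName": "tags", "dataType": {"type": "array", "items": "string"}},
  {"columnName": "address", "dataType": {"type": "record", "values": [
      {"columnName": "city", "dataType": {"type": "string"}},
      {"columnName": "zip",  "dataType": {"type": "string"}, "isNullable": true}
  ]}}
]
```

A spec like the one proposed would pin down exactly which keys (columnName, 
dataType, isNullable, ...) are legal and what each converter does with them.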

This source.schema would act as an abstraction over the underlying schemas and 
data types of different formats such as Avro, Parquet, and ORC. The user would 
define a source.schema adhering to our specification and could then convert and 
write to different data formats without worrying about each target format's 
schema.

Data type abstraction
For example, Parquet does not have a MAP type, but a map can be represented by 
using a repeated group in Parquet. If the user defines a MAP in the source 
schema, we can do the necessary conversion and provide a MAP-like structure in 
Parquet. In this way, the user is freed from concerns about type conversion and 
the target schema. The converters could also be made a separate module acting 
as a conversion library for different data formats.
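
A minimal sketch of the MAP lowering described above, assuming a hypothetical 
helper (this is not actual Gobblin code): an abstract MAP field is rewritten 
into Parquet's repeated key_value group representation.

```python
def map_to_parquet_group(field_name, key_type, value_type):
    """Lower an abstract MAP field to Parquet's map representation:
    an annotated group containing a repeated key_value group.
    Hypothetical helper for illustration only."""
    return (
        f"optional group {field_name} (MAP) {{\n"
        f"  repeated group key_value {{\n"
        f"    required {key_type} key;\n"
        f"    optional {value_type} value;\n"
        f"  }}\n"
        f"}}"
    )

# A MAP<string, int> field named "attributes" becomes:
schema = map_to_parquet_group("attributes", "binary", "int32")
print(schema)
```

The user writes only the abstract MAP; the converter emits the repeated-group 
structure on their behalf.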


> Documenting an abstract source schema specification to convert records to 
> different formats
> -------------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-249
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-249
>             Project: Apache Gobblin
>          Issue Type: Wish
>          Components: gobblin-core
>            Reporter: Tilak Patidar
>            Assignee: Abhishek Tiwari



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
