[
https://issues.apache.org/jira/browse/GOBBLIN-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilak Patidar updated GOBBLIN-249:
----------------------------------
Description:
Various converters use the source.schema value to convert source records into their respective data formats, with support for both primitive and complex data types. It seems we should write down a specification for defining a source.schema. The specification should include instructions on:
* Converters and their use case <Source, Target>.
* Converters and the data types they support.
* A list of data types and their properties.
* Examples of writing schemas, both simple and nested.
* A list of configuration values used by converters.
* A list of the options available for defining the schema of a field (size, nullable, etc.).
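As a sketch of what such schema examples might look like (the JSON layout below is a hypothetical illustration, not a finalized specification), a simple field and a nested field could be defined as:

```json
[
  {"columnName": "id", "dataType": {"type": "long"}, "isNullable": false},
  {"columnName": "address", "dataType": {"type": "record", "values": [
      {"columnName": "city", "dataType": {"type": "string"}},
      {"columnName": "zip",  "dataType": {"type": "string"}, "isNullable": true}
  ]}}
]
```

The field names (columnName, dataType, isNullable) are assumptions here; the specification would pin down the exact vocabulary.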
This source.schema would act as an abstraction over the underlying schemas and data types of different formats such as Avro, Parquet, ORC, etc. The user would define the source.schema according to our specification and could then convert and write to different data formats without worrying about the target format's schema.
Data type abstraction
For example, Parquet does not have a MAP type, but a map can be created using a repeated group in Parquet. If the user defines a MAP in the source schema, we can do the necessary conversion and provide a MAP-like structure in Parquet. In this way, the user is freed from concerns about type conversion and the target schema. The converters could perhaps also be made into a separate module acting as a conversion library for different data formats.
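To illustrate the MAP case (the source.schema form below is the same hypothetical sketch as above), a MAP field declared in the abstract schema could be lowered by a converter to Parquet's standard repeated-group encoding of maps:

```
source.schema (hypothetical):
  {"columnName": "scores", "dataType": {"type": "map", "values": "int"}}

resulting Parquet schema:
  message record {
    required group scores (MAP) {
      repeated group key_value {
        required binary key (UTF8);
        optional int32 value;
      }
    }
  }
```

The Parquet side follows the repeated-group map convention from the Parquet format's logical-type rules; the user never has to write it by hand.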
> Documenting an abstract source schema specification to convert records to
> different formats
> -------------------------------------------------------------------------------------------
>
> Key: GOBBLIN-249
> URL: https://issues.apache.org/jira/browse/GOBBLIN-249
> Project: Apache Gobblin
> Issue Type: Wish
> Components: gobblin-core
> Reporter: Tilak Patidar
> Assignee: Abhishek Tiwari
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)