[ https://issues.apache.org/jira/browse/GOBBLIN-249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162830#comment-16162830 ]

Tilak Patidar commented on GOBBLIN-249:
---------------------------------------

Link to the sample documentation I have started working on:
[https://gist.github.com/tilakpatidar/2591c8f4503bcbd0bc0ab212b31ec9b5]

> Documenting an abstract source schema specification to convert records to 
> different formats
> -------------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-249
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-249
>             Project: Apache Gobblin
>          Issue Type: Wish
>          Components: gobblin-core
>            Reporter: Tilak Patidar
>            Assignee: Abhishek Tiwari
>
> Various converters use the source.schema value to convert source records 
> into their respective data formats, supporting both primitive and complex 
> data types. It seems we should write down a specification for defining a 
> source.schema. The specification should cover:
> * Converters and their use cases <Source, Target>.
> * Converters and the data types they support.
> * The available data types and their properties.
> * Examples of writing both simple and nested schemas.
> * The configuration values used by converters.
> * The options available for defining a field's schema (size, nullable, etc.).
> This source.schema would act as an abstraction over the underlying schemas 
> and data types of the different formats such as Avro, Parquet, and ORC. The 
> user would define a source.schema adhering to our specification and could 
> then convert and write to different data formats without worrying about the 
> target format's schema.
> Data type abstraction
> For example, Parquet does not have a MAP type, but a map can be modelled 
> using a repeated group in Parquet. If the user defines a MAP in the source 
> schema, we can do the necessary conversion and provide a MAP-like structure 
> in Parquet. In this way, the user is freed from the concerns of type 
> conversion and the target schema. Perhaps the converters could also be made 
> a separate module acting as a conversion library for the different data 
> formats.
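To make the MAP example above concrete, here is a minimal sketch of how such a conversion might look. The field layout (`columnName`, `dataType`, `isNullable`) and the function name are illustrative assumptions, not the actual Gobblin source.schema format; the Parquet side follows the standard repeated `key_value` group representation of maps.

```python
# Hypothetical sketch only: the source.schema field layout below is an
# assumption for illustration, not Gobblin's actual specification.

def map_field_to_parquet(field):
    """Translate an abstract map field into a Parquet-style group
    description. Parquet has no native MAP type, so maps are modelled
    as a repeated key_value group."""
    if field["dataType"]["type"] != "map":
        raise ValueError("expected a map field")
    return {
        "name": field["columnName"],
        # A nullable source field becomes an OPTIONAL Parquet group.
        "repetition": "OPTIONAL" if field.get("isNullable") else "REQUIRED",
        "group": {
            "name": "key_value",
            "repetition": "REPEATED",  # one entry per map key/value pair
            "fields": [
                {"name": "key", "type": "BINARY", "repetition": "REQUIRED"},
                {"name": "value",
                 "type": field["dataType"]["values"].upper(),
                 "repetition": "OPTIONAL"},
            ],
        },
    }

# Example: a nullable map<string, int32> column as it might appear in
# an abstract source.schema.
source_field = {
    "columnName": "user_attributes",
    "isNullable": True,
    "dataType": {"type": "map", "values": "int32"},
}

parquet_group = map_field_to_parquet(source_field)
```

The point of the sketch is only that the user writes the abstract `map` field once, and the converter owns the format-specific encoding decision.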



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
