[ https://issues.apache.org/jira/browse/FLINK-8203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281731#comment-16281731 ]

ASF GitHub Bot commented on FLINK-8203:
---------------------------------------

GitHub user twalthr opened a pull request:

    https://github.com/apache/flink/pull/5132

    [FLINK-8203] [FLINK-7681] [table] Make schema definition of 
DataStream/DataSet to Table conversion more flexible

    ## What is the purpose of the change
    
    This PR makes the schema definition more flexible. It add two ways of 
adding schema information:
    
    Reference input fields by name:
    All fields in the schema definition are referenced by name
    (and possibly renamed using an alias (`as`)). In this mode, fields can be 
reordered and
    projected out. Moreover, we can define proctime and rowtime attributes at 
arbitrary
    positions using arbitrary names (except those that already exist in the 
result schema). This mode
    can be used for any input type, including POJOs.
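    To make the by-name semantics concrete, here is a minimal, Flink-independent
Python sketch of the mapping described above. The function name `map_by_name`
and its signature are hypothetical illustrations, not part of the Flink API:

```python
def map_by_name(input_fields, schema):
    """Project, reorder, and rename input fields by name.

    input_fields: dict of input field name -> value (e.g. the fields of a POJO).
    schema: list of (input_field_name, output_alias) pairs in output order.
    """
    out = {}
    for ref, alias in schema:
        # Every schema entry must reference an existing input field.
        if ref not in input_fields:
            raise ValueError(f"unknown input field: {ref}")
        out[alias] = input_fields[ref]
    return out
```

    For example, `map_by_name({"a": 1, "b": 2, "c": 3}, [("c", "x"), ("a", "y")])`
reorders, projects out `b`, and renames the remaining fields.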
    
    Reference input fields by position:
    Field references must not refer to existing fields in the input type 
(except for
    renaming with an alias (`as`)). In this mode, fields are simply renamed. 
Event-time attributes can
    replace the field at their position in the input data (if it is of the 
correct type) or be
    appended at the end. Proctime attributes must be appended at the end. This 
mode can only be
    used if the input type has a defined field order (tuple, case class, Row) 
and none of the fields
    references a field of the input type.
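    The by-position mode can likewise be sketched in a few lines of
Flink-independent Python. `map_by_position` is a hypothetical helper that only
illustrates the positional renaming and the mode-selection rule (none of the
schema names may match an existing input field name):

```python
def map_by_position(input_values, input_names, schema_names):
    """Rename fields positionally.

    Only valid when the input type has a defined field order (tuple,
    case class, Row) and no schema name references an existing input
    field; otherwise the by-name mode would apply.
    """
    if any(name in input_names for name in schema_names):
        raise ValueError("schema references an input field; use by-name mode")
    if len(schema_names) != len(input_values):
        raise ValueError("schema arity does not match input arity")
    # Names are applied one-to-one, in order.
    return dict(zip(schema_names, input_values))
```

    For a Row-like input `(1, "x")` with default names `["f0", "f1"]`, the
schema `["id", "name"]` simply renames both fields in place.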
    
    Both modes now work with any `TypeInformation`. In the past, this behavior 
was not consistent.
    
    I will add some paragraphs to the documentation once we have agreed on this 
new behavior.
    
    ## Brief change log
    
    Various changes in `TableEnvironment`, `Stream/BatchTableEnvironment`, and 
pattern matches that referenced `AtomicType` instead of `TypeInformation`.
    
    
    ## Verifying this change
    
    See TableEnvironment tests.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): no
      - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
      - The serializers: no
      - The runtime per-record code paths (performance sensitive): no
      - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: no
      - The S3 file system connector: no
    
    ## Documentation
    
      - Does this pull request introduce a new feature? no
      - If yes, how is the feature documented? will document it later


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/twalthr/flink FLINK-8203

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5132.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5132
    
----
commit 38562e1dcc5416996ad5531b901f89e4b868e5eb
Author: twalthr <[email protected]>
Date:   2017-12-07T10:52:28Z

    [FLINK-8203] [FLINK-7681] [table] Make schema definition of 
DataStream/DataSet to Table conversion more flexible

----


> Make schema definition of DataStream/DataSet to Table conversion more flexible
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-8203
>                 URL: https://issues.apache.org/jira/browse/FLINK-8203
>             Project: Flink
>          Issue Type: Bug
>          Components: Table API & SQL
>    Affects Versions: 1.4.0, 1.5.0
>            Reporter: Fabian Hueske
>            Assignee: Timo Walther
>
> When converting or registering a {{DataStream}} or {{DataSet}} as a {{Table}}, 
> the schema of the table can be defined (by default it is extracted from the 
> {{TypeInformation}}).
> The schema needs to be manually specified to select (project) fields, rename 
> fields, or define time attributes. Right now, there are several limitations on 
> how the fields can be defined, which also depend on the type of the 
> {{DataStream}} / {{DataSet}}. Types with explicit field ordering (e.g., 
> tuples, case classes, Row) require schema definition based on the position of 
> fields. POJO types, which have no fixed field order, require referring to 
> fields by name. Moreover, there are several restrictions on how time 
> attributes can be defined, e.g., an event-time attribute must replace an 
> existing field or be appended, and proctime attributes must be appended.
> I think we can make the schema definition more flexible and provide two modes:
> 1. Reference input fields by name: All fields in the schema definition are 
> referenced by name (and possibly renamed using an alias ({{as}})). In this 
> mode, fields can be reordered and projected out. Moreover, we can define 
> proctime and event-time attributes at arbitrary positions using arbitrary 
> names (except those that already exist in the result schema). This mode can 
> be used for any input type, including POJOs. This mode is used if all field 
> references exist in the input type.
> 2. Reference input fields by position: Field references need not refer to 
> existing fields in the input type. In this mode, fields are simply renamed. 
> Event-time attributes can replace the field at their position in the input 
> data (if it is of the correct type) or be appended at the end. Proctime 
> attributes must be appended at the end. This mode can only be used if the 
> input type has a defined field order (tuple, case class, Row).
> We need to add more tests that check all combinations of input types and 
> schema definition modes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
