[jira] [Commented] (FLINK-2692) Untangle CsvInputFormat into PojoTypeCsvInputFormat and TupleTypeCsvInputFormat

ASF GitHub Bot (JIRA) Sun, 18 Oct 2015 15:35:45 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962663#comment-14962663
 ]


ASF GitHub Bot commented on FLINK-2692:
---------------------------------------

GitHub user zentol opened a pull request:

    https://github.com/apache/flink/pull/1266

    [FLINK-2692] Untangle CsvInputFormat

    This PR splits the CsvInputFormat into a Tuple and a POJO version. To this 
end, the (Common)CsvInputFormat classes were merged, and the type-specific 
portions were refactored into separate classes.
    
    Additionally, the ScalaCsvInputFormat has been removed; the Java and Scala 
APIs now use the same InputFormats. Previously, the formats differed in the way 
they created the output tuples; this is now handled by a newly introduced 
abstract method "createOrReuseInstance(Object[] fieldValues, T reuse)" within 
the TupleSerializerBase.
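
    To illustrate the idea behind createOrReuseInstance, here is a minimal, 
self-contained sketch. It does not use Flink's classes: TupleFactory, 
MutablePair, ImmutablePair and both factory classes are hypothetical names, and 
the assumption that one side overwrites a reused object while the other 
allocates a new one per record is mine, not something stated in this PR.

```java
/** Hypothetical stand-in for the serializer base class; not Flink's actual code. */
abstract class TupleFactory<T> {
    /** Builds a record from parsed field values, reusing {@code reuse} if the type allows it. */
    abstract T createOrReuseInstance(Object[] fieldValues, T reuse);
}

/** Mutable record (assumed Java-tuple style): fields can be overwritten in place. */
final class MutablePair {
    Object f0;
    Object f1;
}

/** Immutable record (assumed case-class style): each row needs a new object. */
final class ImmutablePair {
    final Object first;
    final Object second;

    ImmutablePair(Object first, Object second) {
        this.first = first;
        this.second = second;
    }
}

/** Overwrites the fields of the reuse object, allocating only if none was given. */
final class MutablePairFactory extends TupleFactory<MutablePair> {
    @Override
    MutablePair createOrReuseInstance(Object[] fieldValues, MutablePair reuse) {
        MutablePair target = (reuse != null) ? reuse : new MutablePair();
        target.f0 = fieldValues[0];
        target.f1 = fieldValues[1];
        return target;
    }
}

/** Ignores the reuse object and always returns a fresh instance. */
final class ImmutablePairFactory extends TupleFactory<ImmutablePair> {
    @Override
    ImmutablePair createOrReuseInstance(Object[] fieldValues, ImmutablePair reuse) {
        return new ImmutablePair(fieldValues[0], fieldValues[1]);
    }
}
```

    With a hook of this shape, a single input format can parse a line into an 
Object[] and leave the decision between reuse and allocation to the serializer.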
    
    Fields to include and field names are no longer passed via setters, but 
instead via the constructor. Several new constructors were added to accommodate 
different use cases, along with two new static methods that generate a default 
include mask or convert an int[] of field indices to a boolean include mask.
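
    As a rough sketch of what such helpers could look like (the class name 
IncludeMasks and both method names and signatures below are assumptions for 
illustration, not copied from the PR):

```java
import java.util.Arrays;

/** Hypothetical helper; names and signatures are illustrative, not taken from the PR. */
final class IncludeMasks {

    private IncludeMasks() {}

    /** Default mask: include each of the first {@code size} fields. */
    static boolean[] createDefaultMask(int size) {
        boolean[] mask = new boolean[size];
        Arrays.fill(mask, true);
        return mask;
    }

    /** Converts field indices into a mask that is true at every listed position. */
    static boolean[] toBooleanMask(int[] fieldIndices) {
        int max = -1;
        for (int index : fieldIndices) {
            if (index < 0) {
                throw new IllegalArgumentException(
                        "Field indices must not be negative: " + index);
            }
            max = Math.max(max, index);
        }
        // The mask only needs to reach the highest included field.
        boolean[] mask = new boolean[max + 1];
        for (int index : fieldIndices) {
            mask[index] = true;
        }
        return mask;
    }
}
```

    In this sketch the order of the indices has no effect on the result: 
{0, 2, 5} and {5, 0, 2} produce the same mask, because each index only marks 
its own position.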
    
    Classes no longer have to be passed separately, as they are extracted from 
the TypeInformation object.
    
    A few sanity checks were moved from the ExecEnvironment to the InputFormat.
    
    The testReadSparseWithShuffledPositions test was removed since a monotonic 
order of field indices is, and afaik was, not actually necessary, because the 
indices are converted to a boolean[].

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zentol/flink 2692_csv

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1266.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1266
    
----
commit d497415adc2e58b4e9912ae89a53444825416366
Author: zentol <[email protected]>
Date:   2015-10-18T18:23:23Z

    [FLINK-2692] Untangle CsvInputFormat

----


> Untangle CsvInputFormat into PojoTypeCsvInputFormat and 
> TupleTypeCsvInputFormat 
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-2692
>                 URL: https://issues.apache.org/jira/browse/FLINK-2692
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Till Rohrmann
>            Assignee: Chesnay Schepler
>            Priority: Minor
>
> The {{CsvInputFormat}} currently allows values to be returned as a {{Tuple}} 
> or a {{Pojo}} type. As a consequence, the processing logic, which has to work 
> for both types, is overly complex. For example, the {{CsvInputFormat}} 
> contains fields which are only used when a Pojo is returned. Moreover, the 
> POJO field information is constructed by calling setter methods which have to 
> be called in a very specific order, otherwise they fail. E.g. one first has 
> to call {{setFieldTypes}} before calling {{setOrderOfPOJOFields}}, otherwise 
> the number of fields might be different. Furthermore, some of the methods can 
> only be called if the return type is a {{Pojo}} type, because they expect 
> that a {{PojoTypeInfo}} is present.
> I think the {{CsvInputFormat}} should be refactored to make the code more 
> easily maintainable. I propose to split it up into a 
> {{PojoTypeCsvInputFormat}} and a {{TupleTypeCsvInputFormat}} which take all 
> the required information via their constructors instead of using the 
> {{setFields}} and {{setOrderOfPOJOFields}} approach.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
