[
https://issues.apache.org/jira/browse/FLINK-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15026735#comment-15026735
]
Johann Kovacs commented on FLINK-2988:
--------------------------------------
From my point of view this problem still persists (unless it has been addressed
in some other issue?). For context and examples, here is the mailing list
thread that preceded this issue:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-best-to-deal-with-wide-structured-tuples-td3300.html
This problem isn't about nullable values, per se. Basically I want to have an
API that lets me work easily with records more than 25 columns wide. The
{{Row}} class from the Table API supports this very well already, but I'm
unable to load a CSV file as a {{DataSet<Row>}}. This should be either somehow
supported by the {{ExecutionEnvironment#readCsvFile}} method (e.g. using a
quick fix like I proposed in the mailing list thread), or in the
{{TableEnvironment}} API (e.g. similarly to how there's a
{{Graph#fromCsvReader()}} method in Gelly).
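Until such a method exists, the parsing half of a workaround can be done by hand: read each line and interpret it against a runtime-supplied schema, producing an index-addressable record. A minimal sketch in plain Scala, where {{Schema}} and the {{Array[Any]}} record are hypothetical stand-ins (in Flink these roles would be played by {{RowTypeInfo}} and {{Row}}; nothing below is Flink API):

```scala
// Hypothetical stand-in for a runtime schema: field names plus per-field
// parsers. Not Flink API; RowTypeInfo would play this role in Flink.
case class Schema(fields: Seq[(String, String => Any)]) {
  def arity: Int = fields.size
}

// Parse one CSV line into an index-addressable record. The arity is
// unbounded, unlike TupleX, which stops at 25 fields.
def parseLine(line: String, schema: Schema): Array[Any] = {
  val cells = line.split(",", -1)
  require(cells.length == schema.arity,
    s"expected ${schema.arity} columns, got ${cells.length}")
  schema.fields.zip(cells).map { case ((_, parse), cell) => parse(cell) }.toArray
}

val schema = Schema(Seq(
  "word"   -> ((s: String) => s),
  "number" -> ((s: String) => s.toInt)))
println(parseLine("one,1", schema).mkString("Row(", ", ", ")")) // Row(one, 1)
```

In Flink one could then feed this into something like {{env.readTextFile(filePath).map(...)}}, though a proper quote- and escape-aware CSV parser would be needed for real data; the split-on-comma above is only an illustration.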
Using POJOs is not a proper solution for this use-case because:
* I might not know the schema at compile time.
* Even if I did, hand-writing POJOs with potentially dozens or hundreds of
fields is cumbersome and hard to maintain.
* I would need to create POJOs not only for the input schema of my file, but
also for each intermediate schema. Imagine a mapper function that simply
appends a column to a {{DataSet<MyHugePojo>}}: I would have to create a new
POJO, {{MyHugeIntermediatePojo}}, which adds the new field on top of all the
fields from the input schema (or, alternatively, use nested POJOs). Neither
variant is pleasant to work with compared to the index-based access that
{{Tuple}} and {{Row}} provide.
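The "append a column" case above can be sketched in plain Scala. Here {{Array[Any]}} is a hypothetical stand-in for the Table API's {{Row}}, used only to keep the example self-contained; the point is that growing the record needs no new class:

```scala
// With an index-addressable record, a mapper that appends a derived column
// needs no MyHugeIntermediatePojo-style class: the record simply grows
// from arity 2 to arity 3. Array[Any] stands in for Row; not Flink API.
def appendColumn(row: Array[Any], value: Any): Array[Any] = row :+ value

// Input record with schema (word: String, number: Int).
val input: Array[Any] = Array("one", 1)
// Derive a new column from an existing one by positional index.
val output = appendColumn(input, input(1).asInstanceOf[Int] * 10)
println(output.mkString("Row(", ", ", ")")) // Row(one, 1, 10)
```

With POJOs, the same step would force a new class per intermediate schema; with index-based records, only the runtime schema description changes.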
> Cannot load DataSet[Row] from CSV file
> --------------------------------------
>
> Key: FLINK-2988
> URL: https://issues.apache.org/jira/browse/FLINK-2988
> Project: Flink
> Issue Type: Improvement
> Components: DataSet API, Table API
> Affects Versions: 0.10.0
> Reporter: Johann Kovacs
> Priority: Minor
>
> Tuple classes (Java/Scala both) only have arity up to 25, meaning I cannot
> load a CSV file with more than 25 columns directly as a
> DataSet\[TupleX\[...\]\].
> An alternative to using Tuples is using the Table API's Row class, which
> allows for arbitrary-length, arbitrary-type, runtime-supplied schemata (using
> RowTypeInfo) and index-based access.
> However, trying to load a CSV file as a DataSet\[Row\] yields an exception:
> {code}
> val env = ExecutionEnvironment.createLocalEnvironment()
> val filePath = "../someCsv.csv"
> val typeInfo = new RowTypeInfo(
>   Seq(BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO),
>   Seq("word", "number"))
> val source = env.readCsvFile(filePath)(ClassTag(classOf[Row]), typeInfo)
> println(source.collect())
> {code}
> with someCsv.csv containing:
> {code}
> one,1
> two,2
> {code}
> yields
> {code}
> Exception in thread "main" java.lang.ClassCastException: org.apache.flink.api.table.typeinfo.RowSerializer cannot be cast to org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase
>   at org.apache.flink.api.scala.operators.ScalaCsvInputFormat.<init>(ScalaCsvInputFormat.java:46)
>   at org.apache.flink.api.scala.ExecutionEnvironment.readCsvFile(ExecutionEnvironment.scala:282)
> {code}
> As a user I would like to be able to load a CSV file into a DataSet\[Row\],
> preferably having a convenience method to specify the schema (RowTypeInfo),
> without having to use the "explicit implicit parameters" syntax and
> specifying the ClassTag.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)