[jira] [Commented] (BEAM-51) Implement a CSV file reader

Caio Iglesias (JIRA) Thu, 23 Jun 2016 17:02:18 -0700

    [ 
https://issues.apache.org/jira/browse/BEAM-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347457#comment-15347457
 ]


Caio Iglesias commented on BEAM-51:
-----------------------------------

related: https://issues.apache.org/jira/browse/BEAM-123

> Implement a CSV file reader
> ---------------------------
>
>                 Key: BEAM-51
>                 URL: https://issues.apache.org/jira/browse/BEAM-51
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Daniel Halperin
>            Priority: Minor
>
> We should implement a CSV-based source.
> One possibility would be to support the same options as BigQuery. 
> https://cloud.google.com/bigquery/preparing-data-for-bigquery#dataformats 
> These options are:
> fieldDelimiter: allows a custom delimiter (CSV vs. TSV, etc.). My guess is 
> this is critical. One common delimiter that people use is 'thorn' (þ).
> quote: Custom quote char. By default, this is '"', but this allows users to 
> set it to something else, or, perhaps more commonly, remove it entirely (by 
> setting it to the empty string). For example, tab-separated files generally 
> don't need quotes.
> allowQuotedNewlines: whether newlines may appear inside quoted fields. In the 
> official CSV RFC, newlines can be quoted; that is, "a", "b\n", "c" can form a 
> single record. This makes it impossible to split large CSV files at arbitrary 
> newlines, so we should disallow quoted newlines by default unless the user 
> really wants them (in which case, they'll get worse performance).
> allowJaggedRows: If a row has fewer columns than the schema, infer null for 
> the missing ones. Otherwise we give an error for the row.
> ignoreUnknownValues: The opposite of allowJaggedRows: if a row has _too_ 
> many values for the schema, we ignore the ones we don't recognize, rather 
> than reporting an error for the row.
> skipHeaderRows: the number of header lines at the top of the file to skip.
> encoding: UTF-8 vs. Latin-1, etc.
> compression: gzip, bzip, etc.
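
To make the options above concrete, here is a rough sketch of how they might
interact, written in Python with the standard csv module rather than against
any Beam API. CsvOptions and parse_csv are hypothetical names, the defaults
are assumptions, and encoding and compression are left out. Treat it as an
illustration of the semantics, not a proposed implementation.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CsvOptions:
    # Hypothetical option bag; field names mirror the list above,
    # and the defaults are assumptions.
    field_delimiter: str = ","
    quote: str = '"'              # empty string disables quoting
    allow_quoted_newlines: bool = False
    allow_jagged_rows: bool = False
    ignore_unknown_values: bool = False
    skip_header_rows: int = 0     # counted in records in this sketch


def parse_csv(text: str, num_columns: int,
              opts: CsvOptions) -> List[List[Optional[str]]]:
    # strict=True makes the csv module raise csv.Error on malformed input,
    # including a quoted field that is still open when the input ends.
    fmt = {"delimiter": opts.field_delimiter, "strict": True}
    if opts.quote:
        fmt["quotechar"] = opts.quote
    else:
        fmt["quoting"] = csv.QUOTE_NONE

    if opts.allow_quoted_newlines:
        # A record may span several physical lines, so the input has to be
        # parsed as one stream and cannot be split at arbitrary newlines.
        records = csv.reader(io.StringIO(text, newline=""), **fmt)
    else:
        # Every physical line is a complete record, so each line can be
        # parsed on its own; a newline inside quotes becomes an error.
        records = (next(csv.reader([line], **fmt))
                   for line in text.splitlines() if line)

    rows = []
    for index, record in enumerate(records):
        if index < opts.skip_header_rows:
            continue
        if len(record) < num_columns:
            if not opts.allow_jagged_rows:
                raise ValueError(f"record {index}: too few values")
            # Pad missing trailing columns with None (null).
            record = record + [None] * (num_columns - len(record))
        elif len(record) > num_columns:
            if not opts.ignore_unknown_values:
                raise ValueError(f"record {index}: too many values")
            # Drop the values the schema does not know about.
            record = record[:num_columns]
        rows.append(record)
    return rows
```

For example, a thorn-delimited file with quoting disabled and one header line
could be read with CsvOptions(field_delimiter="þ", quote="",
skip_header_rows=1). With the default options, an input such as
'a,"b\nc",d' is rejected with csv.Error, and it is only accepted as a single
three-column record when allow_quoted_newlines=True.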



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
