[
https://issues.apache.org/jira/browse/BEAM-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347457#comment-15347457
]
Caio Iglesias commented on BEAM-51:
-----------------------------------
related: https://issues.apache.org/jira/browse/BEAM-123
> Implement a CSV file reader
> ---------------------------
>
> Key: BEAM-51
> URL: https://issues.apache.org/jira/browse/BEAM-51
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-extensions
> Reporter: Daniel Halperin
> Priority: Minor
>
> We should implement a CSV-based source.
> One possibility would be to support the same options as BigQuery.
> https://cloud.google.com/bigquery/preparing-data-for-bigquery#dataformats
> These options are:
> fieldDelimiter: allowing a custom delimiter... csv vs tsv, etc. My guess is
> this is critical. One common delimiter that people use is 'thorn' (þ).
> quote: Custom quote char. By default, this is '"', but this allows users to
> set it to something else, or, perhaps more commonly, remove it entirely (by
> setting it to the empty string). For example, tab-separated files generally
> don't need quotes.
> allowQuotedNewlines: whether you can quote newlines. In the official CSV RFC
> (RFC 4180), newlines can be quoted; that is, you can have "a", "b\n", "c" in
> a single line. This makes it impossible to split large CSV files at
> arbitrary newlines, so we should disallow quoted newlines by default unless
> the user really wants them (in which case, they'll get worse performance).
> allowJaggedRows: This allows inferring null if not enough columns are
> specified. Otherwise we give an error for the row.
> ignoreUnknownValues: The opposite of allowJaggedRows, this means that if a
> user has _too_ many values for the schema, we will ignore the ones we don't
> recognize, rather than reporting an error for the row.
> skipHeaderRows: How many header lines at the start of the file to skip.
> encoding: UTF-8 vs. latin1, etc.
> compression: gzip, bzip, etc.
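
To make the options above concrete, here is a minimal sketch in Python using
only the standard csv module. It is not the Beam API: CsvOptions and read_rows
are hypothetical names chosen for illustration, and the error handling simply
raises ValueError for a bad row. Encoding and compression are left out of the
sketch.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class CsvOptions:
    # Hypothetical option names mirroring the list in the issue description.
    field_delimiter: str = ","
    quote: str = '"'  # the empty string disables quoting entirely
    allow_jagged_rows: bool = False
    ignore_unknown_values: bool = False
    skip_header_rows: int = 0


def read_rows(text, schema, opts):
    """Yield one dict per data row; raise ValueError for a malformed row."""
    source = io.StringIO(text, newline="")
    if opts.quote:
        reader = csv.reader(source, delimiter=opts.field_delimiter,
                            quotechar=opts.quote)
    else:
        reader = csv.reader(source, delimiter=opts.field_delimiter,
                            quoting=csv.QUOTE_NONE)
    for index, row in enumerate(reader):
        if index < opts.skip_header_rows:
            continue  # header rows are dropped, not parsed
        if len(row) < len(schema):
            if not opts.allow_jagged_rows:
                raise ValueError(f"row {index}: too few columns")
            # allowJaggedRows: missing trailing columns become nulls
            row = row + [None] * (len(schema) - len(row))
        elif len(row) > len(schema):
            if not opts.ignore_unknown_values:
                raise ValueError(f"row {index}: too many columns")
            # ignoreUnknownValues: extra trailing values are discarded
            row = row[:len(schema)]
        yield dict(zip(schema, row))


def naive_line_split(text):
    """Split on newlines without looking at quotes, as a file splitter would."""
    return text.splitlines()


# A tab-separated file with one header row, no quoting, and a short last row.
tsv = "name\tage\nann\t3\nbob\n"
tsv_opts = CsvOptions(field_delimiter="\t", quote="",
                      allow_jagged_rows=True, skip_header_rows=1)
tsv_rows = list(read_rows(tsv, ["name", "age"], tsv_opts))

# A quoted newline: one logical record, but two physical lines.
quoted = 'a,"b\nc",d\n'
physical_lines = naive_line_split(quoted)
logical_rows = list(read_rows(quoted, ["x", "y", "z"], CsvOptions()))
```

In the last example, physical_lines has two entries while logical_rows has one,
which is why a reader that splits a large file at newline boundaries cannot
also accept quoted newlines without extra work.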
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)