[
https://issues.apache.org/jira/browse/BEAM-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709427#comment-16709427
]
Tanguy MONFORT commented on BEAM-51:
------------------------------------
Hello,
I think the CSV file reader should also expose the column names when a header
exists, like a database reader in which we can read a field given its name.
This gives more flexibility when new columns are added to an input, which is
often the case in production systems.
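To illustrate reading fields by name, here is a minimal sketch. It uses
Python's standard csv module rather than any Beam API, and the function name
read_by_column_name and the sample inputs are hypothetical:

```python
import csv
import io


def read_by_column_name(text, wanted):
    """Yield one dict per data row, keeping only the columns named in `wanted`."""
    # DictReader treats the first row as the header and keys each row by it.
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield {name: row[name] for name in wanted}


# Hypothetical inputs: the second one gained an "email" column in the middle.
old_input = "id,name\n1,alice\n"
new_input = "id,email,name\n1,a@example.com,alice\n"

for sample in (old_input, new_input):
    for record in read_by_column_name(sample, ["id", "name"]):
        print(record)
```

Because the lookup is by name and not by position, the same call keeps working
after the extra column is added.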
Although I'm new to Beam, I feel that the CsvDataFormatFn() proposed here could
also be useful for parsing unbounded CSV streams (i.e. not only files).
Thanks.
> Implement a CSV file reader
> ---------------------------
>
> Key: BEAM-51
> URL: https://issues.apache.org/jira/browse/BEAM-51
> Project: Beam
> Issue Type: New Feature
> Components: io-ideas
> Reporter: Daniel Halperin
> Priority: Minor
>
> We should implement a CSV-based source.
> One possibility would be to support the same options as BigQuery.
> https://cloud.google.com/bigquery/preparing-data-for-bigquery#dataformats
> These options are:
> fieldDelimiter: allowing a custom delimiter... csv vs tsv, etc. My guess is
> this is critical. One common delimiter that people use is 'thorn' (þ).
> quote: Custom quote char. By default, this is '"', but this allows users to
> set it to something else, or, perhaps more commonly, remove it entirely (by
> setting it to the empty string). For example, tab-separated files generally
> don't need quotes.
> allowQuotedNewlines: whether you can quote newlines. In the official CSV RFC,
> newlines can be quoted; that is, you can have "a", "b\n", "c" in a single
> record. This makes splitting of large csv files impossible, so we should
> disallow quoted newlines by default unless the user really wants them (in
> which case, they'll get worse performance).
> allowJaggedRows: This allows inferring null if not enough columns are
> specified. Otherwise we give an error for the row.
> ignoreUnknownValues: The opposite of allowJaggedRows, this means that if a
> user has _too_ many values for the schema, we will ignore the ones we don't
> recognize, rather than reporting an error for the row.
> skipHeaderRows: How many header lines are in the file.
> encoding: UTF-8 vs. latin1, etc.
> compression: gzip, bzip, etc.
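
As a rough sketch of how some of the options quoted above could fit together,
the following uses Python's standard csv module, not Beam. The CsvOptions class
and parse_csv function are hypothetical names, and only fieldDelimiter, quote,
allowJaggedRows, ignoreUnknownValues and skipHeaderRows are modelled;
allowQuotedNewlines, encoding and compression are left out:

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class CsvOptions:
    """Hypothetical container mirroring a subset of the options listed above."""
    field_delimiter: str = ","
    quote: str = '"'  # empty string disables quote handling
    allow_jagged_rows: bool = False
    ignore_unknown_values: bool = False
    skip_header_rows: int = 0


def parse_csv(text, columns, opts=None):
    """Parse `text` into a list of dicts keyed by `columns`."""
    opts = opts or CsvOptions()
    if opts.quote:
        reader = csv.reader(io.StringIO(text),
                            delimiter=opts.field_delimiter,
                            quotechar=opts.quote)
    else:
        # QUOTE_NONE makes the reader treat quote characters as ordinary data.
        reader = csv.reader(io.StringIO(text),
                            delimiter=opts.field_delimiter,
                            quoting=csv.QUOTE_NONE)
    rows = []
    for index, values in enumerate(reader):
        if index < opts.skip_header_rows:
            continue
        if len(values) < len(columns):
            if not opts.allow_jagged_rows:
                raise ValueError(f"row {index}: too few values")
            # Pad missing trailing columns with None (inferred nulls).
            values = values + [None] * (len(columns) - len(values))
        elif len(values) > len(columns):
            if not opts.ignore_unknown_values:
                raise ValueError(f"row {index}: too many values")
            # Drop values that have no matching column.
            values = values[:len(columns)]
        rows.append(dict(zip(columns, values)))
    return rows


# Example: a thorn-delimited input with one header row.
print(parse_csv("idþname\n1þalice\n", ["id", "name"],
                CsvOptions(field_delimiter="þ", skip_header_rows=1)))
```

Rows that are too short or too long raise an error by default, matching the
description above, and are padded or truncated only when the corresponding
flag is set.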