[
https://issues.apache.org/jira/browse/BEAM-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709427#comment-16709427
]
Tanguy MONFORT edited comment on BEAM-51 at 12/5/18 12:14 AM:
--------------------------------------------------------------
Hello,
I think the CSV file reader should also be able to provide the names of the
columns when a header exists, like a database reader in which we are able to
read a field given its name. This gives more flexibility when new columns are
added to an input, which is often the case in production systems (some people
will want to skip the header, but others will need to remember the names of the
columns).
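
To illustrate the point about reading fields by name, here is a minimal sketch
using Python's standard csv module rather than any Beam API. The helper
read_rows_by_name and the sample inputs are hypothetical and only show why
name-based access keeps working when a column is added to the input.

```python
import csv
import io


def read_rows_by_name(text):
    """Yield each data row as a dict keyed by the header's column names."""
    # csv.DictReader consumes the first row as the header.
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield dict(row)


# Two versions of the same (made-up) feed; the second adds a column
# in the middle, which shifts the position of "name".
old_input = "id,name\n1,alice\n"
new_input = "id,country,name\n1,FR,alice\n"

# Access by name is unaffected by the new column.
names_old = [row["name"] for row in read_rows_by_name(old_input)]
names_new = [row["name"] for row in read_rows_by_name(new_input)]
```

Code that indexed the second field by position would silently start reading
"country" after the change, whereas the lookup by "name" does not.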
Although I'm new to Beam, I feel that the CsvDataFormatFn() proposed here could
also be useful for parsing unbounded CSV flows (i.e. not only files).
Thanks.
> Implement a CSV file reader
> ---------------------------
>
> Key: BEAM-51
> URL: https://issues.apache.org/jira/browse/BEAM-51
> Project: Beam
> Issue Type: New Feature
> Components: io-ideas
> Reporter: Daniel Halperin
> Priority: Minor
>
> We should implement a CSV-based source.
> One possibility would be to support the same options as BigQuery.
> https://cloud.google.com/bigquery/preparing-data-for-bigquery#dataformats
> These options are:
> fieldDelimiter: allowing a custom delimiter... csv vs tsv, etc. My guess is
> this is critical. One common delimiter that people use is 'thorn' (þ).
> quote: Custom quote char. By default, this is '"', but this allows users to
> set it to something else, or, perhaps more commonly, remove it entirely (by
> setting it to the empty string). For example, tab-separated files generally
> don't need quotes.
> allowQuotedNewlines: whether you can quote newlines. In the official CSV RFC,
> newlines can be quoted; that is, you can have "a", "b\n", "c" in a single
> record. This makes splitting of large csv files impossible, so we should
> disallow quoted newlines by default unless the user really wants them (in
> which case, they'll get worse performance).
> allowJaggedRows: This allows inferring null if not enough columns are
> specified. Otherwise we give an error for the row.
> ignoreUnknownValues: The opposite of allowJaggedRows, this means that if a
> user has _too_ many values for the schema, we will ignore the ones we don't
> recognize, rather than reporting an error for the row.
> skipHeaderRows: How many header lines are in the file.
> encoding: UTF-8 vs. latin1, etc.
> compression: gzip, bzip, etc.
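
As a rough illustration of how several of the options above could interact,
the following sketch uses Python's standard csv module, not Beam or BigQuery
code. CsvOptions and parse_csv are hypothetical names; only fieldDelimiter,
quote, allowJaggedRows, ignoreUnknownValues and skipHeaderRows are modelled,
and encoding, compression and quoted newlines are left out.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class CsvOptions:
    """Hypothetical option holder mirroring the names in the description."""
    field_delimiter: str = ","
    quote: str = '"'  # empty string disables quote handling
    allow_jagged_rows: bool = False
    ignore_unknown_values: bool = False
    skip_header_rows: int = 0


def parse_csv(text: str, schema: List[str],
              opts: CsvOptions) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one dict per data row, applying the options above."""
    if opts.quote:
        reader = csv.reader(io.StringIO(text),
                            delimiter=opts.field_delimiter,
                            quotechar=opts.quote)
    else:
        # An empty quote option means quote characters are ordinary data.
        reader = csv.reader(io.StringIO(text),
                            delimiter=opts.field_delimiter,
                            quoting=csv.QUOTE_NONE)
    for index, row in enumerate(reader):
        if index < opts.skip_header_rows:
            continue
        if len(row) < len(schema):
            if not opts.allow_jagged_rows:
                raise ValueError(f"row {index}: too few columns")
            # Missing trailing columns are inferred as null.
            row = row + [None] * (len(schema) - len(row))
        elif len(row) > len(schema):
            if not opts.ignore_unknown_values:
                raise ValueError(f"row {index}: too many columns")
            # Extra values beyond the schema are dropped.
            row = row[:len(schema)]
        yield dict(zip(schema, row))


# Tab-separated input with one header row, one short row and one long row.
sample = "h1\th2\th3\n1\t2\n3\t4\t5\t6\n"
lenient = CsvOptions(field_delimiter="\t", quote="",
                     allow_jagged_rows=True, ignore_unknown_values=True,
                     skip_header_rows=1)
rows = list(parse_csv(sample, ["a", "b", "c"], lenient))
```

With the default (strict) options, both the short row and the long row would
raise an error instead of being padded or truncated.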