[jira] [Commented] (BEAM-51) Implement a CSV file reader

Tanguy MONFORT (JIRA) Tue, 04 Dec 2018 16:13:22 -0800


    [ 
https://issues.apache.org/jira/browse/BEAM-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709427#comment-16709427
 ]


Tanguy MONFORT commented on BEAM-51:
------------------------------------

Hello,

I think the CSV file reader should also be able to use the names of the 
columns when a header exists, like a database reader in which we can read a 
field given its name. This gives more flexibility when new columns are added 
to an input, which is often the case in production systems.
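
To illustrate what I mean, here is a minimal sketch in plain Python (not Beam 
code). The sample data and the read_by_name helper are made up for this 
example; only csv.DictReader is a real standard-library API.

```python
import csv
import io

# Hypothetical sample input: a header row followed by two records.
SAMPLE = "id,name,city\n1,Ada,London\n2,Linus,Helsinki\n"


def read_by_name(text, wanted):
    """Yield only the requested columns, looked up by header name."""
    # DictReader uses the first row as the header and keys each record by it.
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield {name: row[name] for name in wanted}


rows = list(read_by_name(SAMPLE, ["name", "id"]))
```

Because fields are looked up by name rather than by position, inserting a new 
column into the input does not change what the caller reads.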

Although I'm new to Beam, I feel that the CsvDataFormatFn() proposed here 
could also be useful for parsing unbounded CSV flows (i.e. not only files).
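
A rough sketch of the per-record idea, again in plain Python: parse_record is 
a hypothetical helper, not the CsvDataFormatFn() from this issue, and it 
assumes the header is known up front and that no record contains a quoted 
newline.

```python
import csv


def parse_record(line, header, delimiter=","):
    """Parse a single CSV line into a dict keyed by column name."""
    # csv.reader accepts any iterable of strings, so a one-element list works.
    values = next(csv.reader([line], delimiter=delimiter))
    return dict(zip(header, values))


# Hypothetical header and incoming lines; in a stream these would arrive one by one.
HEADER = ["id", "name"]
incoming = iter(['1,"Smith, Ada"', "2,Linus"])
parsed = [parse_record(line, HEADER) for line in incoming]
```

Since each line is parsed on its own, the same function could be applied 
element by element to a flow that never ends.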

Thanks.

> Implement a CSV file reader
> ---------------------------
>
>                 Key: BEAM-51
>                 URL: https://issues.apache.org/jira/browse/BEAM-51
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-ideas
>            Reporter: Daniel Halperin
>            Priority: Minor
>
> We should implement a CSV-based source.
> One possibility would be to support the same options as BigQuery. 
> https://cloud.google.com/bigquery/preparing-data-for-bigquery#dataformats 
> These options are:
> fieldDelimiter: allows a custom delimiter (CSV vs. TSV, etc.). My guess is 
> this is critical. One common delimiter that people use is 'thorn' (þ).
> quote: Custom quote char. By default, this is '"', but this allows users to 
> set it to something else, or, perhaps more commonly, remove it entirely (by 
> setting it to the empty string). For example, tab-separated files generally 
> don't need quotes.
> allowQuotedNewlines: whether you can quote newlines. In the official CSV RFC, 
> newlines can be quoted; that is, you can have "a", "b\n", "c" in a single 
> record. This makes splitting of large CSV files impossible, so we should 
> disallow quoted newlines by default unless the user really wants them (in 
> which case, they'll get worse performance).
> allowJaggedRows: This allows inferring null if not enough columns are 
> specified. Otherwise we give an error for the row.
> ignoreUnknownValues: The opposite of allowJaggedRows, this means that if a 
> user has _too_ many values for the schema, we will ignore the ones we don't 
> recognize, rather than reporting an error for the row.
> skipHeaderRows: How many header lines are in the file.
> encoding: UTF-8 vs. latin1, etc.
> compression: gzip, bzip, etc.
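
As a sketch of how some of the options quoted above could interact, here is a 
small plain-Python example. The CsvOptions class and the parse function are 
hypothetical (they are not an existing Beam or BigQuery API); only the option 
names come from the issue description. Encoding, compression and quoted 
newlines are left out.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class CsvOptions:
    # Field names mirror the options in the issue; the class itself is hypothetical.
    field_delimiter: str = ","
    quote: str = '"'
    allow_jagged_rows: bool = False
    ignore_unknown_values: bool = False
    skip_header_rows: int = 0


def parse(text, columns, opts):
    """Return (rows, errors): dicts keyed by column name, and rejected row indexes."""
    if opts.quote:
        reader = csv.reader(io.StringIO(text), delimiter=opts.field_delimiter,
                            quotechar=opts.quote)
    else:
        # An empty quote string disables quote handling entirely.
        reader = csv.reader(io.StringIO(text), delimiter=opts.field_delimiter,
                            quoting=csv.QUOTE_NONE)
    rows, errors = [], []
    for index, values in enumerate(reader):
        if index < opts.skip_header_rows:
            continue
        if len(values) < len(columns):
            if not opts.allow_jagged_rows:
                errors.append(index)
                continue
            # Too few values: infer null for the missing trailing columns.
            values = values + [None] * (len(columns) - len(values))
        elif len(values) > len(columns):
            if not opts.ignore_unknown_values:
                errors.append(index)
                continue
            # Too many values: drop the ones the schema does not know.
            values = values[:len(columns)]
        rows.append(dict(zip(columns, values)))
    return rows, errors


# Hypothetical tab-separated input: one header row, a short row, a long row, a good row.
TEXT = "a\tb\tc\n1\t2\n3\t4\t5\t6\n7\t8\t9\n"
COLUMNS = ["a", "b", "c"]
strict = parse(TEXT, COLUMNS, CsvOptions(field_delimiter="\t", quote="", skip_header_rows=1))
lenient = parse(TEXT, COLUMNS, CsvOptions(field_delimiter="\t", quote="", skip_header_rows=1,
                                          allow_jagged_rows=True, ignore_unknown_values=True))
```

With the defaults, the short and the long row are reported as errors; with 
both flags enabled, they are padded or truncated to fit the schema instead.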



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
