[ 
https://issues.apache.org/jira/browse/BEAM-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545921#comment-17545921
 ] 

Danny McCormick commented on BEAM-51:
-------------------------------------

This isn't actually real, but this issue has been migrated to 
https://github.com/apache/beam/issues/17832

> Implement a CSV file reader
> ---------------------------
>
>                 Key: BEAM-51
>                 URL: https://issues.apache.org/jira/browse/BEAM-51
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-ideas
>            Reporter: Dan Halperin
>            Priority: P3
>
> We should implement a CSV-based source.
> One possibility would be to support the same options as BigQuery. 
> https://cloud.google.com/bigquery/preparing-data-for-bigquery#dataformats 
> These options are:
> fieldDelimiter: allowing a custom delimiter... csv vs tsv, etc. My guess is 
> this is critical. One common delimiter that people use is 'thorn' (þ).
> quote: Custom quote char. By default, this is '"', but this allows users to 
> set it to something else, or, perhaps more commonly, remove it entirely (by 
> setting it to the empty string). For example, tab-separated files generally 
> don't need quotes.
> allowQuotedNewlines: whether you can quote newlines. In the CSV RFC 
> (RFC 4180), newlines can be quoted; that is, you can have "a", "b\n", "c" 
> in a single record. This makes splitting of large csv files impossible, so 
> we should disallow quoted newlines by default unless the user really wants 
> them (in which case, they'll get worse performance).
> allowJaggedRows: This allows inferring null if not enough columns are 
> specified. Otherwise we give an error for the row.
> ignoreUnknownValues: The opposite of allowJaggedRows, this means that if a 
> user has _too_ many values for the schema, we will ignore the ones we don't 
> recognize, rather than reporting an error for the row.
> skipHeaderRows: How many header lines are in the file.
> encoding: UTF-8 vs. latin1, etc.
> compression: gzip, bzip, etc.
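
For illustration only, here is a rough sketch of how a few of the options
listed above might behave, written against Python's standard csv module
rather than any Beam API. The names CsvOptions and read_rows are hypothetical
and do not exist in Beam. The sketch reads the whole input in one piece, so
it does not address file splitting, allowQuotedNewlines, encoding, or
compression.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CsvOptions:
    # Hypothetical names mirroring the options in the issue description.
    field_delimiter: str = ","
    quote: str = '"'  # empty string means "no quoting at all"
    allow_jagged_rows: bool = False
    ignore_unknown_values: bool = False
    skip_header_rows: int = 0


def read_rows(text: str, num_columns: int,
              opts: CsvOptions) -> List[List[Optional[str]]]:
    """Parse text into rows of exactly num_columns values."""
    if opts.quote:
        reader = csv.reader(io.StringIO(text, newline=""),
                            delimiter=opts.field_delimiter,
                            quotechar=opts.quote)
    else:
        # With QUOTE_NONE the reader treats quote characters as plain data.
        reader = csv.reader(io.StringIO(text, newline=""),
                            delimiter=opts.field_delimiter,
                            quoting=csv.QUOTE_NONE)

    rows: List[List[Optional[str]]] = []
    for index, row in enumerate(reader):
        if index < opts.skip_header_rows:
            continue
        values: List[Optional[str]] = list(row)
        if len(values) < num_columns:
            if not opts.allow_jagged_rows:
                raise ValueError(f"row {index}: too few values")
            # allowJaggedRows: infer null for the missing trailing columns.
            values += [None] * (num_columns - len(values))
        elif len(values) > num_columns:
            if not opts.ignore_unknown_values:
                raise ValueError(f"row {index}: too many values")
            # ignoreUnknownValues: drop values beyond the schema.
            values = values[:num_columns]
        rows.append(values)
    return rows
```

A call such as read_rows(text, 2, CsvOptions(field_delimiter="\t", quote=""))
would correspond to the tab-separated, quote-free case mentioned under the
quote option.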



--
This message was sent by Atlassian Jira
(v8.20.7#820007)