[jira] [Updated] (BEAM-2776) TextIO should support reading header lines

Eugene Kirpichov (JIRA) Thu, 17 Aug 2017 16:23:04 -0700

     [ https://issues.apache.org/jira/browse/BEAM-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Eugene Kirpichov updated BEAM-2776:
-----------------------------------
    Description: 
Users frequently request the ability to skip some header rows when reading text 
files.


https://stackoverflow.com/questions/28450554/skipping-header-rows-is-it-possible-with-cloud-dataflow
https://stackoverflow.com/questions/43551876/how-do-i-read-and-transform-csv-headers-before-bigqueryio-write
https://stackoverflow.com/questions/41297704/reading-csv-header-with-dataflow
https://stackoverflow.com/questions/45554466/google-cloud-dataflow-apache-beam-how-to-process-gzipped-csv-files-with-a-he
https://stackoverflow.com/questions/44045744/how-do-i-skip-header-files-when-reading-from-google-cloud-storage-in-a-dataflow

This is also relevant for reading file formats such as VCF; see the thread at 
https://lists.apache.org/thread.html/dc7e5c3ff20d9270f06c1a298ad949da018a83f900b22d58f6b4c468@%3Cdev.beam.apache.org%3E

Python partially supports this via skip_header_lines 
(https://github.com/apache/beam/pull/1771/files), but header lines can carry 
useful content, and the number of header lines is not fixed (e.g. in VCF).

We should figure out a good API for this and support this natively in TextIO. 
The API decisions would be:

- How do we specify how much of the beginning of each file is the header? 
Options could be, e.g., a certain number of lines, lines that start with a 
certain character, or a custom predicate.
- How do we make the header contents accessible to a user of TextIO? Since the 
header can be different in each file, we can't return it as a 
PCollectionView<List<String>>. Instead, I suppose that when you use a header, 
you'd need to specify a SerializableFunction<KV<List<String>, String>, T> or 
something like that for parsing (header, line) -> user type. Note that 
currently TextIO.Read does not support returning a user type anyway, so that'd 
need to be done too.
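
To make the two API questions above concrete, here is a minimal plain-Java 
sketch. It does not use Beam, and none of the names in it (HeaderSketch, 
HeaderSpec, firstLines, startingWith, matching, parseFile) are existing or 
proposed TextIO API; they are assumptions made up for illustration. It treats 
a file as an in-memory list of lines, takes the leading run of lines accepted 
by the header policy as the header, and passes (header, line) to a 
user-supplied parse function for every remaining line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Predicate;

public class HeaderSketch {

  /** Hypothetical header policy; NOT part of Beam. */
  public interface HeaderSpec {
    /** Returns true if the line at the given 0-based index belongs to the header. */
    boolean isHeader(int lineIndex, String line);

    /** Option 1: a fixed number of leading lines. */
    static HeaderSpec firstLines(int n) {
      return (index, line) -> index < n;
    }

    /** Option 2: leading lines that start with a given prefix. */
    static HeaderSpec startingWith(String prefix) {
      return (index, line) -> line.startsWith(prefix);
    }

    /** Option 3: leading lines accepted by a custom predicate. */
    static HeaderSpec matching(Predicate<String> predicate) {
      return (index, line) -> predicate.test(line);
    }
  }

  /**
   * Splits one file's lines into header and body, then applies the
   * (header, line) -> T parse function to each body line.
   */
  public static <T> List<T> parseFile(
      List<String> lines,
      HeaderSpec spec,
      BiFunction<List<String>, String, T> parseFn) {
    // The header is the longest leading run of lines the policy accepts.
    int headerEnd = 0;
    while (headerEnd < lines.size() && spec.isHeader(headerEnd, lines.get(headerEnd))) {
      headerEnd++;
    }
    List<String> header =
        Collections.unmodifiableList(new ArrayList<>(lines.subList(0, headerEnd)));

    List<T> records = new ArrayList<>();
    for (String line : lines.subList(headerEnd, lines.size())) {
      records.add(parseFn.apply(header, line));
    }
    return records;
  }

  public static void main(String[] args) {
    // A simplified VCF-like file: a variable number of '#'-prefixed header lines.
    List<String> vcfLike =
        Arrays.asList("##fileformat=VCFv4.2", "#CHROM\tPOS", "chr1\t100", "chr2\t200");

    List<String> records =
        parseFile(
            vcfLike,
            HeaderSpec.startingWith("#"),
            (header, line) -> header.size() + " header lines; record=" + line);

    records.forEach(System.out::println);
  }
}
```

The sketch sidesteps the harder parts of the question: a real TextIO 
implementation would presumably also have to make the header available when 
a file is read in several pieces, and would need the parse function to be 
serializable.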

  was:
Users frequently request the ability to skip some header rows when reading text 
files.


https://stackoverflow.com/questions/28450554/skipping-header-rows-is-it-possible-with-cloud-dataflow
https://stackoverflow.com/questions/43551876/how-do-i-read-and-transform-csv-headers-before-bigqueryio-write
https://stackoverflow.com/questions/41297704/reading-csv-header-with-dataflow
https://stackoverflow.com/questions/45554466/google-cloud-dataflow-apache-beam-how-to-process-gzipped-csv-files-with-a-he
https://stackoverflow.com/questions/44045744/how-do-i-skip-header-files-when-reading-from-google-cloud-storage-in-a-dataflow

This is also relevant for reading file formats such as VCF, see thread 
https://lists.apache.org/thread.html/dc7e5c3ff20d9270f06c1a298ad949da018a83f900b22d58f6b4c468@%3Cdev.beam.apache.org%3E

Python supports this partially https://github.com/apache/beam/pull/1771/files 
via skip_header_lines, but the header lines can have useful content, and the 
number of header lines is not fixed (in VCF).

We should figure out a good API for this and support this natively in TextIO. 
The API decisions would be:

- How do we specify how much of the beginning of each file is the header: 
options could be e.g. a certain number of lines; or lines that start with a 
certain character.
- How do we make the header contents accessible to a user of TextIO. Since the 
header can be different in each file, we can't return it as a 
PCollectionView<List<String>>. Instead I suppose, when you use a header, you'd 
need to specify a SerializableFunction<KV<List<String>, String>, T> or 
something like that for parsing (header, line) -> user type. Note that 
currently TextIO.Read does not support returning a user type anyway, so that'd 
need to be done too.


> TextIO should support reading header lines
> ------------------------------------------
>
>                 Key: BEAM-2776
>                 URL: https://issues.apache.org/jira/browse/BEAM-2776
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
