[
https://issues.apache.org/jira/browse/BEAM-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eugene Kirpichov updated BEAM-2776:
-----------------------------------
Description:
Users frequently request the ability to skip some header rows when reading text
files.
https://stackoverflow.com/questions/28450554/skipping-header-rows-is-it-possible-with-cloud-dataflow
https://stackoverflow.com/questions/43551876/how-do-i-read-and-transform-csv-headers-before-bigqueryio-write
https://stackoverflow.com/questions/41297704/reading-csv-header-with-dataflow
https://stackoverflow.com/questions/45554466/google-cloud-dataflow-apache-beam-how-to-process-gzipped-csv-files-with-a-he
https://stackoverflow.com/questions/44045744/how-do-i-skip-header-files-when-reading-from-google-cloud-storage-in-a-dataflow
This is also relevant for reading file formats such as VCF; see the thread at
https://lists.apache.org/thread.html/dc7e5c3ff20d9270f06c1a298ad949da018a83f900b22d58f6b4c468@%3Cdev.beam.apache.org%3E
Python partially supports this via skip_header_lines
(https://github.com/apache/beam/pull/1771/files), but header lines can carry
useful content, and the number of header lines is not fixed (e.g. in VCF).
We should figure out a good API for this and support this natively in TextIO.
The API decisions would be:
- How do we specify how much of the beginning of each file is the header?
Options could be, e.g., a certain number of lines, lines that start with a
certain character, or a custom predicate.
- How do we make the header contents accessible to a user of TextIO? Since the
header can be different in each file, we can't return it as a
PCollectionView<List<String>>. Instead, I suppose that when you use a header,
you'd need to specify a SerializableFunction<KV<List<String>, String>, T> or
something like that for parsing (header, line) -> user type. Note that
TextIO.Read currently does not support returning a user type anyway, so that
would need to be done too.
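To make the two decisions above concrete, here is a rough sketch in plain Java
with no Beam dependency. Everything in it is hypothetical: the class name
HeaderAwareParse and the method parseFile are made up for illustration, and
java.util.function.BiFunction stands in for the
SerializableFunction<KV<List<String>, String>, T> idea. It is not an existing
or proposed TextIO API. It covers only the "custom predicate" option: the
header is the leading run of lines matching the predicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names exist in Beam.
final class HeaderAwareParse {

    /**
     * Treats the leading run of lines matching isHeader as the file's header,
     * then maps each remaining line through parse(header, line).
     */
    public static <T> List<T> parseFile(
            List<String> lines,
            Predicate<String> isHeader,
            BiFunction<List<String>, String, T> parse) {
        // Find the first line that is not part of the header.
        int firstDataLine = 0;
        while (firstDataLine < lines.size()
                && isHeader.test(lines.get(firstDataLine))) {
            firstDataLine++;
        }
        List<String> header = new ArrayList<>(lines.subList(0, firstDataLine));

        // Every data line is parsed together with this file's own header.
        List<T> records = new ArrayList<>();
        for (String line : lines.subList(firstDataLine, lines.size())) {
            records.add(parse.apply(header, line));
        }
        return records;
    }

    public static void main(String[] args) {
        // A VCF-like file: a variable number of '#' lines, then data.
        List<String> vcfLike = Arrays.asList(
                "##fileformat=VCFv4.2",
                "#CHROM\tPOS\tID",
                "1\t100\trs1",
                "2\t200\trs2");

        List<String> records = parseFile(
                vcfLike,
                line -> line.startsWith("#"),
                (header, line) -> {
                    // Use the last header line as the column names.
                    String columns = header.get(header.size() - 1).substring(1);
                    String[] names = columns.split("\t");
                    String[] values = line.split("\t");
                    return names[1] + "=" + values[1];
                });

        System.out.println(records); // prints [POS=100, POS=200]
    }
}
```

Because the parse function receives the header of the file the line came from,
nothing here relies on all files sharing one header. A fixed line count would
need a different way to mark the header, since a stateless per-line predicate
cannot count lines.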
was:
Users frequently request the ability to skip some header rows when reading text
files.
https://stackoverflow.com/questions/28450554/skipping-header-rows-is-it-possible-with-cloud-dataflow
https://stackoverflow.com/questions/43551876/how-do-i-read-and-transform-csv-headers-before-bigqueryio-write
https://stackoverflow.com/questions/41297704/reading-csv-header-with-dataflow
https://stackoverflow.com/questions/45554466/google-cloud-dataflow-apache-beam-how-to-process-gzipped-csv-files-with-a-he
https://stackoverflow.com/questions/44045744/how-do-i-skip-header-files-when-reading-from-google-cloud-storage-in-a-dataflow
This is also relevant for reading file formats such as VCF, see thread
https://lists.apache.org/thread.html/dc7e5c3ff20d9270f06c1a298ad949da018a83f900b22d58f6b4c468@%3Cdev.beam.apache.org%3E
Python supports this partially https://github.com/apache/beam/pull/1771/files
via skip_header_lines, but the header lines can have useful content, and the
number of header lines is not fixed (in VCF).
We should figure out a good API for this and support this natively in TextIO.
The API decisions would be:
- How do we specify how much of the beginning of each file is the header:
options could be e.g. a certain number of lines; or lines that start with a
certain character.
- How do we make the header contents accessible to a user of TextIO. Since the
header can be different in each file, we can't return it as a
PCollectionView<List<String>>. Instead I suppose, when you use a header, you'd
need to specify a SerializableFunction<KV<List<String>, String>, T> or
something like that for parsing (header, line) -> user type. Note that
currently TextIO.Read does not support returning a user type anyway, so that'd
need to be done too.
> TextIO should support reading header lines
> ------------------------------------------
>
> Key: BEAM-2776
> URL: https://issues.apache.org/jira/browse/BEAM-2776
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Reporter: Eugene Kirpichov