Christopher Hebert created BEAM-2586:
----------------------------------------
Summary: Accommodate custom delimiters in TextIO
Key: BEAM-2586
URL: https://issues.apache.org/jira/browse/BEAM-2586
Project: Beam
Issue Type: New Feature
Components: sdk-java-core
Reporter: Christopher Hebert
Assignee: Davor Bonaci
Priority: Minor
We frequently process text files whose records are delimited by something other
than newlines, including files delimited only by end of file (that is, the whole
file is a single record).
First option:
When we want to delimit by commas (or something else), we could use TextIO to
read in line by line and apply a transform to split each line on commas. When
we want to delimit by whole file, we could combine the elements of the
PCollection output from TextIO that come from the same file into one element.
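To make the first option concrete, here is a minimal sketch in plain Java, not Beam code. It assumes ordinary lists stand in for the PCollection of lines that TextIO would produce; the class and method names (LineThenSplit, splitOnCommas, joinWholeFile) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; plain lists stand in for a PCollection of lines.
final class LineThenSplit {
    // Splits every already-read line on commas, like a per-line flat-map step.
    // The -1 limit keeps trailing empty fields instead of dropping them.
    static List<String> splitOnCommas(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.addAll(Arrays.asList(line.split(",", -1)));
        }
        return out;
    }

    // Re-joins the lines of one file into a single element.
    // Assumes the lines are already in file order and that "\n" is an
    // acceptable stand-in for whatever terminator the file really used.
    static String joinWholeFile(List<String> linesOfOneFile) {
        return String.join("\n", linesOfOneFile);
    }
}
```

The sketch also hints at why this option is unattractive: the reader has already consumed the original line terminators, and a PCollection carries no ordering guarantee, so recombining a file in a real pipeline would need extra bookkeeping that the sketch leaves out.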
Second option:
Instead of complicating (and slowing) our pipelines with the methods above, we
could write a custom FileBasedSource for each use case.
Third option:
Preferably, we'd like to generalize TextIO to accept delimiters other than the
defaults: \n, \r, and \r\n.
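The behavior we have in mind can be sketched in plain Java as follows. This is an illustration only, not TextIO's implementation or the proposed patch: the class DelimitedRecords and its methods are hypothetical, the input is assumed to fit in memory as a byte array, and records are assumed to be UTF-8 text. A null delimiter falls back to the default line endings, and an empty delimiter is treated here as "never matches", so the whole input becomes one record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of delimiter-aware record splitting (not Beam code).
final class DelimitedRecords {
    // Returns true if the full delimiter occurs in data starting at pos.
    static boolean matchesAt(byte[] data, int pos, byte[] delimiter) {
        if (pos + delimiter.length > data.length) {
            return false;
        }
        for (int i = 0; i < delimiter.length; i++) {
            if (data[pos + i] != delimiter[i]) {
                return false;
            }
        }
        return true;
    }

    // Splits data into records. A null delimiter means the defaults
    // (\n, \r, \r\n); an empty delimiter never matches.
    static List<String> split(byte[] data, byte[] delimiter) {
        List<String> records = new ArrayList<>();
        int start = 0; // first byte of the current record
        int pos = 0;   // scan position
        while (pos < data.length) {
            int delimLen = 0; // length of the delimiter found at pos, 0 if none
            if (delimiter != null) {
                if (delimiter.length > 0 && matchesAt(data, pos, delimiter)) {
                    delimLen = delimiter.length;
                }
            } else if (data[pos] == '\r') {
                // Treat \r\n as one delimiter rather than two.
                delimLen = (pos + 1 < data.length && data[pos + 1] == '\n') ? 2 : 1;
            } else if (data[pos] == '\n') {
                delimLen = 1;
            }
            if (delimLen > 0) {
                records.add(new String(data, start, pos - start, StandardCharsets.UTF_8));
                pos += delimLen;
                start = pos;
            } else {
                pos++;
            }
        }
        // Trailing bytes without a final delimiter still form a record.
        if (start < data.length) {
            records.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return records;
    }
}
```

For example, calling split with the bytes of "a,b,c" and a "," delimiter yields three records, while passing an empty delimiter returns the input as a single record, which covers the "delimited only by end of file" case.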
I'll attach a pull request showing how we envision this generalization of
TextIO.
If this is not the direction Beam would like to go with TextIO, then we'll
stick to maintaining our own TextIO or our own FileBasedSources to achieve this
functionality.