Dear Aviem,

That's a good point. TextIO seems to make a few "string oriented"
assumptions; see TextIO.Write's header and footer support
<https://github.com/apache/beam/pull/918> and the "IO design pattern: Decouple
Parsers and Coders" issue (BEAM-73)
<https://issues.apache.org/jira/browse/BEAM-73>, which came up during its
development.

IMHO it would be nice to make TextIO purely about textual content, and
perhaps go as far as removing the ability to pass in a coder. To support
encoded files, one could consider something like a FileIO that takes a coder,
writes encoded content to a file, and reads decoded content back from it. For
example, AvroIO could be thought of as a FileIO with an AvroCoder.
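
To make the idea a bit more concrete, here is a rough sketch in plain Java
of how a coder-driven file format could avoid the '\n' problem by writing a
length prefix before each encoded record instead of a delimiter after it.
None of this is Beam API: SimpleCoder, writeRecords and readRecords are
names I made up for illustration, and length-prefixed framing is just one
possible choice, not necessarily what a real FileIO would do.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; these types are NOT part of Beam.
final class LengthPrefixedRecords {

  /** Hypothetical stand-in for a coder: turns a value into bytes and back. */
  interface SimpleCoder<T> {
    byte[] encode(T value) throws IOException;

    T decode(byte[] bytes) throws IOException;
  }

  /** Writes each value as a 4-byte big-endian length followed by its encoded bytes. */
  static <T> void writeRecords(Iterable<T> values, SimpleCoder<T> coder, OutputStream out)
      throws IOException {
    DataOutputStream data = new DataOutputStream(out);
    for (T value : values) {
      byte[] encoded = coder.encode(value);
      data.writeInt(encoded.length);
      data.write(encoded);
    }
    data.flush();
  }

  /** Reads records until the stream ends; payload bytes are never scanned for a delimiter. */
  static <T> List<T> readRecords(SimpleCoder<T> coder, InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    List<T> result = new ArrayList<>();
    while (true) {
      int length;
      try {
        length = data.readInt();
      } catch (EOFException e) {
        // End of stream. Note: a file truncated inside a length prefix also ends up here.
        break;
      }
      byte[] encoded = new byte[length];
      data.readFully(encoded); // throws EOFException if the payload is cut short
      result.add(coder.decode(encoded));
    }
    return result;
  }
}
```

Because the reader knows each record's length up front, encoded bytes that
happen to contain '\n' round-trip unchanged. The obvious downside is that
such a file cannot be split at arbitrary offsets the way a line-delimited
text file can, so a real implementation would need something like sync
markers or block boundaries on top of this.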

IO authors, does this sit well with what you had in mind?

-Stas



On Mon, Jan 30, 2017 at 10:24 AM Aviem Zur <aviem...@gmail.com> wrote:

> Hi,
>
> While trying to use TextIO to write/read a binary file rather than String
> lines from a textual file I ran into an issue - the delimiter TextIO uses
> seems to be hardcoded '\n'.
> See `findSeparatorBounds` -
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
>
> The use case is to have a file of objects, encoded into bytes using a
> coder. However, '\n' is not a good delimiter here, as you can imagine.
> A similar pattern is found in Spark's `saveAsObjectFile`
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> where they use a more appropriate delimiter, to avoid such issues.
>
> I did not find any unit tests which use TextIO to read anything other than
> Strings.
>
