Stas' comment is the right one. The "canonical" use of `withCoder` on TextIO
is with something like a TextualIntegerCoder
<https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/TextualIntegerCoder.java>,
but that should almost certainly be replaced with TextIO.Read |
ParDo.of(Parse integer), i.e. read the lines as Strings and parse them in a
separate step. The `withCoder` functions need to be removed or replaced.

For "holding a file of arbitrary records" -- simply writing
delimiter-separated records through TextIO is probably not a good choice.
Specifically, splitting breaks whenever the delimiter can appear inside an
encoded record (which is the case for almost any coder). A better option is
to design a file format to hold arbitrary records. E.g., an Avro file where
each record is just a byte[].
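
To make the delimiter problem concrete, here is a rough sketch in plain Java
(no Beam involved). The class and method names are made up for illustration,
and the length-prefixed layout is just one simple way to frame records; it is
not how TextIO or Avro work internally. It encodes the ints 10 and 7 as 4-byte
big-endian values, which means the first record ends in the byte 0x0A ('\n').

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class RecordFraming {

  /** Naive framing: write each record followed by a '\n' delimiter. */
  static byte[] writeNewlineDelimited(List<byte[]> records) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] r : records) {
      out.write(r, 0, r.length);
      out.write('\n');
    }
    return out.toByteArray();
  }

  /** Splits on every '\n' byte, including ones that belong to a record. */
  static List<byte[]> readNewlineDelimited(byte[] data) {
    List<byte[]> records = new ArrayList<>();
    int start = 0;
    for (int i = 0; i < data.length; i++) {
      if (data[i] == '\n') {
        records.add(Arrays.copyOfRange(data, start, i));
        start = i + 1;
      }
    }
    return records;
  }

  /** Length-prefixed framing: a 4-byte big-endian length, then the record bytes. */
  static byte[] writeLengthPrefixed(List<byte[]> records) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] r : records) {
      byte[] length = ByteBuffer.allocate(4).putInt(r.length).array();
      out.write(length, 0, length.length);
      out.write(r, 0, r.length);
    }
    return out.toByteArray();
  }

  /** Reads records back by length, so the record contents are never inspected. */
  static List<byte[]> readLengthPrefixed(byte[] data) {
    List<byte[]> records = new ArrayList<>();
    ByteBuffer buf = ByteBuffer.wrap(data);
    while (buf.hasRemaining()) {
      byte[] r = new byte[buf.getInt()];
      buf.get(r);
      records.add(r);
    }
    return records;
  }

  public static void main(String[] args) {
    // 10 encodes as 00 00 00 0A, so its last byte collides with '\n'.
    List<byte[]> records =
        Arrays.asList(new byte[] {0, 0, 0, 10}, new byte[] {0, 0, 0, 7});
    System.out.println(readNewlineDelimited(writeNewlineDelimited(records)).size());
    System.out.println(readLengthPrefixed(writeLengthPrefixed(records)).size());
  }
}
```

With the newline framing the two records come back as three mangled ones; with
the length prefix they round-trip intact. Note that a plain length-prefixed
stream is still not splittable on its own, which is part of why a purpose-built
container format such as Avro (which adds sync markers between blocks) is the
better fit.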

Dan

On Mon, Jan 30, 2017 at 2:52 AM, Aviem Zur <[email protected]> wrote:

> The Javadoc of TextIO states:
>
>  * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
>  * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
>  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
>  * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
>
> However, as I stated, `withCoder` doesn't seem to have tests, and probably
> won't work given the hard-coded '\n' delimiter.
>
> On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > Hi Aviem,
> >
> > TextIO is not designed to write/read binary files: it's pure text, so
> > String.
> >
> > Regards
> > JB
> >
> > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > Hi,
> > >
> > > While trying to use TextIO to write/read a binary file rather than String
> > > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > > seems to be hardcoded '\n'.
> > > See `findSeparatorBounds` -
> > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > >
> > > The use case is to have a file of objects, encoded into bytes using a
> > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > where they use a more appropriate delimiter, to avoid such issues.
> > >
> > > I did not find any unit tests which use TextIO to read anything other than
> > > Strings.
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > [email protected]
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
