The Javadoc of TextIO states:

 * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
 * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
 * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
 * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.

However, as I mentioned, `withCoder` doesn't seem to have any tests, and it
probably won't work for binary data given the hard-coded '\n' delimiter.
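
To make the concern concrete, here is a minimal sketch in plain Java. It does
not use Beam at all: `FramingSketch`, `encode`, and the other method names are
hypothetical stand-ins for a coder and a line-based reader, not anything from
TextIO. It only shows that encoded bytes can themselves contain 0x0A, so
splitting on '\n' miscounts records, whereas a length prefix (one possible
alternative, assumed here purely for illustration) round-trips them.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FramingSketch {

  // Hypothetical stand-in for a coder: encodes an int as 4 big-endian bytes.
  public static byte[] encode(int value) {
    return ByteBuffer.allocate(4).putInt(value).array();
  }

  // Writes each record followed by '\n', then splits on '\n' again,
  // the way a line-oriented reader would.
  public static List<byte[]> newlineRoundTrip(List<byte[]> records) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] r : records) {
      out.write(r, 0, r.length);
      out.write('\n');
    }
    List<byte[]> result = new ArrayList<>();
    ByteArrayOutputStream current = new ByteArrayOutputStream();
    for (byte b : out.toByteArray()) {
      if (b == '\n') {
        result.add(current.toByteArray());
        current.reset();
      } else {
        current.write(b);
      }
    }
    return result;
  }

  // Writes each record as a 4-byte length followed by the payload.
  public static byte[] writeLengthPrefixed(List<byte[]> records) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] r : records) {
      byte[] len = ByteBuffer.allocate(4).putInt(r.length).array();
      out.write(len, 0, len.length);
      out.write(r, 0, r.length);
    }
    return out.toByteArray();
  }

  // Reads records back by length, never inspecting the payload bytes.
  public static List<byte[]> readLengthPrefixed(byte[] data) {
    ByteBuffer buf = ByteBuffer.wrap(data);
    List<byte[]> result = new ArrayList<>();
    while (buf.hasRemaining()) {
      byte[] r = new byte[buf.getInt()];
      buf.get(r);
      result.add(r);
    }
    return result;
  }

  public static void main(String[] args) {
    // encode(10) is 00 00 00 0A, so its last byte equals '\n'.
    List<byte[]> records = Arrays.asList(encode(1), encode(10));
    System.out.println("newline-delimited: "
        + newlineRoundTrip(records).size() + " records");   // prints 3, not 2
    System.out.println("length-prefixed: "
        + readLengthPrefixed(writeLengthPrefixed(records)).size() + " records"); // prints 2
  }
}
```

With two input records, the newline-based round trip returns three fragments
because the second record ends in 0x0A, while the length-prefixed round trip
returns the original two.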

On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Aviem,
>
> TextIO is not designed to write/read binary files: it's pure text, so
> String.
>
> Regards
> JB
>
> On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > Hi,
> >
> > While trying to use TextIO to write/read a binary file rather than String
> > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > seems to be hardcoded '\n'.
> > See `findSeparatorBounds` -
> >
> > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> >
> > The use case is to have a file of objects, encoded into bytes using a
> > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > A similar pattern is found in Spark's `saveAsObjectFile`
> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > where they use a more appropriate delimiter, to avoid such issues.
> >
> > I did not find any unit tests which use TextIO to read anything other
> > than Strings.
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
