The use of Coder in TextIO is a long-standing design issue: coders are not intended for general-purpose conversion of objects to and from bytes. Their only proper use is to let the runner materialize and restore objects when the runner decides it is necessary. IMO it should have been called LineIO, been documented as reading lines of text as String, and not had a withCoder parameter at all.
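To make the delimiter problem concrete, here is a minimal sketch in plain Java (no Beam dependency). The class FramingSketch, its methods, and the 4-byte big-endian int encoding are hypothetical stand-ins for illustration, not Beam APIs. Length-prefixed framing is just one possible alternative that I am assuming here; it is not a description of what TextIO or Spark actually do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FramingSketch {
  // Hypothetical stand-in for a coder: an int as 4 big-endian bytes.
  public static byte[] encode(int value) {
    return ByteBuffer.allocate(4).putInt(value).array();
  }

  // Writes each record followed by '\n', as a line-oriented sink would.
  public static byte[] writeNewlineDelimited(List<byte[]> records) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] record : records) {
      out.write(record);
      out.write('\n');
    }
    return out.toByteArray();
  }

  // Splits on every '\n' byte, as a line-oriented reader would.
  public static List<byte[]> splitOnNewline(byte[] data) {
    List<byte[]> result = new ArrayList<>();
    int start = 0;
    for (int i = 0; i < data.length; i++) {
      if (data[i] == '\n') {
        result.add(Arrays.copyOfRange(data, start, i));
        start = i + 1;
      }
    }
    if (start < data.length) {
      result.add(Arrays.copyOfRange(data, start, data.length));
    }
    return result;
  }

  // Alternative framing (an assumption of this sketch): length, then payload.
  public static byte[] writeLengthPrefixed(List<byte[]> records) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (byte[] record : records) {
      out.writeInt(record.length);
      out.write(record);
    }
    out.flush();
    return bytes.toByteArray();
  }

  // Reads records back by length, so payload bytes are never inspected.
  public static List<byte[]> readLengthPrefixed(byte[] data) throws IOException {
    List<byte[]> result = new ArrayList<>();
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    while (in.available() > 0) {
      byte[] record = new byte[in.readInt()];
      in.readFully(record);
      result.add(record);
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    // encode(10) ends in the byte 0x0A, which is '\n'.
    List<byte[]> records = Arrays.asList(encode(1), encode(10), encode(3));
    System.out.println("newline split:   "
        + splitOnNewline(writeNewlineDelimited(records)).size() + " pieces");
    System.out.println("length prefixed: "
        + readLengthPrefixed(writeLengthPrefixed(records)).size() + " records");
  }
}
```

With the three values 1, 10 and 3, the newline split returns four fragments because the encoding of 10 itself contains the byte 0x0A, while the length-prefixed round trip returns the three original records. A custom source for binary records would need some framing of this kind rather than a delimiter byte.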
The proper way to address your use case is to write a custom FileBasedSource.

On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <[email protected]> wrote:

> The Javadoc of TextIO states:
>
> * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
> * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
> * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
> * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
>
> However, as I stated, `withCoder` doesn't seem to have tests, and probably
> won't work given the hard-coded '\n' delimiter.
>
> On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <[email protected]> wrote:
>
> > Hi Aviem,
> >
> > TextIO is not designed to write/read binary file: it's pure Text, so
> > String.
> >
> > Regards
> > JB
> >
> > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > Hi,
> > >
> > > While trying to use TextIO to write/read a binary file rather than String
> > > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > > seems to be hardcoded '\n'.
> > > See `findSeparatorBounds` -
> > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > >
> > > The use case is to have a file of objects, encoded into bytes using a
> > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > where they use a more appropriate delimiter, to avoid such issues.
> > >
> > > I did not find any unit tests which use TextIO to read anything other than
> > > Strings.
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > [email protected]
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
