P.S. This point about coders is also covered in the PTransform Style Guide, which is currently under review at https://github.com/apache/beam-site/pull/134 and staged at http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
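
To make the delimiter problem from the thread below concrete, here is a minimal Python sketch. It does not use Beam at all: `encode_int` and `decode_int` are hypothetical stand-ins for a coder (a fixed 4-byte big-endian integer encoding), and the reader/writer helpers are illustrative names, not TextIO or FileBasedSource APIs. It shows that encoded bytes can contain '\n' and so break line-based splitting, and that length-prefixed framing (one common alternative, not necessarily what Beam or Spark use) round-trips the same records.

```python
import io
import struct


def encode_int(value):
    # Hypothetical stand-in for a coder: 4-byte big-endian signed integer.
    return struct.pack(">i", value)


def decode_int(data):
    return struct.unpack(">i", data)[0]


def write_newline_delimited(records):
    # Terminate every record with '\n', the way a line-oriented sink would.
    return b"".join(record + b"\n" for record in records)


def read_newline_delimited(blob):
    # Split on '\n' and drop the empty chunk after the final delimiter.
    return blob.split(b"\n")[:-1]


def write_length_prefixed(records):
    # Frame each record as a 4-byte big-endian length followed by the payload.
    out = io.BytesIO()
    for record in records:
        out.write(struct.pack(">I", len(record)))
        out.write(record)
    return out.getvalue()


def read_length_prefixed(blob):
    stream = io.BytesIO(blob)
    records = []
    while True:
        header = stream.read(4)
        if not header:
            break  # clean end of input
        if len(header) < 4:
            raise ValueError("truncated length header")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated record payload")
        records.append(payload)
    return records


values = [1, 10, 300]  # 10 encodes to bytes ending in 0x0A, i.e. '\n'
records = [encode_int(v) for v in values]

# Line-based splitting cuts the record for 10 at its embedded '\n'.
broken = read_newline_delimited(write_newline_delimited(records))

# Length-prefixed framing never inspects payload bytes, so it round-trips.
restored = read_length_prefixed(write_length_prefixed(records))
```

In this sketch `broken` no longer matches `records`, while `restored` decodes back to the original values. A custom FileBasedSource would need some framing of this kind instead of a '\n' delimiter.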
On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <[email protected]> wrote:

> +1 to what Eugene said.
>
> I've seen a number of Python SDK users incorrectly assuming that
> coder.decode() is needed when developing their own file-based sources
> (since many users usually refer to text source first). Probably coder
> parameter should not be configurable for text source/sink and they should
> be updated to only read/write UTF-8 encoded strings.
>
> - Cham
>
> On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov <[email protected]> wrote:
>
> > The use of Coder in TextIO is a long standing design issue because coders
> > are not intended to be used for general purpose converting things from and
> > to bytes, their only proper use is letting the runner materialize and
> > restore objects if the runner thinks it's necessary. IMO it should have
> > been called LineIO, document that it reads lines of text as String, and not
> > have a withCoder parameter at all.
> >
> > The proper way to address your use case is to write a custom
> > FileBasedSource.
> >
> > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <[email protected]> wrote:
> >
> > > The Javadoc of TextIO states:
> > >
> > > * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
> > > * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
> > > * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
> > > * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
> > >
> > > However, as I stated, `withCoder` doesn't seem to have tests, and probably
> > > won't work given the hard-coded '\n' delimiter.
> > >
> > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <[email protected]> wrote:
> > >
> > > > Hi Aviem,
> > > >
> > > > TextIO is not designed to write/read binary file: it's pure Text, so
> > > > String.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > Hi,
> > > > >
> > > > > While trying to use TextIO to write/read a binary file rather than String
> > > > > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > > > > seems to be hardcoded '\n'.
> > > > > See `findSeparatorBounds` -
> > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > >
> > > > > The use case is to have a file of objects, encoded into bytes using a
> > > > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > where they use a more appropriate delimiter, to avoid such issues.
> > > > >
> > > > > I did not find any unit tests which use TextIO to read anything other than
> > > > > Strings.
> > > > >
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > [email protected]
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
