Looks like Eugene addressed this in the following ticket:
https://issues.apache.org/jira/browse/BEAM-1354
I just added a bullet about updating the Javadoc.

On Tue, Jan 31, 2017 at 12:47 PM Aviem Zur <aviem...@gmail.com> wrote:

> So if I understand correctly, the general agreement is that TextIO should
> not support anything but lines from files as strings.
> I'll go ahead and file a ticket saying that the Javadoc should be changed to
> reflect this and that the `withCoder` method should be removed.
>
> Is there merit in Beam supplying an IO that does allow writing objects
> to a file using Beam coders and Beam FS (to write these files to
> GS/Hadoop/local)?
>
> On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> <kirpic...@google.com.invalid> wrote:
>
> P.S. Note that this point (about coders) is also mentioned in the
> now-being-reviewed PTransform Style Guide
> https://github.com/apache/beam-site/pull/134
> currently staged at
>
> http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
>
>
> On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <chamik...@apache.org>
> wrote:
>
> > +1 to what Eugene said.
> >
> > I've seen a number of Python SDK users incorrectly assume that
> > coder.decode() is needed when developing their own file-based sources
> > (since many users refer to the text source first). The coder parameter
> > probably should not be configurable for the text source/sink, and they
> > should be updated to only read/write UTF-8 encoded strings.
> >
> > - Cham
> >
> > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > <kirpic...@google.com.invalid> wrote:
> >
> > > > The use of Coder in TextIO is a long-standing design issue, because
> > > > coders are not intended for general-purpose conversion of things to
> > > > and from bytes. Their only proper use is letting the runner
> > > > materialize and restore objects if the runner thinks it's necessary.
> > > > IMO it should have been called LineIO, documented that it reads lines
> > > > of text as String, and not had a withCoder parameter at all.
> > >
> > > > The proper way to address your use case is to write a custom
> > > > FileBasedSource.
> > > >
> > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <aviem...@gmail.com> wrote:
> > >
> > > > The Javadoc of TextIO states:
> > > >
> > > > >  * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
> > > > >  * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
> > > > >  * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
> > > >
> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > > > > probably won't work given the hard-coded '\n' delimiter.
> > > >
> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > > > > wrote:
> > > >
> > > > > Hi Aviem,
> > > > >
> > > > > > TextIO is not designed to write/read binary files: it's pure text,
> > > > > > so String.
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > > Hi,
> > > > > >
> > > > > > > While trying to use TextIO to write/read a binary file, rather
> > > > > > > than String lines from a text file, I ran into an issue: the
> > > > > > > delimiter TextIO uses seems to be hardcoded to '\n'.
> > > > > > > See `findSeparatorBounds`:
> > > > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > >
> > > > > > > The use case is to have a file of objects, encoded into bytes
> > > > > > > using a coder. However, '\n' is not a good delimiter here, since
> > > > > > > the encoded bytes of an object can themselves contain '\n'.
> > > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > > > where they use a more appropriate delimiter to avoid such issues.
> > > > > >
> > > > > > > I did not find any unit tests that use TextIO to read anything
> > > > > > > other than Strings.
> > > > > >
> > > > >
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > jbono...@apache.org
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > >
> >
>
>
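
To illustrate the delimiter problem raised in the original message, here is a minimal sketch in plain Python (no Beam APIs involved). It compares splitting encoded records on '\n' with a length-prefixed framing. The helper names `write_records` and `read_records`, and the 4-byte big-endian length header, are assumptions chosen for illustration; they are not what TextIO, FileBasedSource, or Spark actually do.

```python
import io
import struct


def write_records(stream, records):
    """Write each payload preceded by a 4-byte big-endian length (hypothetical framing)."""
    for payload in records:
        stream.write(struct.pack(">I", len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield payloads written by write_records, using the length header instead of a delimiter."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length header")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


# Encoded objects are arbitrary bytes, so some of them contain b"\n".
records = [b"\x00\x0a\x01", b"plain", b"\n\n"]

# Newline-delimited file: splitting on b"\n" cuts records apart.
newline_file = b"\n".join(records) + b"\n"
newline_split = newline_file.split(b"\n")[:-1]

# Length-prefixed file: record boundaries do not depend on payload content.
buffer = io.BytesIO()
write_records(buffer, records)
buffer.seek(0)
framed = list(read_records(buffer))
```

With these inputs, the newline-based split does not return the original records, while the length-prefixed read does. A custom file-based source along the lines Eugene suggests would need some framing of this kind, whatever concrete format is chosen.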
