+1 to what Eugene said.

I've seen a number of Python SDK users incorrectly assume that
coder.decode() is needed when developing their own file-based sources
(since many users refer to the text source first). The coder parameter
should probably not be configurable for the text source/sink, and both
should be updated to only read/write UTF-8 encoded strings.
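
To illustrate the problem from Aviem's original message, here is a minimal
sketch in plain Python (no Beam APIs involved) of why newline-delimited
storage breaks for coder-encoded bytes, and how length-prefixed framing
avoids it. The helper names write_records and read_records and the 4-byte
big-endian length header are assumptions made up for this example, not
taken from Beam or Spark; a custom file-based source could use some framing
along these lines.

```python
import io
import struct


def write_records(stream, records):
    """Write each bytes record prefixed by its 4-byte big-endian length."""
    for record in records:
        stream.write(struct.pack(">I", len(record)))
        stream.write(record)


def read_records(stream):
    """Yield records written by write_records until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length header")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


# Encoded objects may legitimately contain the newline byte 0x0A.
records = [b"\x00\x0a\x01", b"plain", b"\n\n"]

# Length-prefixed framing round-trips the records unchanged.
buf = io.BytesIO()
write_records(buf, records)
buf.seek(0)
framed = list(read_records(buf))

# Newline-delimited storage splits the same records in the wrong places.
naive = b"\n".join(records).split(b"\n")
```

Here framed equals the original list, while naive does not, because the
reader cannot tell a delimiter from a payload byte.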

- Cham

On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:

> The use of Coder in TextIO is a long-standing design issue, because coders
> are not intended for general-purpose conversion of things to and from
> bytes; their only proper use is letting the runner materialize and
> restore objects if the runner thinks it's necessary. IMO it should have
> been called LineIO, documented that it reads lines of text as String, and
> not had a withCoder parameter at all.
>
> The proper way to address your use case is to write a custom
> FileBasedSource.
>
> On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <aviem...@gmail.com> wrote:
>
> > The Javadoc of TextIO states:
> >
> >  * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
> >  * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
> >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
> >  * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
> >
> > However, as I stated, `withCoder` doesn't seem to have tests, and
> > probably won't work given the hard-coded '\n' delimiter.
> >
> > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> > > Hi Aviem,
> > >
> > > TextIO is not designed to write/read binary files: it's pure text, so
> > > String.
> > >
> > > Regards
> > > JB
> > >
> > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > Hi,
> > > >
> > > > While trying to use TextIO to write/read a binary file rather than String
> > > > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > > > seems to be hardcoded '\n'.
> > > > See `findSeparatorBounds` -
> > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > >
> > > > The use case is to have a file of objects, encoded into bytes using a
> > > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > where they use a more appropriate delimiter, to avoid such issues.
> > > >
> > > > I did not find any unit tests which use TextIO to read anything other
> > > > than Strings.
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>
