Could you clarify why it would be useful to write objects to files using
Beam coders, as opposed to just using e.g. AvroIO?

Coders (should) make no promise as to what their wire format is, so such
files could be read back only by other Beam pipelines using the same IO.

On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <[email protected]> wrote:

> So if I understand correctly, the general agreement is that TextIO should
> not support anything but reading lines from files as strings.
> I'll go ahead and file a ticket saying that the Javadoc should be changed to
> reflect this and that the `withCoder` method should be removed.
>
> Is there merit in Beam supplying an IO that does allow writing objects to
> a file using Beam coders and Beam FS (to write these files to
> GS/Hadoop/local)?
>
> On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> <[email protected]> wrote:
>
> P.S. Note that this point (about coders) is also mentioned in the
> now-being-reviewed PTransform Style Guide
> https://github.com/apache/beam-site/pull/134
> currently staged at
>
> http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
>
>
> On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <[email protected]>
> wrote:
>
> > +1 to what Eugene said.
> >
> > I've seen a number of Python SDK users incorrectly assume that
> > coder.decode() is needed when developing their own file-based sources
> > (since many users refer to the text source first). The coder parameter
> > should probably not be configurable for the text source/sink, and they
> > should be updated to only read/write UTF-8 encoded strings.
> >
> > - Cham
> >
> > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > <[email protected]> wrote:
> >
> > > The use of Coder in TextIO is a long-standing design issue, because
> > > coders are not intended for general-purpose conversion of things to and
> > > from bytes; their only proper use is letting the runner materialize and
> > > restore objects if the runner thinks it's necessary. IMO it should have
> > > been called LineIO, documented that it reads lines of text as String, and
> > > not had a withCoder parameter at all.
> > >
> > > The proper way to address your use case is to write a custom
> > > FileBasedSource.
> > >
> > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <[email protected]> wrote:
> > >
> > > > The Javadoc of TextIO states:
> > > >
> > > >  * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
> > > >  * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
> > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
> > > >  * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
> > > >
> > > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > > > probably won't work given the hard-coded '\n' delimiter.
> > > >
> > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Aviem,
> > > > >
> > > > > TextIO is not designed to write/read binary files: it's pure text,
> > > > > so String.
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > > Hi,
> > > > > >
> > > > > > > While trying to use TextIO to write/read a binary file rather
> > > > > > > than String lines from a textual file, I ran into an issue: the
> > > > > > > delimiter TextIO uses seems to be hardcoded to '\n'.
> > > > > > > See `findSeparatorBounds`:
> > > > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > >
> > > > > > > The use case is to have a file of objects, encoded into bytes
> > > > > > > using a coder. However, '\n' is not a good delimiter here, as you
> > > > > > > can imagine.
> > > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > > > where they use a more appropriate delimiter, to avoid such issues.
> > > > > >
> > > > > > > I did not find any unit tests which use TextIO to read anything
> > > > > > > other than Strings.
> > > > > >
> > > > >
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > [email protected]
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > >
> >
>
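
For anyone following the delimiter point from the start of the thread, here is a minimal sketch in plain Java of why '\n' is a poor record separator for binary-encoded objects, and how length-prefixed framing avoids the problem. This is not Beam code and not how TextIO or any Beam coder is implemented: the class `RecordFraming`, all of its method names, and the 4-byte big-endian int encoding (standing in for "some coder") are assumptions made up for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Beam.
final class RecordFraming {

    // Stand-in for a coder: encodes an int as 4 big-endian bytes.
    static byte[] encode(int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeInt(value);
        return bytes.toByteArray();
    }

    // Writes each encoded record followed by '\n', as a line-oriented sink would.
    static byte[] writeNewlineDelimited(int[] values) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            out.write(encode(v));
            out.write('\n');
        }
        return out.toByteArray();
    }

    // Splits on every '\n' byte, as a line-oriented source would.
    static List<byte[]> splitOnNewline(byte[] data) {
        List<byte[]> records = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        for (byte b : data) {
            if (b == '\n') {
                records.add(current.toByteArray());
                current.reset();
            } else {
                current.write(b);
            }
        }
        if (current.size() > 0) {
            records.add(current.toByteArray());
        }
        return records;
    }

    // Alternative framing: each record is preceded by its length in bytes.
    static byte[] writeLengthPrefixed(int[] values) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        for (int v : values) {
            byte[] record = encode(v);
            dataOut.writeInt(record.length);
            dataOut.write(record);
        }
        dataOut.flush();
        return out.toByteArray();
    }

    // Reads length-prefixed records back and decodes each one.
    static List<Integer> readLengthPrefixed(byte[] data) throws IOException {
        List<Integer> values = new ArrayList<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        while (in.available() > 0) {
            byte[] record = new byte[in.readInt()];
            in.readFully(record);
            values.add(new DataInputStream(new ByteArrayInputStream(record)).readInt());
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        int[] values = {1, 10, 3};
        // The encoding of 10 contains the byte 0x0A ('\n'), so three records
        // are split into four pieces.
        System.out.println(splitOnNewline(writeNewlineDelimited(values)).size());
        // Length-prefixed framing round-trips the same values intact.
        System.out.println(readLengthPrefixed(writeLengthPrefixed(values)));
    }
}
```

The value 10 is chosen deliberately: its encoded form contains the delimiter byte, so the newline-based reader cuts the record in two. A real file format would also need to handle splitting a file across workers, which this sketch ignores.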
