I agree that these files will serve no use outside of Beam pipelines. The rationale was that you might want one pipeline to write output to files and a different pipeline to use those files as input.
Say one team in your organization creates a pipeline, and a different team uses those files as input to a different pipeline. The contract between them is the file, in a Beam-readable format. This is similar to Spark's `saveAsObjectFile`:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
The merit of something like this, in my eyes, is not burdening the user with writing a custom IO.

On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov <kirpic...@google.com.invalid> wrote:

> +1 to Robert. Either this will be a Beam-specific file format (and then
> nothing except Beam will be able to read it - which I doubt is what you
> want), or it is an existing, well-known file format, and then we should
> just develop an IO for it.
>
> Note that any file format that involves encoding elements with a Coder
> is Beam-specific, because the wire format of coders is Beam-specific.

On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw <rober...@google.com.invalid> wrote:

> On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <aviem...@gmail.com> wrote:
> > +1 on what Stas said.
> > I think there is value in not having the user write a custom IO for a
> > protocol they use which is not covered by Beam IOs. Plus, having them
> > deal with not only the encoding but also the IO part is not ideal.
> > I think having a basic FileIO that can write to the filesystems
> > supported by Beam (GS/HDFS/Local/...), and which you can use with any
> > coder, including your own custom coder, could be beneficial.
>
> What would the format of the file be? Just the concatenation of the
> elements encoded according to the coder? Or is a delimiter needed to
> separate records? In that case, how does one ensure the delimiter does
> not also appear in the middle of an encoded element? At this point
> you're developing a file format, and might as well stick with one of
> the standard ones.
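Robert's delimiter-collision concern is easy to demonstrate concretely. The sketch below is a hypothetical illustration in plain Java (not Beam code; `encodeInt` and `containsByte` are made-up helper names): the int 10 encoded big-endian, as `DataOutputStream.writeInt` produces, ends in the byte 0x0A, which is exactly `'\n'`, so a newline-delimited concatenation of encoded elements would split this single record in two.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration of the delimiter problem: a record
// separator such as '\n' (0x0A) can also appear inside an encoded
// element, so "concatenate elements, separated by newlines" is not
// a safe file format for arbitrary binary encodings.
public class DelimiterCollision {

    // Big-endian int encoding, as DataOutputStream.writeInt produces.
    static byte[] encodeInt(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeInt(value);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static boolean containsByte(byte[] data, byte target) {
        for (byte b : data) {
            if (b == target) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The int 10 encodes to 00 00 00 0A -- its last byte is '\n',
        // so a line-splitting reader would cut this record in half.
        System.out.println(containsByte(encodeInt(10), (byte) '\n'));
    }
}
```

Any fixed single-byte delimiter has the same problem for binary payloads, which is why the discussion turns to framed formats below.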
> https://xkcd.com/927

On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <stasle...@gmail.com> wrote:

> I believe the motivation is to have an abstraction that allows one to
> write stuff to a file in a way that is agnostic to the coder.
> If one needs to write a non-Avro protocol to a file, and this
> particular protocol does not meet the assumptions made by TextIO, one
> might need to duplicate the file-IO-related code from AvroIO.

On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov <kirpic...@google.com.invalid> wrote:

> Could you clarify why it would be useful to write objects to files
> using Beam coders, as opposed to just using e.g. AvroIO?
>
> Coders (should) make no promise as to what their wire format is, so
> such files could be read back only by other Beam pipelines using the
> same IO.

On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <aviem...@gmail.com> wrote:

> So if I understand correctly, the general agreement is that TextIO
> should not support anything but lines from files as strings.
> I'll go ahead and file a ticket saying that the Javadoc should be
> changed to reflect this and that the `withCoder` method should be
> removed.
>
> Is there merit in Beam supplying an IO which does allow writing
> objects to a file using Beam coders and the Beam filesystems (to
> write these files to GS/Hadoop/Local)?

On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov <kirpic...@google.com.invalid> wrote:

> P.S. Note that this point (about coders) is also mentioned in the
> now-being-reviewed PTransform Style Guide
> https://github.com/apache/beam-site/pull/134
> currently staged at
> http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders

On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <chamik...@apache.org> wrote:

> +1 to what Eugene said.
>
> I've seen a number of Python SDK users incorrectly assume that
> coder.decode() is needed when developing their own file-based sources
> (since many users refer to the text source first). Probably the coder
> parameter should not be configurable for the text source/sink, and
> they should be updated to only read/write UTF-8 encoded strings.
>
> - Cham

On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov <kirpic...@google.com.invalid> wrote:

> The use of Coder in TextIO is a long-standing design issue, because
> coders are not intended to be used for general-purpose conversion of
> things to and from bytes; their only proper use is letting the runner
> materialize and restore objects if the runner thinks it's necessary.
> IMO it should have been called LineIO, documented as reading lines of
> text as String, and it should not have a withCoder parameter at all.
>
> The proper way to address your use case is to write a custom
> FileBasedSource.
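For reference, the standard way a custom file format sidesteps the delimiter problem entirely is length-prefix framing, which is roughly what SequenceFile-style container formats do. Below is a minimal sketch in plain Java under that assumption (hypothetical, not a Beam API; `writeRecords` and `readRecords` are made-up names): each record is written as a 4-byte length followed by its payload, so record boundaries stay unambiguous no matter which bytes the payload contains.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of delimiter-free record framing: each record
// is stored as [4-byte length][payload], so a reader never has to
// guess where one record ends and the next begins.
public class LengthPrefixedRecords {

    static byte[] writeRecords(List<byte[]> records) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (byte[] record : records) {
                out.writeInt(record.length); // length prefix, not a delimiter
                out.write(record);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<byte[]> readRecords(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            List<byte[]> records = new ArrayList<>();
            while (in.available() > 0) {
                byte[] record = new byte[in.readInt()];
                in.readFully(record);
                records.add(record);
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Payloads containing '\n' (0x0A) round-trip intact.
        List<byte[]> original = Arrays.asList(
            "a\nb".getBytes(StandardCharsets.UTF_8), new byte[] {0, 10, 0});
        List<byte[]> roundTripped = readRecords(writeRecords(original));
        System.out.println(roundTripped.size()); // prints 2
    }
}
```

This is the kind of framing a custom FileBasedSource for coder-encoded records would need; it also shows why such a file is only readable by software that knows both the framing and the coder's wire format.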
On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <aviem...@gmail.com> wrote:

> The Javadoc of TextIO states:
>
>   * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of
>   * {@link String Strings}, each corresponding to one line of an input
>   * UTF-8 text file. To convert directly from the raw bytes (split into
>   * lines delimited by '\n', '\r', or '\r\n') to another object of type
>   * {@code T}, supply a {@code Coder<T>} using
>   * {@link TextIO.Read#withCoder(Coder)}.
>
> However, as I stated, `withCoder` doesn't seem to have tests, and it
> probably won't work, given the hard-coded '\n' delimiter.

On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Aviem,
>
> TextIO is not designed to write/read binary files: it's pure text, so
> String.
>
> Regards
> JB
>
> On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > Hi,
> >
> > While trying to use TextIO to write/read a binary file, rather than
> > String lines from a textual file, I ran into an issue: the delimiter
> > TextIO uses seems to be hard-coded to '\n'. See `findSeparatorBounds`:
> > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> >
> > The use case is to have a file of objects, encoded into bytes using a
> > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > A similar pattern is found in Spark's `saveAsObjectFile`:
> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > where they use a more appropriate delimiter to avoid such issues.
> >
> > I did not find any unit tests which use TextIO to read anything other
> > than Strings.
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
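On the `saveAsObjectFile` comparison: that RDD method stores Java-serialized objects inside a SequenceFile container, so the payload is itself a JVM-specific wire format, readable back only by a compatible Java deserializer, much as coder-encoded files would be readable only by Beam. A minimal sketch of that serialization round trip in plain Java (a hypothetical illustration, not Spark code; `serialize` and `deserialize` are made-up helper names):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical sketch: a Java serialization round trip, the kind of
// encoding saveAsObjectFile stores. The byte stream starts with the
// JVM serialization magic 0xACED and can only be decoded by a
// compatible Java deserializer -- analogous to files written with a
// Beam coder being readable only by Beam pipelines.
public class ObjectFileSketch {

    static byte[] serialize(Serializable obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new ArrayList<>(Arrays.asList(1, 2, 3)));
        // JVM-specific stream header: 0xAC 0xED.
        System.out.println((data[0] & 0xFF) == 0xAC && (data[1] & 0xFF) == 0xED);
        System.out.println(deserialize(data).equals(Arrays.asList(1, 2, 3)));
    }
}
```

This supports the point made earlier in the thread: the "contract file" between two teams is only useful if both sides run compatible framework code, which argues for an established, self-describing format over raw coder output.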