Re: TextIO binary file

Robert Bradshaw Mon, 06 Feb 2017 09:53:33 -0800

I agree documentation should be better around this.

If we decided to go down the route of adding FileObjectIO, I would suggest
making it a PTransform that wraps an AvroIO rather than a new file format
(unless it can be demonstrated that there are significant performance
concerns). In this case on simply need s T <-> POJO, not T <-> bytes[],
though bytes[] is (essentially) POJO so that could be supported as well.


On Mon, Feb 6, 2017 at 5:37 AM, Aviem Zur <[email protected]> wrote:

> I believe these are actually several different use cases with different
> paths (Some do not exist today):
>
> User wants to output results of a pipeline to be used in a different
> pipeline:
> (This is indeed a use case in my organization. Also, Spark sees this as a
> use case as well with its `saveAsObjectFile`)
>
>    1. User's objects are Avro-serializable.
>    1. [Today] User figures out they can use AvroIO - not very user
>       friendly, user has to be a little more savvy.
>          1. [Suggestion #1] Improve documentation for this use case.
>          2. [Suggestion #2] User uses a more user-friendly FileObjectIO.
>             1. FileObjectIO, seeing the objects are
>             Avro-serializable, delegates to AvroIO under the hood.
>          2. User's objects are not necessarily Avro-serializable.
>    1. [Today] User writes a custom IO, from scratch, including lower level
>       IO operations and tests.
>       2. [Suggestion #3] User uses a more user-friendly FileObjectIO and
>       provides the proper coder.
>          1. User uses an existing coder, compatible with their object, for
>          example - SerializableCoder for serializable objects.
>          2. No existing coder is compatible with their object, user writes
>          a custom coder.
>          3. * Regarding writing several objects to one stream - this issue
>          is already acknowledged and addressed in
> `IterableLikeCoder`s. Some coders
>          do not require delimiters (Like AvroCoder), others, like
> ProtoCoder already
>          handle writing with delimiters. So it seems that the implicit
> contract for
>          coders, already handles this. I did not find the appropriate
> javadoc that
>          declares this contract.
>
>
> On Mon, Feb 6, 2017 at 12:13 AM Eugene Kirpichov
> <[email protected]> wrote:
>
> > Hmm, do you have a concrete use case in mind, where all these
> circumstances
> > come together?
> > - There is a need to write data from one Beam pipeline, and read it from
> > another Beam pipeline
> > - These pipelines have to be kept separate, rather than merged into one
> > pipeline
> > - It is okay that the data can not be parsed by anything except a Beam
> > pipeline using a compatible SDK version
> > - The objects to be serialized are not POJOs
> > - The user has chosen and was sufficiently savvy to develop a Coder for
> > these objects (as opposed to representing them as POJOs and using AvroIO,
> > which I think requires a rather less savvy user)
> >
> > It just seems like a really exotic set of circumstances to me. If the
> > problem is that people don't realize that the easiest way to serialize
> > their data is make it a POJO and use AvroIO, we can solve by improving
> > documentation about coders.
> >
> > On Sun, Feb 5, 2017 at 1:41 PM Aviem Zur <[email protected]> wrote:
> >
> > > AvroIO would is great for POJOs. But for use cases with more complex,
> > > serializable objects, or objects which are compatible with some coder
> it
> > > falls short.
> > >
> > > Also, for less savvy users to know they need to use AvroIO might be a
> > > stretch.
> > > Some simpler API along the the lines of ObjectFile might be more user
> > > friendly (even if for optimization it uses avro under the hood for
> > POJOs).
> > >
> > > On Sun, Feb 5, 2017, 22:00 Eugene Kirpichov
> <[email protected]
> > >
> > > wrote:
> > >
> > > > OK, I see what you mean; however I still think this can be solved
> > without
> > > > introducing a new "Beam object file" (or whatever) file format, and
> > > without
> > > > thereby introducing additional use cases and compatibility
> constraints
> > on
> > > > coders.
> > > >
> > > > I asked before in the thread why not just use AvroIO (it can
> serialize
> > > > arbitrary POJOs using reflection); I skimmed the thread it doesn't
> seem
> > > > like that got answered properly. I also like Dan's suggestion to use
> > > AvroIO
> > > > to serialize byte[] arrays and you can do whatever you want with them
> > > (e.g.
> > > > use another serialization library, say, Kryo, or Java serialization,
> > > etc.)
> > > >
> > > > On Sun, Feb 5, 2017 at 11:37 AM Aviem Zur <[email protected]>
> wrote:
> > > >
> > > > > I agree that these files will serve no use outside of Beam
> pipelines.
> > > > >
> > > > > The rationale was that you might want to have one pipeline write
> > output
> > > > to
> > > > > files and then have a different pipeline that uses those files as
> > > inputs.
> > > > >
> > > > > Say one team in your organization creates a pipeline and a
> different
> > > team
> > > > > utilizes those files as input for a different pipeline. The
> contract
> > > > > between them is the file, in a Beam-readable format.
> > > > > This is similar to Spark's `saveAsObjectFile`
> > > https://github.com/apache/
> > > > >
> > > >
> > >
> > spark/blob/master/core/src/main/scala/org/apache/spark/
> rdd/RDD.scala#L1512
> > > > > <
> > > >
> > >
> > https://github.com/apache/spark/blob/master/core/src/
> main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > >
> > > > >
> > > > > The merit for something like this in my eyes is to not burden the
> > user
> > > > with
> > > > > writing a custom IO
> > > > >
> > > > > On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
> > > > > <[email protected]> wrote:
> > > > >
> > > > > +1 to Robert. Either this will be a Beam-specific file format (and
> > then
> > > > > nothing except Beam will be able to read it - which I doubt is what
> > you
> > > > > want), or it is an existing well-known file format and then we
> should
> > > > just
> > > > > develop an IO for it.
> > > > > Note that any file format that involves encoding elements with a
> > Coder
> > > is
> > > > > Beam-specific, because wire format of coders is Beam-specific.
> > > > >
> > > > > On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <[email protected]>
> > > > wrote:
> > > > > > > +1 on what Stas said.
> > > > > > > I think there is value in not having the user write a custom IO
> > > for a
> > > > > > > protocol they use which is not covered by Beam IOs. Plus having
> > > them
> > > > > deal
> > > > > > > with not only the encoding but also the IO part is not ideal.
> > > > > > > I think having a basic FileIO that can write to the Filesystems
> > > > > supported
> > > > > > > by Beam (GS/HDFS/Local/...) which you can use any coder with,
> > > > including
> > > > > > > your own custom coder, can be beneficial.
> > > > > >
> > > > > > What would the format of the file be? Just the concatenation of
> the
> > > > > > elements encoded according to the coder? Or is there a delimiter
> > > > > > needed to separate records. In which case how does one ensure the
> > > > > > delimiter does not also appear in the middle of an encoded
> element?
> > > At
> > > > > > this point you're developing a file format, and might as well
> stick
> > > > > > with one of the standard ones. https://xkcd.com/927
> > > > > >
> > > > > > > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <
> [email protected]>
> > > > > wrote:
> > > > > > >
> > > > > > > I believe the motivation is to have an abstraction that allows
> > one
> > > to
> > > > > > write
> > > > > > > stuff to a file in a way that is agnostic to the coder.
> > > > > > > If one needs to write a non-Avro protocol to a file, and this
> > > > > particular
> > > > > > > protocol does not meet the assumption made by TextIO, one might
> > > need
> > > > to
> > > > > > > duplicate the file IO related code from AvroIO.
> > > > > > >
> > > > > > > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > >> Could you clarify why it would be useful to write objects to
> > files
> > > > > using
> > > > > > >> Beam coders, as opposed to just using e.g. AvroIO?
> > > > > > >>
> > > > > > >> Coders (should) make no promise as to what their wire format
> is,
> > > so
> > > > > such
> > > > > > >> files could be read back only by other Beam pipelines using
> the
> > > same
> > > > > IO.
> > > > > > >>
> > > > > > >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <[email protected]
> >
> > > > wrote:
> > > > > > >>
> > > > > > >> > So If I understand the general agreement is that TextIO
> should
> > > not
> > > > > > >> support
> > > > > > >> > anything but lines from files as strings.
> > > > > > >> > I'll go ahead and file a ticket that says the Javadoc should
> > be
> > > > > > changed
> > > > > > >> to
> > > > > > >> > reflect this and `withCoder` method should be removed.
> > > > > > >> >
> > > > > > >> > Is there merit for Beam to supply an IO which does allow
> > writing
> > > > > > objects
> > > > > > >> to
> > > > > > >> > a file using Beam coders and Beam FS (To write these files
> to
> > > > > > >> > GS/Hadoop/Local)?
> > > > > > >> >
> > > > > > >> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> > > > > > >> > <[email protected]> wrote:
> > > > > > >> >
> > > > > > >> > P.S. Note that this point (about coders) is also mentioned
> in
> > > the
> > > > > > >> > now-being-reviewed PTransform Style Guide
> > > > > > >> > https://github.com/apache/beam-site/pull/134
> > > > > > >> > currently staged at
> > > > > > >> >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > http://apache-beam-website-pull-requests.storage.
> googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <
> > > > > > [email protected]
> > > > > > >> >
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > +1 to what Eugene said.
> > > > > > >> > >
> > > > > > >> > > I've seen a number of Python SDK users incorrectly
> assuming
> > > that
> > > > > > >> > > coder.decode() is needed when developing their own
> > file-based
> > > > > > sources
> > > > > > >> > > (since many users usually refer to text source first).
> > > Probably
> > > > > > coder
> > > > > > >> > > parameter should not be configurable for text source/sink
> > and
> > > > they
> > > > > > >> should
> > > > > > >> > > be updated to only read/write UTF-8 encoded strings.
> > > > > > >> > >
> > > > > > >> > > - Cham
> > > > > > >> > >
> > > > > > >> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > > > > > >> > > <[email protected]> wrote:
> > > > > > >> > >
> > > > > > >> > > > The use of Coder in TextIO is a long standing design
> issue
> > > > > because
> > > > > > >> > coders
> > > > > > >> > > > are not intended to be used for general purpose
> converting
> > > > > things
> > > > > > >> from
> > > > > > >> > > and
> > > > > > >> > > > to bytes, their only proper use is letting the runner
> > > > > materialize
> > > > > > > and
> > > > > > >> > > > restore objects if the runner thinks it's necessary. IMO
> > it
> > > > > should
> > > > > > >> have
> > > > > > >> > > > been called LineIO, document that it reads lines of text
> > as
> > > > > > String,
> > > > > > >> and
> > > > > > >> > > not
> > > > > > >> > > > have a withCoder parameter at all.
> > > > > > >> > > >
> > > > > > >> > > > The proper way to address your use case is to write a
> > custom
> > > > > > >> > > > FileBasedSource.
> > > > > > >> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <
> > > [email protected]
> > > > >
> > > > > > >> wrote:
> > > > > > >> > > >
> > > > > > >> > > > > The Javadoc of TextIO states:
> > > > > > >> > > > >
> > > > > > >> > > > > * <p>By default, {@link TextIO.Read} returns a {@link
> > > > > > PCollection}
> > > > > > >> of
> > > > > > >> > > > > {@link String Strings},
> > > > > > >> > > > >  * each corresponding to one line of an input UTF-8
> text
> > > > file.
> > > > > > To
> > > > > > >> > > convert
> > > > > > >> > > > > directly from the raw
> > > > > > >> > > > >  * bytes (split into lines delimited by '\n', '\r', or
> > > > '\r\n')
> > > > > > to
> > > > > > >> > > another
> > > > > > >> > > > > object of type {@code T},
> > > > > > >> > > > >  * supply a {@code Coder<T>} using {@link
> > > > > > >> > > TextIO.Read#withCoder(Coder)}.
> > > > > > >> > > > >
> > > > > > >> > > > > However, as I stated, `withCoder` doesn't seem to have
> > > > tests,
> > > > > > and
> > > > > > >> > > > probably
> > > > > > >> > > > > won't work given the hard-coded '\n' delimiter.
> > > > > > >> > > > >
> > > > > > >> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré
> <
> > > > > > >> > [email protected]
> > > > > > >> > > >
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > Hi Aviem,
> > > > > > >> > > > > >
> > > > > > >> > > > > > TextIO is not designed to write/read binary file:
> it's
> > > > pure
> > > > > > > Text,
> > > > > > >> > so
> > > > > > >> > > > > > String.
> > > > > > >> > > > > >
> > > > > > >> > > > > > Regards
> > > > > > >> > > > > > JB
> > > > > > >> > > > > >
> > > > > > >> > > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > > >> > > > > > > Hi,
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > While trying to use TextIO to write/read a binary
> > file
> > > > > > rather
> > > > > > >> > than
> > > > > > >> > > > > String
> > > > > > >> > > > > > > lines from a textual file I ran into an issue -
> the
> > > > > > delimiter
> > > > > > >> > > TextIO
> > > > > > >> > > > > uses
> > > > > > >> > > > > > > seems to be hardcoded '\n'.
> > > > > > >> > > > > > > See `findSeparatorBounds` -
> > > > > > >> > > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > https://github.com/apache/beam/blob/master/sdks/java/
> core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > The use case is to have a file of objects, encoded
> > > into
> > > > > > bytes
> > > > > > >> > > using a
> > > > > > >> > > > > > > coder. However, '\n' is not a good delimiter here,
> > as
> > > > you
> > > > > > can
> > > > > > >> > > > imagine.
> > > > > > >> > > > > > > A similar pattern is found in Spark's
> > > `saveAsObjectFile`
> > > > > > >> > > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > https://github.com/apache/spark/blob/master/core/src/
> main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > > >> > > > > > > where
> > > > > > >> > > > > > > they use a more appropriate delimiter, to avoid
> such
> > > > > issues.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > I did not find any unit tests which use TextIO to
> > read
> > > > > > > anything
> > > > > > >> > > other
> > > > > > >> > > > > > than
> > > > > > >> > > > > > > Strings.
> > > > > > >> > > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > --
> > > > > > >> > > > > > Jean-Baptiste Onofré
> > > > > > >> > > > > > [email protected]
> > > > > > >> > > > > > http://blog.nanthrax.net
> > > > > > >> > > > > > Talend - http://www.talend.com
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: TextIO binary file

Reply via email to