I agree documentation should be better around this. If we decided to go down the route of adding FileObjectIO, I would suggest making it a PTransform that wraps an AvroIO rather than a new file format (unless it can be demonstrated that there are significant performance concerns). In this case on simply need s T <-> POJO, not T <-> bytes[], though bytes[] is (essentially) POJO so that could be supported as well.
On Mon, Feb 6, 2017 at 5:37 AM, Aviem Zur <aviem...@gmail.com> wrote: > I believe these are actually several different use cases with different > paths (Some do not exist today): > > User wants to output results of a pipeline to be used in a different > pipeline: > (This is indeed a use case in my organization. Also, Spark sees this as a > use case as well with its `saveAsObjectFile`) > > 1. User's objects are Avro-serializable. > 1. [Today] User figures out they can use AvroIO - not very user > friendly, user has to be a little more savvy. > 1. [Suggestion #1] Improve documentation for this use case. > 2. [Suggestion #2] User uses a more user-friendly FileObjectIO. > 1. FileObjectIO, seeing the objects are > Avro-serializable, delegates to AvroIO under the hood. > 2. User's objects are not necessarily Avro-serializable. > 1. [Today] User writes a custom IO, from scratch, including lower level > IO operations and tests. > 2. [Suggestion #3] User uses a more user-friendly FileObjectIO and > provides the proper coder. > 1. User uses an existing coder, compatible with their object, for > example - SerializableCoder for serializable objects. > 2. No existing coder is compatible with their object, user writes > a custom coder. > 3. * Regarding writing several objects to one stream - this issue > is already acknowledged and addressed in > `IterableLikeCoder`s. Some coders > do not require delimiters (Like AvroCoder), others, like > ProtoCoder already > handle writing with delimiters. So it seems that the implicit > contract for > coders, already handles this. I did not find the appropriate > javadoc that > declares this contract. > > > On Mon, Feb 6, 2017 at 12:13 AM Eugene Kirpichov > <kirpic...@google.com.invalid> wrote: > > > Hmm, do you have a concrete use case in mind, where all these > circumstances > > come together? > > - There is a need to write data from one Beam pipeline, and read it from > > another Beam pipeline > > - These pipelines have to be kept separate, rather than merged into one > > pipeline > > - It is okay that the data can not be parsed by anything except a Beam > > pipeline using a compatible SDK version > > - The objects to be serialized are not POJOs > > - The user has chosen and was sufficiently savvy to develop a Coder for > > these objects (as opposed to representing them as POJOs and using AvroIO, > > which I think requires a rather less savvy user) > > > > It just seems like a really exotic set of circumstances to me. If the > > problem is that people don't realize that the easiest way to serialize > > their data is make it a POJO and use AvroIO, we can solve by improving > > documentation about coders. > > > > On Sun, Feb 5, 2017 at 1:41 PM Aviem Zur <aviem...@gmail.com> wrote: > > > > > AvroIO would is great for POJOs. But for use cases with more complex, > > > serializable objects, or objects which are compatible with some coder > it > > > falls short. > > > > > > Also, for less savvy users to know they need to use AvroIO might be a > > > stretch. > > > Some simpler API along the the lines of ObjectFile might be more user > > > friendly (even if for optimization it uses avro under the hood for > > POJOs). > > > > > > On Sun, Feb 5, 2017, 22:00 Eugene Kirpichov > <kirpic...@google.com.invalid > > > > > > wrote: > > > > > > > OK, I see what you mean; however I still think this can be solved > > without > > > > introducing a new "Beam object file" (or whatever) file format, and > > > without > > > > thereby introducing additional use cases and compatibility > constraints > > on > > > > coders. > > > > > > > > I asked before in the thread why not just use AvroIO (it can > serialize > > > > arbitrary POJOs using reflection); I skimmed the thread it doesn't > seem > > > > like that got answered properly. I also like Dan's suggestion to use > > > AvroIO > > > > to serialize byte[] arrays and you can do whatever you want with them > > > (e.g. > > > > use another serialization library, say, Kryo, or Java serialization, > > > etc.) > > > > > > > > On Sun, Feb 5, 2017 at 11:37 AM Aviem Zur <aviem...@gmail.com> > wrote: > > > > > > > > > I agree that these files will serve no use outside of Beam > pipelines. > > > > > > > > > > The rationale was that you might want to have one pipeline write > > output > > > > to > > > > > files and then have a different pipeline that uses those files as > > > inputs. > > > > > > > > > > Say one team in your organization creates a pipeline and a > different > > > team > > > > > utilizes those files as input for a different pipeline. The > contract > > > > > between them is the file, in a Beam-readable format. > > > > > This is similar to Spark's `saveAsObjectFile` > > > https://github.com/apache/ > > > > > > > > > > > > > > spark/blob/master/core/src/main/scala/org/apache/spark/ > rdd/RDD.scala#L1512 > > > > > < > > > > > > > > > https://github.com/apache/spark/blob/master/core/src/ > main/scala/org/apache/spark/rdd/RDD.scala#L1512 > > > > > > > > > > > > > > > The merit for something like this in my eyes is to not burden the > > user > > > > with > > > > > writing a custom IO > > > > > > > > > > On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov > > > > > <kirpic...@google.com.invalid> wrote: > > > > > > > > > > +1 to Robert. Either this will be a Beam-specific file format (and > > then > > > > > nothing except Beam will be able to read it - which I doubt is what > > you > > > > > want), or it is an existing well-known file format and then we > should > > > > just > > > > > develop an IO for it. > > > > > Note that any file format that involves encoding elements with a > > Coder > > > is > > > > > Beam-specific, because wire format of coders is Beam-specific. > > > > > > > > > > On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw > > > > > <rober...@google.com.invalid> wrote: > > > > > > > > > > > On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <aviem...@gmail.com> > > > > wrote: > > > > > > > +1 on what Stas said. > > > > > > > I think there is value in not having the user write a custom IO > > > for a > > > > > > > protocol they use which is not covered by Beam IOs. Plus having > > > them > > > > > deal > > > > > > > with not only the encoding but also the IO part is not ideal. > > > > > > > I think having a basic FileIO that can write to the Filesystems > > > > > supported > > > > > > > by Beam (GS/HDFS/Local/...) which you can use any coder with, > > > > including > > > > > > > your own custom coder, can be beneficial. > > > > > > > > > > > > What would the format of the file be? Just the concatenation of > the > > > > > > elements encoded according to the coder? Or is there a delimiter > > > > > > needed to separate records. In which case how does one ensure the > > > > > > delimiter does not also appear in the middle of an encoded > element? > > > At > > > > > > this point you're developing a file format, and might as well > stick > > > > > > with one of the standard ones. https://xkcd.com/927 > > > > > > > > > > > > > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin < > stasle...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > I believe the motivation is to have an abstraction that allows > > one > > > to > > > > > > write > > > > > > > stuff to a file in a way that is agnostic to the coder. > > > > > > > If one needs to write a non-Avro protocol to a file, and this > > > > > particular > > > > > > > protocol does not meet the assumption made by TextIO, one might > > > need > > > > to > > > > > > > duplicate the file IO related code from AvroIO. > > > > > > > > > > > > > > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov > > > > > > > <kirpic...@google.com.invalid> wrote: > > > > > > > > > > > > > >> Could you clarify why it would be useful to write objects to > > files > > > > > using > > > > > > >> Beam coders, as opposed to just using e.g. AvroIO? > > > > > > >> > > > > > > >> Coders (should) make no promise as to what their wire format > is, > > > so > > > > > such > > > > > > >> files could be read back only by other Beam pipelines using > the > > > same > > > > > IO. > > > > > > >> > > > > > > >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <aviem...@gmail.com > > > > > > wrote: > > > > > > >> > > > > > > >> > So If I understand the general agreement is that TextIO > should > > > not > > > > > > >> support > > > > > > >> > anything but lines from files as strings. > > > > > > >> > I'll go ahead and file a ticket that says the Javadoc should > > be > > > > > > changed > > > > > > >> to > > > > > > >> > reflect this and `withCoder` method should be removed. > > > > > > >> > > > > > > > >> > Is there merit for Beam to supply an IO which does allow > > writing > > > > > > objects > > > > > > >> to > > > > > > >> > a file using Beam coders and Beam FS (To write these files > to > > > > > > >> > GS/Hadoop/Local)? > > > > > > >> > > > > > > > >> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov > > > > > > >> > <kirpic...@google.com.invalid> wrote: > > > > > > >> > > > > > > > >> > P.S. Note that this point (about coders) is also mentioned > in > > > the > > > > > > >> > now-being-reviewed PTransform Style Guide > > > > > > >> > https://github.com/apache/beam-site/pull/134 > > > > > > >> > currently staged at > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > http://apache-beam-website-pull-requests.storage. > googleapis.com/134/contribute/ptransform-style-guide/index.html#coders > > > > > > >> > > > > > > > >> > > > > > > > >> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath < > > > > > > chamik...@apache.org > > > > > > >> > > > > > > > >> > wrote: > > > > > > >> > > > > > > > >> > > +1 to what Eugene said. > > > > > > >> > > > > > > > > >> > > I've seen a number of Python SDK users incorrectly > assuming > > > that > > > > > > >> > > coder.decode() is needed when developing their own > > file-based > > > > > > sources > > > > > > >> > > (since many users usually refer to text source first). > > > Probably > > > > > > coder > > > > > > >> > > parameter should not be configurable for text source/sink > > and > > > > they > > > > > > >> should > > > > > > >> > > be updated to only read/write UTF-8 encoded strings. > > > > > > >> > > > > > > > > >> > > - Cham > > > > > > >> > > > > > > > > >> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov > > > > > > >> > > <kirpic...@google.com.invalid> wrote: > > > > > > >> > > > > > > > > >> > > > The use of Coder in TextIO is a long standing design > issue > > > > > because > > > > > > >> > coders > > > > > > >> > > > are not intended to be used for general purpose > converting > > > > > things > > > > > > >> from > > > > > > >> > > and > > > > > > >> > > > to bytes, their only proper use is letting the runner > > > > > materialize > > > > > > > and > > > > > > >> > > > restore objects if the runner thinks it's necessary. IMO > > it > > > > > should > > > > > > >> have > > > > > > >> > > > been called LineIO, document that it reads lines of text > > as > > > > > > String, > > > > > > >> and > > > > > > >> > > not > > > > > > >> > > > have a withCoder parameter at all. > > > > > > >> > > > > > > > > > >> > > > The proper way to address your use case is to write a > > custom > > > > > > >> > > > FileBasedSource. > > > > > > >> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur < > > > aviem...@gmail.com > > > > > > > > > > > >> wrote: > > > > > > >> > > > > > > > > > >> > > > > The Javadoc of TextIO states: > > > > > > >> > > > > > > > > > > >> > > > > * <p>By default, {@link TextIO.Read} returns a {@link > > > > > > PCollection} > > > > > > >> of > > > > > > >> > > > > {@link String Strings}, > > > > > > >> > > > > * each corresponding to one line of an input UTF-8 > text > > > > file. > > > > > > To > > > > > > >> > > convert > > > > > > >> > > > > directly from the raw > > > > > > >> > > > > * bytes (split into lines delimited by '\n', '\r', or > > > > '\r\n') > > > > > > to > > > > > > >> > > another > > > > > > >> > > > > object of type {@code T}, > > > > > > >> > > > > * supply a {@code Coder<T>} using {@link > > > > > > >> > > TextIO.Read#withCoder(Coder)}. > > > > > > >> > > > > > > > > > > >> > > > > However, as I stated, `withCoder` doesn't seem to have > > > > tests, > > > > > > and > > > > > > >> > > > probably > > > > > > >> > > > > won't work given the hard-coded '\n' delimiter. > > > > > > >> > > > > > > > > > > >> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré > < > > > > > > >> > j...@nanthrax.net > > > > > > >> > > > > > > > > > >> > > > > wrote: > > > > > > >> > > > > > > > > > > >> > > > > > Hi Aviem, > > > > > > >> > > > > > > > > > > > >> > > > > > TextIO is not designed to write/read binary file: > it's > > > > pure > > > > > > > Text, > > > > > > >> > so > > > > > > >> > > > > > String. > > > > > > >> > > > > > > > > > > > >> > > > > > Regards > > > > > > >> > > > > > JB > > > > > > >> > > > > > > > > > > > >> > > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote: > > > > > > >> > > > > > > Hi, > > > > > > >> > > > > > > > > > > > > >> > > > > > > While trying to use TextIO to write/read a binary > > file > > > > > > rather > > > > > > >> > than > > > > > > >> > > > > String > > > > > > >> > > > > > > lines from a textual file I ran into an issue - > the > > > > > > delimiter > > > > > > >> > > TextIO > > > > > > >> > > > > uses > > > > > > >> > > > > > > seems to be hardcoded '\n'. > > > > > > >> > > > > > > See `findSeparatorBounds` - > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/beam/blob/master/sdks/java/ > core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024 > > > > > > >> > > > > > > > > > > > > >> > > > > > > The use case is to have a file of objects, encoded > > > into > > > > > > bytes > > > > > > >> > > using a > > > > > > >> > > > > > > coder. However, '\n' is not a good delimiter here, > > as > > > > you > > > > > > can > > > > > > >> > > > imagine. > > > > > > >> > > > > > > A similar pattern is found in Spark's > > > `saveAsObjectFile` > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/spark/blob/master/core/src/ > main/scala/org/apache/spark/rdd/RDD.scala#L1512 > > > > > > >> > > > > > > where > > > > > > >> > > > > > > they use a more appropriate delimiter, to avoid > such > > > > > issues. > > > > > > >> > > > > > > > > > > > > >> > > > > > > I did not find any unit tests which use TextIO to > > read > > > > > > > anything > > > > > > >> > > other > > > > > > >> > > > > > than > > > > > > >> > > > > > > Strings. > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > -- > > > > > > >> > > > > > Jean-Baptiste Onofré > > > > > > >> > > > > > jbono...@apache.org > > > > > > >> > > > > > http://blog.nanthrax.net > > > > > > >> > > > > > Talend - http://www.talend.com > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > >