OK, I see what you mean. However, I still think this can be solved without introducing a new "Beam object file" (or whatever) file format, and thereby without introducing additional use cases and compatibility constraints on coders.
I asked earlier in the thread why not just use AvroIO (it can serialize arbitrary POJOs using reflection); I skimmed the thread, and it doesn't seem like that got answered properly. I also like Dan's suggestion to use AvroIO to serialize byte[] arrays, so that you can do whatever you want with them (e.g. use another serialization library such as Kryo, or Java serialization).

On Sun, Feb 5, 2017 at 11:37 AM Aviem Zur <[email protected]> wrote:

> I agree that these files will serve no use outside of Beam pipelines.
>
> The rationale was that you might want to have one pipeline write output to files and then have a different pipeline that uses those files as inputs.
>
> Say one team in your organization creates a pipeline and a different team utilizes those files as input for a different pipeline. The contract between them is the file, in a Beam-readable format.
> This is similar to Spark's `saveAsObjectFile`
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
>
> The merit for something like this in my eyes is to not burden the user with writing a custom IO
>
> On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov <[email protected]> wrote:
>
> +1 to Robert. Either this will be a Beam-specific file format (and then nothing except Beam will be able to read it - which I doubt is what you want), or it is an existing well-known file format and then we should just develop an IO for it.
> Note that any file format that involves encoding elements with a Coder is Beam-specific, because wire format of coders is Beam-specific.
>
> On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw <[email protected]> wrote:
>
> > On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <[email protected]> wrote:
> > > +1 on what Stas said.
> > > I think there is value in not having the user write a custom IO for a protocol they use which is not covered by Beam IOs. Plus having them deal with not only the encoding but also the IO part is not ideal.
> > > I think having a basic FileIO that can write to the Filesystems supported by Beam (GS/HDFS/Local/...) which you can use any coder with, including your own custom coder, can be beneficial.
> >
> > What would the format of the file be? Just the concatenation of the elements encoded according to the coder? Or is there a delimiter needed to separate records. In which case how does one ensure the delimiter does not also appear in the middle of an encoded element? At this point you're developing a file format, and might as well stick with one of the standard ones. https://xkcd.com/927
> >
> > > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <[email protected]> wrote:
> > >
> > > I believe the motivation is to have an abstraction that allows one to write stuff to a file in a way that is agnostic to the coder.
> > > If one needs to write a non-Avro protocol to a file, and this particular protocol does not meet the assumption made by TextIO, one might need to duplicate the file IO related code from AvroIO.
> > >
> > > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov <[email protected]> wrote:
> > >
> > >> Could you clarify why it would be useful to write objects to files using Beam coders, as opposed to just using e.g. AvroIO?
> > >>
> > >> Coders (should) make no promise as to what their wire format is, so such files could be read back only by other Beam pipelines using the same IO.
> > >>
> > >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <[email protected]> wrote:
> > >>
> > >> > So If I understand the general agreement is that TextIO should not support anything but lines from files as strings.
> > >> > I'll go ahead and file a ticket that says the Javadoc should be changed to reflect this and `withCoder` method should be removed.
> > >> >
> > >> > Is there merit for Beam to supply an IO which does allow writing objects to a file using Beam coders and Beam FS (To write these files to GS/Hadoop/Local)?
> > >> >
> > >> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov <[email protected]> wrote:
> > >> >
> > >> > P.S. Note that this point (about coders) is also mentioned in the now-being-reviewed PTransform Style Guide https://github.com/apache/beam-site/pull/134 currently staged at
> > >> > http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> > >> >
> > >> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <[email protected]> wrote:
> > >> >
> > >> > > +1 to what Eugene said.
> > >> > >
> > >> > > I've seen a number of Python SDK users incorrectly assuming that coder.decode() is needed when developing their own file-based sources (since many users usually refer to text source first). Probably coder parameter should not be configurable for text source/sink and they should be updated to only read/write UTF-8 encoded strings.
> > >> > >
> > >> > > - Cham
> > >> > >
> > >> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov <[email protected]> wrote:
> > >> > >
> > >> > > > The use of Coder in TextIO is a long standing design issue because coders are not intended to be used for general purpose converting things from and to bytes, their only proper use is letting the runner materialize and restore objects if the runner thinks it's necessary.
> > >> > > > IMO it should have been called LineIO, document that it reads lines of text as String, and not have a withCoder parameter at all.
> > >> > > >
> > >> > > > The proper way to address your use case is to write a custom FileBasedSource.
> > >> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <[email protected]> wrote:
> > >> > > >
> > >> > > > > The Javadoc of TextIO states:
> > >> > > > >
> > >> > > > > * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
> > >> > > > > * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
> > >> > > > > * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
> > >> > > > > * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
> > >> > > > >
> > >> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and probably won't work given the hard-coded '\n' delimiter.
> > >> > > > >
> > >> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <[email protected]> wrote:
> > >> > > > >
> > >> > > > > > Hi Aviem,
> > >> > > > > >
> > >> > > > > > TextIO is not designed to write/read binary file: it's pure Text, so String.
> > >> > > > > >
> > >> > > > > > Regards
> > >> > > > > > JB
> > >> > > > > >
> > >> > > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > >> > > > > > > Hi,
> > >> > > > > > >
> > >> > > > > > > While trying to use TextIO to write/read a binary file rather than String lines from a textual file I ran into an issue - the delimiter TextIO uses seems to be hardcoded '\n'.
> > >> > > > > > > See `findSeparatorBounds` -
> > >> > > > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > >> > > > > > >
> > >> > > > > > > The use case is to have a file of objects, encoded into bytes using a coder. However, '\n' is not a good delimiter here, as you can imagine.
> > >> > > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > >> > > > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > >> > > > > > > where they use a more appropriate delimiter, to avoid such issues.
> > >> > > > > > >
> > >> > > > > > > I did not find any unit tests which use TextIO to read anything other than Strings.
> > >> > > > > > >
> > >> > > > > > --
> > >> > > > > > Jean-Baptiste Onofré
> > >> > > > > > [email protected]
> > >> > > > > > http://blog.nanthrax.net
> > >> > > > > > Talend - http://www.talend.com
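
To make the delimiter question quoted above concrete, here is a minimal sketch in plain Python. It uses no Beam APIs: the function names and the 4-byte length header are assumptions made up for illustration, not an existing or proposed Beam format. It only shows why splitting arbitrary encoded bytes on '\n' loses record boundaries, and that framing each record with its length avoids the problem, which is essentially what container formats such as Avro already handle for you.

```python
import io
import struct


def write_newline_delimited(records):
    # Naive format: concatenate the encoded records, each followed by 0x0A.
    return b"".join(record + b"\n" for record in records)


def read_newline_delimited(data):
    # Split on 0x0A; wrong whenever an encoded record contains that byte.
    return data.split(b"\n")[:-1]


def write_length_prefixed(records):
    # Frame each record with a 4-byte big-endian length header (illustrative choice).
    out = io.BytesIO()
    for record in records:
        out.write(struct.pack(">I", len(record)))
        out.write(record)
    return out.getvalue()


def read_length_prefixed(data):
    # Read back records framed by write_length_prefixed.
    stream = io.BytesIO(data)
    records = []
    while True:
        header = stream.read(4)
        if not header:
            break  # clean end of input
        if len(header) != 4:
            raise ValueError("truncated length header")
        (size,) = struct.unpack(">I", header)
        payload = stream.read(size)
        if len(payload) != size:
            raise ValueError("truncated record")
        records.append(payload)
    return records


# Hypothetical encoded elements; the second one (the integer 10 as 4 bytes)
# happens to contain the byte 0x0A, i.e. '\n'.
encoded = [b"\x00\x01\x02", struct.pack(">I", 10), b"plain"]

naive = read_newline_delimited(write_newline_delimited(encoded))
framed = read_length_prefixed(write_length_prefixed(encoded))

print(naive == encoded)   # the newline-delimited round trip does not preserve the records
print(framed == encoded)  # the length-prefixed round trip does
```

Even the length-prefixed variant is still a new file format with its own compatibility story (no schema, no sync markers, no compression), which is the reason I would rather reuse an existing container.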
