On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <aviem...@gmail.com> wrote:
> +1 on what Stas said.
> I think there is value in not having the user write a custom IO for a
> protocol they use which is not covered by Beam IOs. Plus having them deal
> with not only the encoding but also the IO part is not ideal.
> I think having a basic FileIO that can write to the Filesystems supported
> by Beam (GS/HDFS/Local/...) which you can use any coder with, including
> your own custom coder, can be beneficial.

What would the format of the file be? Just the concatenation of the
elements encoded according to the coder? Or is a delimiter needed to
separate records? In that case, how does one ensure the delimiter does
not also appear in the middle of an encoded element? At this point
you're developing a file format, and might as well stick with one of
the standard ones. https://xkcd.com/927
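
To make the delimiter problem concrete, here is a minimal sketch in plain
Python. It is not Beam API: the function names are hypothetical, and the
4-byte big-endian length prefix is just one assumed framing choice. It
shows a '\n'-delimited file corrupting a record that happens to contain
'\n', and a length-prefixed layout round-tripping the same records.

```python
import struct

# Hypothetical helpers for illustration only; none of this is Beam API.

def write_delimited(records, delimiter=b"\n"):
    """Concatenate encoded records, each followed by a delimiter."""
    return b"".join(r + delimiter for r in records)

def read_delimited(data, delimiter=b"\n"):
    """Split on the delimiter; the final piece after the last delimiter is dropped."""
    return data.split(delimiter)[:-1]

def write_length_prefixed(records):
    """Prefix each record with its length as a 4-byte big-endian unsigned int."""
    return b"".join(struct.pack(">I", len(r)) + r for r in records)

def read_length_prefixed(data):
    """Read back records written by write_length_prefixed."""
    out, pos = [], 0
    while pos < len(data):
        (n,) = struct.unpack_from(">I", data, pos)
        pos += 4
        out.append(data[pos:pos + n])
        pos += n
    return out

# Stand-ins for coder output; the second record contains the delimiter byte.
records = [b"abc", b"x\ny", b""]

delimited_roundtrip = read_delimited(write_delimited(records))
prefixed_roundtrip = read_length_prefixed(write_length_prefixed(records))

print(delimited_roundtrip == records)  # the record with '\n' inside was split
print(prefixed_roundtrip == records)
```

Even the length-prefixed version is already a small file format (byte
order, prefix width, no header or versioning, not splittable without an
index), which is the point above.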

> On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <stasle...@gmail.com> wrote:
>
> I believe the motivation is to have an abstraction that allows one to write
> stuff to a file in a way that is agnostic to the coder.
> If one needs to write a non-Avro protocol to a file, and this particular
> protocol does not meet the assumption made by TextIO, one might need to
> duplicate the file IO related code from AvroIO.
>
> On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> <kirpic...@google.com.invalid> wrote:
>
>> Could you clarify why it would be useful to write objects to files using
>> Beam coders, as opposed to just using e.g. AvroIO?
>>
>> Coders (should) make no promise as to what their wire format is, so such
>> files could be read back only by other Beam pipelines using the same IO.
>>
>> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <aviem...@gmail.com> wrote:
>>
>> > So If I understand the general agreement is that TextIO should not
>> > support anything but lines from files as strings.
>> > I'll go ahead and file a ticket that says the Javadoc should be changed
>> > to reflect this and `withCoder` method should be removed.
>> >
>> > Is there merit for Beam to supply an IO which does allow writing objects
>> > to a file using Beam coders and Beam FS (To write these files to
>> > GS/Hadoop/Local)?
>> >
>> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
>> > <kirpic...@google.com.invalid> wrote:
>> >
>> > P.S. Note that this point (about coders) is also mentioned in the
>> > now-being-reviewed PTransform Style Guide
>> > https://github.com/apache/beam-site/pull/134
>> > currently staged at
>> > http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
>> >
>> >
>> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <chamik...@apache.org>
>> > wrote:
>> >
>> > > +1 to what Eugene said.
>> > >
>> > > I've seen a number of Python SDK users incorrectly assuming that
>> > > coder.decode() is needed when developing their own file-based sources
>> > > (since many users usually refer to text source first). Probably coder
>> > > parameter should not be configurable for text source/sink and they
>> > > should be updated to only read/write UTF-8 encoded strings.
>> > >
>> > > - Cham
>> > >
>> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
>> > > <kirpic...@google.com.invalid> wrote:
>> > >
>> > > > The use of Coder in TextIO is a long standing design issue because
>> > > > coders are not intended to be used for general purpose converting
>> > > > things from and to bytes, their only proper use is letting the runner
>> > > > materialize and restore objects if the runner thinks it's necessary.
>> > > > IMO it should have been called LineIO, document that it reads lines
>> > > > of text as String, and not have a withCoder parameter at all.
>> > > >
>> > > > The proper way to address your use case is to write a custom
>> > > > FileBasedSource.
>> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <aviem...@gmail.com> wrote:
>> > > >
>> > > > > The Javadoc of TextIO states:
>> > > > >
>> > > > >  * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
>> > > > >  * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
>> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
>> > > > >  * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
>> > > > >
>> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and
>> > > > > probably won't work given the hard-coded '\n' delimiter.
>> > > > >
>> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Aviem,
>> > > > > >
>> > > > > > TextIO is not designed to write/read binary file: it's pure Text,
>> > > > > > so String.
>> > > > > >
>> > > > > > Regards
>> > > > > > JB
>> > > > > >
>> > > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
>> > > > > > > Hi,
>> > > > > > >
>> > > > > > > While trying to use TextIO to write/read a binary file rather
>> > > > > > > than String lines from a textual file I ran into an issue - the
>> > > > > > > delimiter TextIO uses seems to be hardcoded '\n'.
>> > > > > > > See `findSeparatorBounds` -
>> > > > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
>> > > > > > >
>> > > > > > > The use case is to have a file of objects, encoded into bytes
>> > > > > > > using a coder. However, '\n' is not a good delimiter here, as you
>> > > > > > > can imagine.
>> > > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
>> > > > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
>> > > > > > > where they use a more appropriate delimiter, to avoid such issues.
>> > > > > > >
>> > > > > > > I did not find any unit tests which use TextIO to read anything
>> > > > > > > other than Strings.
>> > > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Jean-Baptiste Onofré
>> > > > > > jbono...@apache.org
>> > > > > > http://blog.nanthrax.net
>> > > > > > Talend - http://www.talend.com
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
