Re: TextIO binary file

2017-02-06 Thread Robert Bradshaw
I agree documentation should be better around this. If we decided to go down the route of adding FileObjectIO, I would suggest making it a PTransform that wraps an AvroIO rather than a new file format (unless it can be demonstrated that there are significant performance concerns). In this case on

Re: TextIO binary file

2017-02-06 Thread Aviem Zur
I believe these are actually several different use cases with different paths (Some do not exist today): User wants to output results of a pipeline to be used in a different pipeline: (This is indeed a use case in my organization. Also, Spark sees this as a use case as well with its `saveAsObjectF

Re: TextIO binary file

2017-02-05 Thread Eugene Kirpichov
Hmm, do you have a concrete use case in mind, where all these circumstances come together?
- There is a need to write data from one Beam pipeline, and read it from another Beam pipeline
- These pipelines have to be kept separate, rather than merged into one pipeline
- It is okay that the data can n

Re: TextIO binary file

2017-02-05 Thread Aviem Zur
AvroIO is great for POJOs. But for use cases with more complex, serializable objects, or objects which are compatible with some coder it falls short. Also, expecting less savvy users to know they need to use AvroIO might be a stretch. Some simpler API along the lines of ObjectFile might be mor

Re: TextIO binary file

2017-02-05 Thread Eugene Kirpichov
OK, I see what you mean; however I still think this can be solved without introducing a new "Beam object file" (or whatever) file format, and without thereby introducing additional use cases and compatibility constraints on coders. I asked before in the thread why not just use AvroIO (it can seria

Re: TextIO binary file

2017-02-05 Thread Aviem Zur
I agree that these files will serve no use outside of Beam pipelines. The rationale was that you might want to have one pipeline write output to files and then have a different pipeline that uses those files as inputs. Say one team in your organization creates a pipeline and a different team util

Re: TextIO binary file

2017-01-31 Thread Eugene Kirpichov
+1 to Robert. Either this will be a Beam-specific file format (and then nothing except Beam will be able to read it - which I doubt is what you want), or it is an existing well-known file format and then we should just develop an IO for it. Note that any file format that involves encoding elements

Re: TextIO binary file

2017-01-31 Thread Robert Bradshaw
On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur wrote: > +1 on what Stas said. > I think there is value in not having the user write a custom IO for a > protocol they use which is not covered by Beam IOs. Plus having them deal > with not only the encoding but also the IO part is not ideal. > I think h

Re: TextIO binary file

2017-01-31 Thread Aviem Zur
+1 on what Stas said. I think there is value in not having the user write a custom IO for a protocol they use which is not covered by Beam IOs. Plus having them deal with not only the encoding but also the IO part is not ideal. I think having a basic FileIO that can write to the Filesystems support

Re: TextIO binary file

2017-01-31 Thread Stas Levin
I believe the motivation is to have an abstraction that allows one to write stuff to a file in a way that is agnostic to the coder. If one needs to write a non-Avro protocol to a file, and this particular protocol does not meet the assumption made by TextIO, one might need to duplicate the file IO
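
To make the idea of coder-agnostic file writing concrete: a file of opaque byte payloads needs a record boundary that does not depend on the payload contents. The following is a minimal sketch of one common way to do that, length-prefixed framing. It is not Beam's format and not something proposed in this thread; `write_records` and `read_records` are hypothetical names chosen for illustration.

```python
import io
import struct

def write_records(stream, records):
    """Write each payload preceded by its length as a 4-byte big-endian integer."""
    for payload in records:
        stream.write(struct.pack(">I", len(payload)))
        stream.write(payload)

def read_records(stream):
    """Yield payloads written by write_records, failing loudly on truncation."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length header")
        (size,) = struct.unpack(">I", header)
        payload = stream.read(size)
        if len(payload) < size:
            raise ValueError("truncated record")
        yield payload

# Payloads may contain any bytes, including newlines, or be empty.
buffer = io.BytesIO()
original = [b"plain", b"with\nnewline", b"", struct.pack(">i", 10)]
write_records(buffer, original)
buffer.seek(0)
restored = list(read_records(buffer))
```

Because the reader never inspects payload bytes to find a boundary, how each payload was encoded does not matter to the file layer, which is the separation Stas describes.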

Re: TextIO binary file

2017-01-31 Thread Eugene Kirpichov
Could you clarify why it would be useful to write objects to files using Beam coders, as opposed to just using e.g. AvroIO? Coders (should) make no promise as to what their wire format is, so such files could be read back only by other Beam pipelines using the same IO. On Tue, Jan 31, 2017 at 2:4

Re: TextIO binary file

2017-01-31 Thread Aviem Zur
Looks like Eugene addressed this in the following ticket: https://issues.apache.org/jira/browse/BEAM-1354 Just added a bullet regarding updating the javadoc. On Tue, Jan 31, 2017 at 12:47 PM Aviem Zur wrote: > So If I understand the general agreement is that TextIO should not support > anything

Re: TextIO binary file

2017-01-31 Thread Aviem Zur
So if I understand correctly, the general agreement is that TextIO should not support anything but lines from files as strings. I'll go ahead and file a ticket that says the Javadoc should be changed to reflect this and the `withCoder` method should be removed. Is there merit for Beam to supply an IO which does

Re: TextIO binary file

2017-01-30 Thread Eugene Kirpichov
P.S. Note that this point (about coders) is also mentioned in the now-being-reviewed PTransform Style Guide https://github.com/apache/beam-site/pull/134 currently staged at http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders On M

Re: TextIO binary file

2017-01-30 Thread Chamikara Jayalath
+1 to what Eugene said. I've seen a number of Python SDK users incorrectly assuming that coder.decode() is needed when developing their own file-based sources (since many users refer to the text source first). The coder parameter should probably not be configurable for text source/sink and they sh

Re: TextIO binary file

2017-01-30 Thread Eugene Kirpichov
The use of Coder in TextIO is a long-standing design issue, because coders are not intended for general-purpose conversion of things from and to bytes; their only proper use is letting the runner materialize and restore objects if the runner thinks it's necessary. IMO it should have been cal

Re: TextIO binary file

2017-01-30 Thread Dan Halperin
Stas' comment is the right one. The "canonical" use of TextIO is using something like a TextualIntegerCoder, but that should almost certainly be replaced with TextIO.Read |
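
The snippet above is cut off, so the following is only a guess at the direction: read lines as plain strings first, then parse them in a separate step, instead of asking the text source to decode integers through a coder. This plain-Python sketch does not use Beam; `read_lines` and `parse_int` are hypothetical names standing in for a text source and a downstream parsing step.

```python
import io

def read_lines(stream):
    """Yield each line as a str, without its trailing line terminator."""
    for line in stream:
        yield line.rstrip("\r\n")

def parse_int(line):
    """Parsing is a separate step that knows nothing about how lines were read."""
    return int(line)

# The "source" only produces strings; turning them into integers happens afterwards.
source = io.StringIO("1\n22\n333\n")
numbers = [parse_int(line) for line in read_lines(source)]
```

Keeping the two steps apart means the reader only ever deals with strings, and a parsing failure surfaces in the parsing step instead of inside the source.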

Re: TextIO binary file

2017-01-30 Thread Aviem Zur
The Javadoc of TextIO states:
 * By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
 * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
 * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object

Re: TextIO binary file

2017-01-30 Thread Jean-Baptiste Onofré
Hi Aviem, TextIO is not designed to write/read binary files: it's pure text, so String. Regards JB On 01/30/2017 09:24 AM, Aviem Zur wrote: Hi, While trying to use TextIO to write/read a binary file rather than String lines from a textual file I ran into an issue - the delimiter TextIO uses se
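
The delimiter issue can be reproduced without Beam. The sketch below assumes a sink that appends `b"\n"` after every record and a source that splits on `b"\n"`; it is a simplified stand-in for line-oriented IO, not TextIO's actual implementation, and the function names are hypothetical.

```python
import struct

def write_newline_delimited(records):
    """Join binary records with b'\\n', as a line-oriented sink would."""
    return b"".join(record + b"\n" for record in records)

def read_newline_delimited(data):
    """Split on b'\\n', as a line-oriented source would."""
    return data.split(b"\n")[:-1]  # drop the empty piece after the final delimiter

# Big-endian 32-bit integers: the value 10 encodes to b'\x00\x00\x00\n',
# so its last byte is indistinguishable from the record delimiter.
records = [struct.pack(">i", n) for n in (1, 10, 300)]
round_tripped = read_newline_delimited(write_newline_delimited(records))
```

Three records go in and four pieces come out, because the encoded value 10 is split at its own final byte. Any binary encoding that can emit a delimiter byte has this problem when written through a line-based format.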

Re: TextIO binary file

2017-01-30 Thread Stas Levin
Dear Aviem, That's a good point. TextIO seems to make a few "string oriented" assumptions, see TextIO.Write's header and footer support, and IO design pattern: Decouple Parsers and Coders (BEAM-73) that came