Stas' comment is the right one. The "canonical" use of a coder with TextIO is something like a TextualIntegerCoder <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/TextualIntegerCoder.java>, but that should almost certainly be replaced with TextIO.Read | ParDo.of(Parse integer). The `withCoder` functions need to be removed or replaced.
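To make the read-then-parse shape concrete, here is a minimal sketch in plain Java rather than Beam. The names `LineParsing`, `parseLine`, and `parseAll` are hypothetical: `parseLine` stands in for the body of the parsing ParDo, and the list of strings stands in for the lines TextIO.Read would produce.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Beam: reading yields Strings,
// and turning them into Integers is a separate, explicit step.
final class LineParsing {
    // Stand-in for the per-element parse step (the ParDo body).
    static Integer parseLine(String line) {
        return Integer.parseInt(line.trim());
    }

    // Stand-in for applying that step to every line that was read.
    static List<Integer> parseAll(List<String> lines) {
        return lines.stream()
                .map(LineParsing::parseLine)
                .collect(Collectors.toList());
    }
}
```

The point of the split is that the reader only has to know about lines of text, and anything type-specific lives in the parse step.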
For "holding a file of arbitrary records" -- simply producing a delimiter-separated TextIO is probably not a good choice. Specifically, splitting is broken when the delimiter might appear in the output (e.g., when using almost any coder). A better option is to design a file format to hold arbitrary records. E.g., an Avro file where each record is just a byte[].

Dan

On Mon, Jan 30, 2017 at 2:52 AM, Aviem Zur <[email protected]> wrote:

> The Javadoc of TextIO states:
>
> * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
> * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
> * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
> * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
>
> However, as I stated, `withCoder` doesn't seem to have tests, and probably
> won't work given the hard-coded '\n' delimiter.
>
> On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <[email protected]> wrote:
>
> > Hi Aviem,
> >
> > TextIO is not designed to write/read binary file: it's pure Text, so
> > String.
> >
> > Regards
> > JB
> >
> > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > Hi,
> > >
> > > While trying to use TextIO to write/read a binary file rather than String
> > > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > > seems to be hardcoded '\n'.
> > > See `findSeparatorBounds` -
> > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > >
> > > The use case is to have a file of objects, encoded into bytes using a
> > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > where they use a more appropriate delimiter, to avoid such issues.
> > >
> > > I did not find any unit tests which use TextIO to read anything other than
> > > Strings.
> >
> > --
> > Jean-Baptiste Onofré
> > [email protected]
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
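
To illustrate the splitting problem from the reply at the top of this thread: a rough sketch in plain Java, not Beam and not Avro. It uses a hand-rolled 4-byte length prefix as a stand-in for "a file format designed to hold arbitrary records". All names here (`RecordFraming` and its methods) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: compares '\n'-terminated framing with
// length-prefixed framing for records that are arbitrary bytes.
final class RecordFraming {
    // Naive framing: each record followed by a '\n' byte.
    static byte[] writeNewlineDelimited(List<byte[]> records) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (byte[] record : records) {
            buffer.write(record, 0, record.length);
            buffer.write('\n');
        }
        return buffer.toByteArray();
    }

    // How many records a reader splitting on '\n' would see.
    static int countNewlineTerminated(byte[] file) {
        int count = 0;
        for (byte b : file) {
            if (b == '\n') {
                count++;
            }
        }
        return count;
    }

    // Length-prefixed framing: 4-byte big-endian length, then the bytes.
    static byte[] writeLengthPrefixed(List<byte[]> records) {
        int total = 0;
        for (byte[] record : records) {
            total += Integer.BYTES + record.length;
        }
        ByteBuffer buffer = ByteBuffer.allocate(total);
        for (byte[] record : records) {
            buffer.putInt(record.length);
            buffer.put(record);
        }
        return buffer.array();
    }

    // Reads records back without ever inspecting the payload bytes.
    static List<byte[]> readLengthPrefixed(byte[] file) {
        ByteBuffer buffer = ByteBuffer.wrap(file);
        List<byte[]> records = new ArrayList<>();
        while (buffer.hasRemaining()) {
            byte[] record = new byte[buffer.getInt()];
            buffer.get(record);
            records.add(record);
        }
        return records;
    }
}
```

For example, the integer 10 encoded as four big-endian bytes is {0, 0, 0, 10}, and its last byte is '\n'. Writing that record plus {0, 0, 0, 7} with newline framing gives a file that a '\n'-splitting reader sees as three records, while the length-prefixed version reads back as the original two.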
