[
https://issues.apache.org/jira/browse/BEAM-73?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kenneth Knowles updated BEAM-73:
--------------------------------
Fix Version/s: First stable release
> IO design pattern: Decouple Parsers and Coders
> ----------------------------------------------
>
> Key: BEAM-73
> URL: https://issues.apache.org/jira/browse/BEAM-73
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-core
> Reporter: Daniel Halperin
> Priority: Minor
> Labels: backward-incompatible
> Fix For: First stable release
>
>
> Many Sources can be thought of as providing a byte[] payload -- e.g. TextIO
> bytes between newlines, or PubSubIO messages. Therefore, we originally
> suggested a Coder as the thing to use to decode these byte[] into T (what
> I'll call Parsing).
> Consider the case of a text file of integers.
> 123\n
> 456\n
> ...
> We want a PCollection<Integer> out, so we can use TextualIntegerCoder with
> TextIO.Read. However, that Coder will get propagated as the default coder for
> that PCollection (and may be used in downstream DoFns). This seems bad because,
> once the data is parsed, we probably want to use VarIntCoder or another Coder
> that is more CPU- and space-efficient.
> Another design pattern is
>   TextIO.Read() -> MapElements<String, Integer>(lambda s: Integer.parseInt(s))
> This has better behavior, but now we go from byte[] to String to Integer
> rather than directly from byte[] to Integer.
> The solution seems to be to explicitly add Parser and Coder abstractions.
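To make the proposed split concrete, here is a minimal sketch in plain Java. It does not use the Beam SDK: Parser, TextualIntegerParser, and VarInt are hypothetical names invented for illustration, and the hand-written variable-length encoding only stands in for a compact coder such as VarIntCoder. The parser goes directly from byte[] to Integer without an intermediate String, and the parsed value is then re-encoded by a separate, more compact encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical interface (not part of the Beam SDK): turns a raw source payload into a T.
interface Parser<T> {
    T parse(byte[] payload);
}

// Hypothetical parser: reads ASCII decimal digits straight from the bytes,
// without building an intermediate String.
class TextualIntegerParser implements Parser<Integer> {
    @Override
    public Integer parse(byte[] payload) {
        if (payload.length == 0) {
            throw new NumberFormatException("empty payload");
        }
        boolean negative = payload[0] == '-';
        int i = negative ? 1 : 0;
        if (i == payload.length) {
            throw new NumberFormatException("no digits");
        }
        long value = 0;
        for (; i < payload.length; i++) {
            int digit = payload[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at offset " + i);
            }
            value = value * 10 + digit;
            // Allow MAX_VALUE + 1 here so that Integer.MIN_VALUE can still be parsed.
            if (value > (long) Integer.MAX_VALUE + 1) {
                throw new NumberFormatException("out of int range");
            }
        }
        if (negative) {
            value = -value;
        }
        if (value > Integer.MAX_VALUE) {
            throw new NumberFormatException("out of int range");
        }
        return (int) value;
    }
}

// Hypothetical stand-in for a compact coder: 7 bits per byte, high bit set
// on every byte except the last. The int is treated as unsigned 32-bit.
final class VarInt {
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int v = value;
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
        return out.toByteArray();
    }

    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return result;
    }
}

class ParserCoderSketch {
    public static void main(String[] args) {
        Parser<Integer> parser = new TextualIntegerParser();
        // Payload as a source might hand it over: the bytes between two newlines.
        byte[] payload = "123".getBytes(StandardCharsets.US_ASCII);
        int parsed = parser.parse(payload);
        byte[] compact = VarInt.encode(parsed);
        System.out.println(parsed + ": text bytes=" + payload.length
                + ", varint bytes=" + compact.length);
    }
}
```

In this sketch the parser is used only at the source boundary, so the textual format never becomes the default encoding of the resulting collection. Downstream steps are free to use whichever compact encoding suits the parsed type.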