[
https://issues.apache.org/jira/browse/BEAM-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ismaël Mejía updated BEAM-2661:
-------------------------------
Component/s: io-java-kudu
> Add KuduIO
> ----------
>
> Key: BEAM-2661
> URL: https://issues.apache.org/jira/browse/BEAM-2661
> Project: Beam
> Issue Type: New Feature
> Components: io-ideas, io-java-kudu
> Reporter: Jean-Baptiste Onofré
> Assignee: Tim Robertson
> Priority: Major
> Fix For: 2.7.0
>
> Time Spent: 4h 50m
> Remaining Estimate: 0h
>
> New IO for Apache Kudu ([https://kudu.apache.org/overview.html]).
> This work is in progress [on this
> branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with
> design aspects documented below.
> h2. The API
> The {{KuduIO}} API requires the user to provide a function to convert objects
> into operations. This is similar to {{JdbcIO}} but differs from others such
> as {{HBaseIO}}, which requires a prior transform stage to convert elements
> into the mutations to apply. The original intent was to copy the
> {{HBaseIO}} approach, but this was not possible:
> # The Kudu
> [Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
> is a fat class, and is a subclass of {{KuduRpc<OperationResponse>}}. It
> holds RPC logic, callbacks and a Kudu client. Because of this, the
> {{Operation}} does not serialize. Furthermore, the logic for encoding the
> operations (Insert, Upsert etc.) in the Kudu Java API is one way only (no
> decode) because the server is written in C++.
> # An alternative could be to introduce a new object to Beam (e.g.
> {{o.a.b.sdk.io.kudu.KuduOperation}}) to enable
> {{PCollection<KuduOperation>}}. This was considered but was discounted
> because:
> ## It is not a familiar API to those who already know Kudu.
> ## It still requires serialization and deserialization of the operations.
> Using the existing Kudu approach of serializing into compact byte arrays
> would require a decoder along the lines of [this almost complete
> example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e].
> This is possible but fragile, given that the Kudu code itself continues to
> evolve.
> ## The Beam codebase to maintain becomes trivial if the object-to-mutation
> mapping is deferred to within the {{KuduIO}} transform. {{JdbcIO}} gives us
> the precedent for this.
> h2. Testing framework
> {{Kudu}} is written in C++. While a
> [TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java]
> does exist in Java, it requires binaries to be available for the target
> environment, which is not portable (edit: this is now a [work in
> progress|https://issues.apache.org/jira/browse/KUDU-2411] in Kudu). Therefore
> we opt for the following:
> # Unit tests will use a mock Kudu client
> # Integration tests will cover all aspects of {{KuduIO}} and use a
> Docker-based Kudu instance
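
To make the design in "The API" and the unit-test plan more concrete, here is a minimal sketch of the idea in plain Java. It assumes nothing about the real classes: {{User}}, {{Upsert}}, {{FormatFunction}}, {{Session}}, {{RecordingSession}} and {{WriteStep}} are all hypothetical stand-ins invented for illustration, not the Kudu client or the Beam {{KuduIO}} API. The point it tries to show is that only the user's mapping function and some configuration need to be serializable, while the operation object is created and consumed inside the write step, and that a recording session can play the role of a mock client in unit tests.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: every type below is a hypothetical stand-in, not the Kudu or Beam API.
class KuduWriteSketch {

  // Example element type that a pipeline might carry.
  static final class User implements Serializable {
    final long id;
    final String name;
    User(long id, String name) { this.id = id; this.name = name; }
  }

  // Stand-in for a Kudu operation (here always an upsert). It is deliberately
  // not Serializable: it only ever exists inside the write step.
  static final class Upsert {
    final String table;
    final Map<String, Object> row = new LinkedHashMap<>();
    Upsert(String table) { this.table = table; }
    Upsert set(String column, Object value) { row.put(column, value); return this; }
  }

  // User-supplied mapping from an element to an operation. It is Serializable
  // so that it, rather than the operation, can be shipped to workers.
  interface FormatFunction<T> extends Serializable {
    Upsert apply(String table, T element);
  }

  // Stand-in for a client session; a real one would send RPCs to the server.
  interface Session {
    void apply(Upsert op);
  }

  // In-memory session that records operations, as a mock client might in unit tests.
  static final class RecordingSession implements Session {
    final List<Upsert> applied = new ArrayList<>();
    @Override public void apply(Upsert op) { applied.add(op); }
  }

  // The write step holds only serializable state (table name and function)
  // and builds each operation at the moment it is applied.
  static final class WriteStep<T> implements Serializable {
    private final String table;
    private final FormatFunction<T> formatFn;
    WriteStep(String table, FormatFunction<T> formatFn) {
      this.table = table;
      this.formatFn = formatFn;
    }
    void write(Iterable<T> elements, Session session) {
      for (T element : elements) {
        session.apply(formatFn.apply(table, element));
      }
    }
  }

  public static void main(String[] args) {
    RecordingSession session = new RecordingSession();
    WriteStep<User> step = new WriteStep<User>("users",
        (table, user) -> new Upsert(table).set("id", user.id).set("name", user.name));
    step.write(Arrays.asList(new User(1L, "ada"), new User(2L, "grace")), session);
    for (Upsert op : session.applied) {
      System.out.println(op.table + " <- " + op.row);
    }
  }
}
```

In a unit test, assertions would be made against the operations captured by the recording session, while the Docker-based integration tests would swap in a real client.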