[
https://issues.apache.org/jira/browse/BEAM-2661?focusedWorklogId=126311&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-126311
]
ASF GitHub Bot logged work on BEAM-2661:
----------------------------------------
Author: ASF GitHub Bot
Created on: 23/Jul/18 21:29
Start Date: 23/Jul/18 21:29
Worklog Time Spent: 10m
Work Description: timrobertson100 opened a new pull request #6021:
[BEAM-2661] Adds KuduIO
URL: https://github.com/apache/beam/pull/6021
Provides an implementation and tests for KuduIO.
Please note that design decisions have been captured on
[BEAM-2661](https://issues.apache.org/jira/browse/BEAM-2661).
This implementation follows similar design patterns to `CassandraIO` and
naming convention from `BigQueryIO`.
The decision to use mocked and faked services for the unit tests was not
taken lightly; they will be replaced when Kudu offers an easier solution for Java
(see [KUDU-2411](https://issues.apache.org/jira/browse/KUDU-2411)).
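As a rough illustration of the fake-service approach to unit testing described above, the sketch below hides the cluster behind a small interface and substitutes an in-memory fake. All names here (`Service`, `FakeService`, `writeAll`) are hypothetical stand-ins written with the JDK only; they are not the classes in this pull request or in the Kudu client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins only: none of these types are Beam or Kudu classes.
class FakeServiceSketch {

  /** Seam between the IO and the cluster; production code would wrap a real client. */
  interface Service<T> {
    void write(T element);

    void close();
  }

  /** In-memory fake that records writes instead of contacting a cluster. */
  static final class FakeService<T> implements Service<T> {
    final List<T> written = new ArrayList<>();
    boolean closed = false;

    @Override
    public void write(T element) {
      if (closed) {
        throw new IllegalStateException("write after close");
      }
      written.add(element);
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Code under test: it depends only on the Service interface. */
  static <T> void writeAll(Service<T> service, List<T> elements) {
    try {
      for (T element : elements) {
        service.write(element);
      }
    } finally {
      service.close(); // always release the (fake or real) session
    }
  }

  public static void main(String[] args) {
    FakeService<String> fake = new FakeService<>();
    writeAll(fake, Arrays.asList("a", "b", "c"));
    System.out.println(fake.written + " closed=" + fake.closed);
  }
}
```

Because the code under test sees only the interface, the same test could later run against a real mini cluster once one is portable, without changing the assertions.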
This implementation will benefit from the addition of authentication and the
`BoundedSource` could be replaced by a `DoFn`. I propose adding those at a
later date.
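For illustration, the API shape described on BEAM-2661 (the user supplies a function that converts each element into an operation, which the transform then applies) might be sketched as below. `SketchOperation`, `FormatFunction` and `write` are hypothetical stand-ins written with the JDK only; they are not the Beam or Kudu types, and the real signatures may differ.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only: none of these types are Beam or Kudu classes.
class FormatFunctionSketch {

  /** Stand-in for a client-side operation that cannot itself be serialized. */
  static final class SketchOperation {
    final String kind;
    final Map<String, Object> columns = new LinkedHashMap<>();

    SketchOperation(String kind) {
      this.kind = kind;
    }
  }

  /** User-supplied, serializable mapping from an element to an operation. */
  interface FormatFunction<T> extends Serializable {
    SketchOperation apply(T element);
  }

  /** Stand-in for the writer: each operation is built just before it is applied. */
  static <T> List<SketchOperation> write(List<T> elements, FormatFunction<T> fn) {
    List<SketchOperation> applied = new ArrayList<>();
    for (T element : elements) {
      // Only the element and the function travel; the operation is created here.
      applied.add(fn.apply(element));
    }
    return applied;
  }

  public static void main(String[] args) {
    FormatFunction<String> upsertName = name -> {
      SketchOperation op = new SketchOperation("UPSERT");
      op.columns.put("name", name);
      op.columns.put("length", name.length());
      return op;
    };
    for (SketchOperation op : write(Arrays.asList("kudu", "beam"), upsertName)) {
      System.out.println(op.kind + " " + op.columns);
    }
  }
}
```

The point of this shape is that only plain elements and a small serializable function need to cross worker boundaries, so no coder for the operation type is required.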
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure.
Issue Time Tracking
-------------------
Worklog Id: (was: 126311)
Time Spent: 10m
Remaining Estimate: 0h
> Add KuduIO
> ----------
>
> Key: BEAM-2661
> URL: https://issues.apache.org/jira/browse/BEAM-2661
> Project: Beam
> Issue Type: New Feature
> Components: io-ideas
> Reporter: Jean-Baptiste Onofré
> Assignee: Tim Robertson
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> New IO for Apache Kudu ([https://kudu.apache.org/overview.html]).
> This work is in progress [on this
> branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with
> design aspects documented below.
> h2. The API
> The {{KuduIO}} API requires the user to provide a function that converts objects
> into operations. This is similar to {{JdbcIO}} but differs from others,
> such as {{HBaseIO}}, which requires a separate transform stage beforehand to
> convert elements into the mutations to apply. It was originally intended to copy the
> {{HBaseIO}} approach, but this was not possible:
> # The Kudu
> [Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
> is a fat class, and is a subclass of {{KuduRpc<OperationResponse>}}. It
> holds RPC logic, callbacks and a Kudu client. Because of this, the
> {{Operation}} does not serialize. Furthermore, the logic for encoding the
> operations (Insert, Upsert etc.) in the Kudu Java API is one-way only (no
> decode) because the server is written in C++.
> # An alternative could be to introduce a new object to beam (e.g.
> {{o.a.b.sdk.io.kudu.KuduOperation}}) to enable
> {{PCollection<KuduOperation>}}. This was considered but was discounted
> because:
> ## It is not a familiar API to those who already know Kudu
> ## It still requires serialization and deserialization of the operations.
> Using the existing Kudu approach of serializing into compact byte arrays
> would require a decoder along the lines of [this almost complete
> example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e].
> This is possible but has fragilities given the Kudu code itself continues to
> evolve.
> ## Deferring the object-to-mutation mapping to within the {{KuduIO}} transform
> leaves only a trivial codebase to maintain in Beam. {{JdbcIO}} gives us the
> precedent to do this.
> h2. Testing framework
> {{Kudu}} is written in C++. While a
> [TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java]
> does exist in Java, it requires binaries to be available for the target
> environment, which is not portable (edit: this is now a [work in
> progress|https://issues.apache.org/jira/browse/KUDU-2411] in Kudu). Therefore
> we opt for the following:
> # Unit tests will use a mock Kudu client
> # Integration tests will cover all aspects of {{KuduIO}} and use a
> Docker-based Kudu instance
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)