[ 
https://issues.apache.org/jira/browse/BEAM-6347?focusedWorklogId=180956&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-180956
 ]

ASF GitHub Bot logged work on BEAM-6347:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 04/Jan/19 03:52
            Start Date: 04/Jan/19 03:52
    Worklog Time Spent: 10m 
      Work Description: chamikaramj commented on pull request #7397: 
[BEAM-6347] Add website page for developing I/O connectors for Java
URL: https://github.com/apache/beam/pull/7397#discussion_r245191446
 
 

 ##########
 File path: website/src/documentation/io/developing-io-overview.md
 ##########
 @@ -0,0 +1,171 @@
+---
+layout: section
+title: "Overview: Developing a new I/O connector"
+section_menu: section-menu/documentation.html
+permalink: /documentation/io/developing-io-overview/
+redirect_from: /documentation/io/authoring-overview/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+[Pipeline I/O Table of Contents]({{site.baseurl}}/documentation/io/io-toc/)
+
+# Overview: Developing a new I/O connector
+
+_A guide for users who need to connect to a data store that isn't supported by
+the [Built-in I/O connectors]({{ site.baseurl }}/documentation/io/built-in/)_
+
+To connect to a data store that isn’t supported by Beam’s existing I/O
+connectors, you must create a custom I/O connector. A connector usually consists
+of a source and a sink. All Beam sources and sinks are composite transforms;
+however, the implementation of your custom I/O depends on your use case. Here
+are the recommended steps to get started:
+
+1. Read this overview and choose your implementation. You can email the
+   [Beam dev mailing list]({{ site.baseurl }}/get-started/support) with any
+   questions you might have. In addition, you can check if anyone else is
+   working on the same I/O connector.  
+
+1. If you plan to contribute your I/O connector to the Beam community, see the
+   [Apache Beam contribution guide]({{ site.baseurl }}/contribute/contribution-guide/).  
+
+1. Read the [PTransform style guide]({{ site.baseurl }}/contribute/ptransform-style-guide/)
+   for additional style guide recommendations.
+
+
+## Implementation options
+
+### Sources
+
+For bounded (batch) sources, there are currently two options for creating a Beam
+source:
+
+1. Use `ParDo` and `GroupByKey`.  
+
+1. Use the `Source` interface and extend the `BoundedSource` abstract subclass.
+
+`ParDo` is the recommended option, as implementing a `Source` can be tricky. See
+[When to use the Source interface](#when-to-use-source) for a list of some use
+cases where you might want to use a `Source` (such as
+[dynamic work rebalancing]({{ site.baseurl }}/blog/2016/05/18/splitAtFraction-method.html)).
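As a sketch of the `ParDo` approach (the `ReadFromKvStore` class, the key ranges, and the "record-from-" payloads below are illustrative placeholders, not a real connector): a read is typically a composite transform that creates descriptions of the work, redistributes them so they are not fused onto one worker, and then reads each description in an ordinary `DoFn`:

```java
import java.util.List;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

// Illustrative ParDo-based bounded read: split the input into range
// descriptions, redistribute them across workers, then read each range.
public class ReadFromKvStore extends PTransform<PBegin, PCollection<String>> {
  private final List<String> keyRanges;

  public ReadFromKvStore(List<String> keyRanges) {
    this.keyRanges = keyRanges;
  }

  @Override
  public PCollection<String> expand(PBegin input) {
    return input
        .apply("CreateRanges", Create.of(keyRanges))
        // Reshuffle breaks fusion so that ranges can be read in parallel.
        .apply("Redistribute", Reshuffle.viaRandomKey())
        .apply("ReadRange", ParDo.of(new ReadRangeFn()));
  }

  // Placeholder read: a real connector would call the store's client here.
  static class ReadRangeFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String range, OutputReceiver<String> out) {
      out.output("record-from-" + range);
    }
  }
}
```

The `Reshuffle` between the split and read steps is a common way to prevent the runner from fusing both steps into one worker.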
+
+(Java only) For unbounded (streaming) sources, you must use the `Source`
+interface and extend the `UnboundedSource` abstract subclass. `UnboundedSource`
+supports features that are useful for streaming pipelines, such as
+checkpointing.
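To illustrate what `UnboundedSource` asks of you, here is a toy sketch of an unbounded source that emits an increasing counter and checkpoints the last value emitted, so a restarted reader can resume without duplicates. Everything here is illustrative; a real connector would wrap an external system's client:

```java
import java.io.IOException;
import java.io.Serializable;
import java.util.Collections;
import java.util.List;
import java.util.NoSuchElementException;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.io.UnboundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.joda.time.Instant;

// Toy unbounded source: the checkpoint records the last counter value emitted.
public class CounterSource extends UnboundedSource<Long, CounterSource.Checkpoint> {

  public static class Checkpoint implements UnboundedSource.CheckpointMark, Serializable {
    final long last;
    Checkpoint(long last) { this.last = last; }
    @Override public void finalizeCheckpoint() {} // nothing to acknowledge
  }

  @Override
  public List<CounterSource> split(int desiredNumSplits, PipelineOptions options) {
    return Collections.singletonList(this); // a single, unsplittable stream
  }

  @Override
  public UnboundedReader<Long> createReader(PipelineOptions options, Checkpoint checkpoint) {
    long start = checkpoint == null ? 0 : checkpoint.last + 1;
    return new CounterReader(this, start);
  }

  @Override
  public Coder<Checkpoint> getCheckpointMarkCoder() {
    return SerializableCoder.of(Checkpoint.class);
  }

  @Override
  public Coder<Long> getOutputCoder() { return VarLongCoder.of(); }

  private static class CounterReader extends UnboundedReader<Long> {
    private final CounterSource source;
    private long current;
    private boolean started;

    CounterReader(CounterSource source, long start) {
      this.source = source;
      this.current = start - 1;
    }

    @Override public boolean start() { started = true; return advance(); }
    @Override public boolean advance() { current++; return true; }
    @Override public Long getCurrent() throws NoSuchElementException {
      if (!started) { throw new NoSuchElementException(); }
      return current;
    }
    @Override public Instant getCurrentTimestamp() { return Instant.now(); }
    @Override public Instant getWatermark() { return Instant.now(); }
    @Override public CheckpointMark getCheckpointMark() { return new Checkpoint(current); }
    @Override public CounterSource getCurrentSource() { return source; }
    @Override public void close() throws IOException {}
  }
}
```

The runner periodically calls `getCheckpointMark` and may later recreate the reader from that checkpoint, which is how exactly-once resumption is achieved.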
+
+Splittable DoFn is a new sources framework that is under development and will
+replace the other options for developing bounded and unbounded sources. For more
+information, see the
+[roadmap for multi-SDK connector efforts]({{ site.baseurl }}/roadmap/connectors-multi-sdk/).
+
+### When to use the Source interface {#when-to-use-source}
+
+If you are not sure whether to use `Source`, feel free to email the [Beam dev
+mailing list]({{ site.baseurl }}/get-started/support) and we can discuss the
+specific pros and cons of your case.
+
+In some cases, implementing a `Source` might be necessary or result in better
+performance:
+
+* **Unbounded sources:** `ParDo` does not work for reading from unbounded
+  sources, because it does not support checkpointing or mechanisms like
+  de-duping that are useful for streaming data sources.  
+
+* **Progress and size estimation:** `ParDo` can't provide hints to runners about
+  the progress or the size of the data they are reading. Without an estimate of
+  the data size or of read progress, the runner has no way to guess how large
+  your read will be, and if it attempts to dynamically allocate workers, it has
+  no clue how many workers your pipeline may need.  
+
+* **Dynamic work rebalancing:** `ParDo` does not support dynamic work
+  rebalancing, which is used by some readers to improve the processing speed of
+  jobs. Depending on your data source, dynamic work rebalancing might not be
+  possible.  
+
+* **Splitting into parts of particular size recommended by the runner:** `ParDo`
+  does not receive `desired_bundle_size` as a hint from runners when performing
+  initial splitting.
+
+For example, you might want to use a `Source` if you'd like to read from a new
+file format that contains many records per file, or if you'd like to read from a
+key-value store that supports read operations in sorted key order.
+
+
+### Sinks
+
+To create a Beam sink, we recommend that you use a single `ParDo` that writes the
+received records to the data store. However, for file-based sinks, you can use
 
 Review comment:
   To create a Beam sink, we recommend that you use a ParDo that writes the
received records to the data store, and use ParDo, GroupByKey, and other
transforms that are available in Beam to develop more complex sinks (for example,
to support data de-duplication when failures are retried by a runner). For
file-based sinks, you can use the `FileBasedSink` abstraction offered by both
the Java and Python SDKs. Please see the language-specific guidelines for more details.
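A minimal sketch of the ParDo-based sink shape described above; `StoreClient` stands in for a real data-store client and is defined here only so the example is self-contained:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PDone;

// Placeholder client; a real sink would wrap your data store's client library.
class StoreClient {
  static final List<String> STORAGE = Collections.synchronizedList(new ArrayList<>());
  static StoreClient connect() { return new StoreClient(); }
  void write(String record) { STORAGE.add(record); }
  void flush() {} // no-op for the in-memory placeholder
}

// Illustrative ParDo-based sink: connections are opened per bundle, not
// per element, and flushed before the bundle commits.
public class WriteToStore extends PTransform<PCollection<String>, PDone> {
  @Override
  public PDone expand(PCollection<String> input) {
    input.apply("WriteRecords", ParDo.of(new WriteFn()));
    return PDone.in(input.getPipeline());
  }

  private static class WriteFn extends DoFn<String, Void> {
    private transient StoreClient client;

    @StartBundle
    public void startBundle() {
      client = StoreClient.connect(); // placeholder connection setup
    }

    @ProcessElement
    public void processElement(@Element String record) {
      client.write(record); // retried bundles may rewrite records
    }

    @FinishBundle
    public void finishBundle() {
      client.flush(); // make the bundle's writes durable before committing
    }
  }
}
```

Because a runner may retry a failed bundle, writes like this should be idempotent, or be made de-duplicable with additional transforms as the comment above suggests.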
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 180956)
    Time Spent: 50m  (was: 40m)

> Add page for developing I/O connectors for Java
> -----------------------------------------------
>
>                 Key: BEAM-6347
>                 URL: https://issues.apache.org/jira/browse/BEAM-6347
>             Project: Beam
>          Issue Type: Bug
>          Components: website
>            Reporter: Melissa Pashniak
>            Assignee: Melissa Pashniak
>            Priority: Minor
>          Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
