Repository: samza Updated Branches: refs/heads/master c1e03e65f -> 3f9bbb07b
Adding release notes for Samza 1.0 <img width="1280" alt="screen shot 2018-10-26 at 7 25 58 pm" src="https://user-images.githubusercontent.com/40646191/47598776-11921000-d956-11e8-8863-ce245e1d0cb2.png"> <img width="1280" alt="screen shot 2018-10-26 at 7 26 06 pm" src="https://user-images.githubusercontent.com/40646191/47598777-11921000-d956-11e8-884b-3b5acbdbfb94.png"> <img width="1280" alt="screen shot 2018-10-26 at 7 26 08 pm" src="https://user-images.githubusercontent.com/40646191/47598778-11921000-d956-11e8-9428-8a8f11f3d1e8.png"> <img width="1280" alt="screen shot 2018-10-26 at 7 26 10 pm" src="https://user-images.githubusercontent.com/40646191/47598779-11921000-d956-11e8-99ee-9c77349417f2.png"> <img width="1280" alt="screen shot 2018-10-26 at 7 26 13 pm" src="https://user-images.githubusercontent.com/40646191/47598780-122aa680-d956-11e8-9fbc-98527b099a6a.png"> <img width="1280" alt="screen shot 2018-10-26 at 7 26 17 pm" src="https://user-images.githubusercontent.com/40646191/47598781-122aa680-d956-11e8-8c81-b113fbddbd8a.png"> Author: [email protected] <[email protected]> Reviewers: Prateek Maheshwari <[email protected]> Closes #776 from rmatharu/releasenotes Project: http://git-wip-us.apache.org/repos/asf/samza/repo Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/3f9bbb07 Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/3f9bbb07 Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/3f9bbb07 Branch: refs/heads/master Commit: 3f9bbb07b562179704a953c4bf840c32f635a98e Parents: c1e03e6 Author: [email protected] <[email protected]> Authored: Tue Oct 30 15:54:25 2018 -0700 Committer: Prateek Maheshwari <[email protected]> Committed: Tue Oct 30 15:54:25 2018 -0700 ---------------------------------------------------------------------- docs/_releases/1.0.0.md | 500 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 500 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/samza/blob/3f9bbb07/docs/_releases/1.0.0.md ---------------------------------------------------------------------- diff --git a/docs/_releases/1.0.0.md b/docs/_releases/1.0.0.md new file mode 100644 index 0000000..c876404 --- /dev/null +++ b/docs/_releases/1.0.0.md @@ -0,0 +1,500 @@ +--- +version: '1.0' +order: 10 +layout: page +menu_title: '1.0' +title: Apache Samza 1.0 +--- +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + + +We're thrilled to announce to the release of Apache Samza 1.0. + +Today Samza forms the backbone of hundreds of real-time production +applications across a multitude of companies, such as LinkedIn, VMWare, +Slack, Redfin among many others. This release of Samza adds a variety of +features and capabilities to Samza's existing arsenal, coupled with new +and improved *documentation*, *code* snippets, *examples*, and a +brand-new *website design*!\ +Here are a few selected highlights: + +- **Stable** high level APIs that allow creating complex processing + pipelines with ease. + +- **Beam Samza Runner** now marries Beam's best in class support for + EventTime based windowed processing and sophisticated triggering + with Samza's stable and scalable stateful processing model. + +- **Table API** that provides a common abstraction for accessing + remote or local databases. Developers are now able to "join" an + input event stream with such a Table. + +- Integration **Test Framework** to enable effortless testing of Samza + jobs without deploying a Kafka, Yarn, or Zookeeper cluster. + +- Support for **Apache Log4j2** allowing improved logging performance, + customization, and efficiency. + +- **Upgraded Kafka** client and consumer. + +- An interactive **shell for Samza SQL** for seamless formulation, + development, and testing of SamzaSQL queries. + +- **Side-input** support that allows using log-compacted data sources + to populate KV state for Samza applications. + +- **An improved website** with detailed documentation and lots of code + samples! + +In addition, Samza 1.0 brings numerous bug-fixes, upgrades, and +improvements listed [below](#jiralist). + +## New features + +Samza 1.0 brings full-feature support for the following: + +### Improved Stable High Level APIs + +Samza 1.0 brings *Descriptor APIs* that allows applications to specify +their input and output *systems* and *streams* in code. Samza's new +*Context APIs* provide applications unified access to job-level, +container-level, task-level, and application-level context and +capabilities. This also simplifies Samza's *ApplicationRunner* +interface. + +This API evolution requires a few simple modifications to application +code, which we describe in detail in our [upgrade steps](#upgradesteps) + +#### Beam Runner Support + +Samza's Beam Runner enables executing Beam pipelines over Samza. This +enables Samza applications to create complex processing pipelines that +require event-time based processing, varying types of event-time based +windowing, and more. This feature is supported in both the YARN and +standalone deployment models. + +#### Joining Streams and Tables +Samza's Table API provides developers with unified access to local +and remote data sources such as *Key-Value stores* or *web services*, +while providing features such as *rate-limiting, throttling,* and +*caching* capabilities. This provides first-class API primitives for +building Stream-Table join jobs. Learn more about the use, semantics, +and examples for Table API [here](http://TODO). + +#### Test Samza without *ZK, Yarn* or *Kafka* + +Samza 1.0 brings a test framework that allows testing Samza applications +using *in-memory* input and output. Users can now setup test and +testing pipelines for their applications without needing to setup any +other services, such as Kafka, YARN, or Zookeeper. + +#### Log4J2 support + +Samza now supports Apache Log4j 2 for system and application logging. +Log4j 2 is an upgrade to Log4j that provides significant improvements +over its predecessor, Log4j 1.x, such as better throughput and latency, +custom log levels, and a pluggable logging architecture. + +#### Kafka upgrade + +This release upgrades Samza to use Kafka's high-level consumer (Kafka +v0.11.1.62). This brings latency and throughput benefits for Samza +applications that consume from Kafka, in addition to bug-fixes. This +also means Samza applications can now better their utilization of the +underlying Kafka cluster. + +#### SamzaSQL Shell + +SamzaSQL now provides a shell for users to type-in their SQL queries, +while Samza does the heavy-lifting of wiring the inputs and outputs, and +sizing the application in the background. This is great for testing and +experimenting with queries while formulating your application-logic, +specially suited for data-scientists and tinkerers. + +#### Side-inputs + +Samza 1.0 brings the ability to leverage existing log-compacted data +sources (e.g., Kafka topics) to populate KV state for Samza +applications. If your data processing pipeline involves Hadoop-to-Kafka +push, this feature alleviates the need for your Samza job to create +separate Kafka-topics to back KV state. + +#### Improved website, documentation, and samples + +We've re-designed the Samza website making it easier to find details on +key Samza concepts and patterns. All documentation has been revised and +rewritten, keeping in mind the feedback we got from our customers. We've +revised and added sample application code to showcase Samza 1.0 and the +use of its new APIs. + +<!--------------------------------------------------------------------------> +## <a name="jiralist"></a> Enhancements and Upgrades + +This release brings the following enhancements, upgrades, and +capabilities: + +##### New and improved documentation, code snippets, and examples + +### API enhancements and simplifications + + +SAMZA-1789: unify ApplicationDescriptor and ApplicationRunner for high- +and low-level APIs in YARN and standalone environment + +SAMZA-1804: System and stream descriptors + +SAMZA-1858: Public APIs for shared context + +SAMZA-1763: Add async methods to Table API + +SAMZA-1786: Introduce the metadata store abstraction + +SAMZA-1859: Zookeeper implementation of MetadataStore + +SAMZA-1788: Add the LocationIdProvider abstraction + +### Upgrades and Bug-fixes +SAMZA-1768: Handle corrupted OFFSET file + +SAMZA-1817: Long classpath support for non-split deployments +SAMZA-1719: Add caching support to table-API + +SAMZA-1783: Add Log4j2 functionality in Samza + +SAMZA-1868: Refactor KafkaSystemAdmin from using SimpleConsumer + +SAMZA-1776: Refactor KafkaSystemConsumer to remove the usage of +deprecated SimpleConsumer client + +SAMZA-1730: Adding state validation in StreamProcessor before any +lifecycle operation and group coordination + +SAMZA-1695: Clear events in ScheduleAfterDebounceTime on session +expiration + +SAMZA-1647: Fix race conditions in StreamProcessor + +SAMZA-1371: Some Samza Containers get stuck at \"Starting BrokerProxy\" + +**Performance and Testing** + +SAMZA-1648: Integration Test Framework & Collection Stream Impl + +SAMZA-1748: Failure tests in the standalone deployment + +A source download of Samza 1.0 is available [here](http://TODO), and is +also available in Apache's Maven repository. + +**Community Developments**\ +A [symposium](https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/251481797) +on Stream processing with Apache Samza and Apache Kafka was held on July +19th and on October 23rd. Both were attended by more than 350 +participants from across the industry. It featured in-depth talks on +Samza's Beam integration, its use at LinkedIn for real-time +notifications, a talk on Kafka-replication at Uber, and Kafka cruise +control, and many others. + +Samza was also the focus of a talk at [Strange Loop'18](https://www.youtube.com/watch?v=2y8QImf-RpI), +focussing in depth on its scalability, performance, extensibility, and +programmability. + +## <a name="upgradesteps"></a> Upgrading your application to Apache Samza 1.0 + +Congratulations on your decision to upgrade to Samza 1.0! + +First, please follow the upgrade steps depending on if you've been using the +low-level (i.e., Task) API or the high-level (i.e., the fluent) Samza API. + +### Overview of New APIs + +Instead of configuration, input and output systems are now defined using +*SystemDescriptors.* Each SystemDescriptor encapsulates the vagaries of +a particular type of system, e.g., Kafka, EventHub, Brooklin, etc. These +system descriptors are then used to get *StreamDescriptors* for input +and output streams, instead of \<however we think this was done +before\>. Each *StreamDescriptor* encapsulates vagaries of the stream, +such as its key-value spec, the serde it requires, etc. + +Applications access all Context-based objects, such as Stores, +Configuration, MetricsRegistry, Tables, TaskModel, CallbackScheduler, +etc, via a unified Context-API. + +**Upgrade Steps for Low-level API Applications** + +1. Update your application to use Samza version 1.0.0. + +2. Samza 1.0's Context API necessitates some minimal changes to the init method + of your Task class (the one that implements the **StreamTask** + interface). + + The signature of the init method now has a single Context argument. + + So, to obtain the, TaskModel and details (e.g., TaskName) in your task's init, + use context.getTaskContext().getTaskModel() + + For the MetricsRegistry for the task, use context.getTaskContext().getTaskMetricsRegistry() + + For the Application's configuration, use context.getJobContext().getConfig() + + For KV stores, use context.getTaskContext().getStore(storeName) + + For Table objects, use context.getTaskContext().getTable(tableId) + + For scheduling a timer-callback from your task, use context.getTaskContext(). + getCallbackScheduler().scheduleCallback(key, timestamp, callback). + + For the ContainerModel, use context.getContainerContext().getContainerModel() + + For the MetricsRegistry for the container, use context.getContainerContext().getContainerMetricsRegistry() + +3. In Samza 1.0, a Samza application's input, output, and + processing-task should be specified in code, rather than in + config. To do that, create a class (say MySamzaTaskApplication) + that implements the **TaskApplication** interface. In the + **describe()** method of this class do the following: + + a. For each system your job inputs or outputs to (e.g., Kafka), define a **SystemDescriptor.** + + For example, for Kafka, + + ``` + KafkaSystemDescriptor kafkaSystemDescriptor = new KafkaSystemDescriptor("kafka"); + ``` + + Similarly, for other systems such as EventHub, use + + ``` + EventHubsSystemDescriptor eventHubsSystemDescriptor = new EventHubsSystemDescriptor("eventhub") + ``` + + For HDFS use a GenericSystemDescriptor, + + ``` + GenericSystemDescriptor hdfsDescriptor = new GenericSystemDescriptor("hdfs", "org.apache.samza.system.hdfs.HdfsSystemFactory"); + ``` + + You may choose to use config variables for the system-names and the system-factories. + + b. *For each input* of your job, create an input descriptor, + + For example, for Kafka input, + + ``` + KafkaInputDescriptor<KV<String, PageViewEvent>> inputDescriptor = kafkaSystemDescriptor.getInputDescriptor("streamID", new JsonSerde<>()); + ``` + + For EventHubs input, + + ``` + EventHubsInputDescriptor eventHubsInputDescriptor = eventHubsSystemDescriptor.getInputDescriptor("streamID", "namespace", "entityPath", new MyFavouriteSerde()); + ``` + + For other input types simply use a GenericInputDescriptor, see samza-hello-world for examples for details on usage. + + c. *For each output* of your job, create an output descriptor, + + For example, for Kafka output, + + ``` + KafkaOutputDescriptor kafkaOutputDescriptor = kafkaSystemDescriptor.getOutputDescriptor("streamID", new JsonSerde<>()); + ``` + + For EventHub output, + + ``` + EventHubsOutputDescriptor eventHubsOutputDescriptor = eventHubsSystemDescriptor.getOutputDescriptor("streamID", "namespace", "entityPath", new MyFavouriteSerde()); + ``` + + For other output types simply use a GenericInputDescriptor, see + samza-hello-world for examples and details on usage. + + d. Add the input and output descriptors by, + + ``` + appDesc.withInputStream(inputDescriptor); + appDesc.withOutputStream(outputDescriptor); + ``` + + e. Set the Task factory and point it to your pre-existing Task class. + + ``` + appDesc.withTaskFactory((StreamTaskFactory) () = new MyExistingSamzaTask()); + ``` + +4. In your application's config *set the following config* to point to the + TaskApplication class implemented above: + + ``` + <property name="app.class" value="class-path-to-TaskApplication-class"> + ``` + + Note that this config is called **app.class** and **not** task.class. + +5. In your application's config set the following *task.systems* config. This + config should be a union of all input and output systems used by + your application. + + For example, an application that reads or writes from *kafka*, + *wikipedia* should use: + + ``` + <property name="task.systems" value="kafka, wikipedia"/> + ``` + Note that, the system names used here *should* **match the system names** used when creating the *SystemDescriptor* above in 3.a. + +6. *Remove* the *task.class* and *task.inputs* configs from your job's configuration. These configs are *not needed* in Samza 1.0. + +**Upgrade Steps for** **High-level API** **Applications** + +1. Update your application to use samza version 1.0.0. + +2. In Samza 1.0, a Samza application's input, output, and + processing-task is specified in code, rather than in config. So to + do that, update your **StreamApplication** class: + + a. *For each system* your job inputs or outputs to, define a **SystemDescriptor**, + For each system your job inputs or outputs to (e.g., Kafka), define a **SystemDescriptor.** + + For example, for Kafka, + + ``` + KafkaSystemDescriptor kafkaSystemDescriptor = new KafkaSystemDescriptor("kafka"); + ``` + + Similarly, for other systems such as EventHub, use + + ``` + EventHubsSystemDescriptor eventHubsSystemDescriptor = new EventHubsSystemDescriptor("eventhub") + ``` + + For HDFS use a GenericSystemDescriptor, + + ``` + GenericSystemDescriptor hdfsDescriptor = new GenericSystemDescriptor("hdfs", "org.apache.samza.system.hdfs.HdfsSystemFactory"); + ``` + You may choose to use config variables for the system-names and the system-factories. + + b. *For each input* of your job, define an input descriptor, + For example, for Kafka input, + + ``` + KafkaInputDescriptor<KV<String, PageViewEvent>> inputDescriptor = kafkaSystemDescriptor.getInputDescriptor("streamID", new JsonSerde<>()); + ``` + + For EventHubs input, + + ``` + EventHubsInputDescriptor eventHubsInputDescriptor = eventHubsSystemDescriptor.getInputDescriptor("streamID", "namespace", "entityPath", new MyFavouriteSerde()); + ``` + + For other input types simply use a GenericInputDescriptor, see samza-hello-world for examples for details on usage. + + c. *For each output* of your job, define an output descriptor, + + For example, for Kafka output, + + ``` + KafkaOutputDescriptor kafkaOutputDescriptor = kafkaSystemDescriptor.getOutputDescriptor("streamID", new JsonSerde<>()); + ``` + + For EventHub output, + + ``` + EventHubsOutputDescriptor eventHubsOutputDescriptor = eventHubsSystemDescriptor.getOutputDescriptor("streamID", "namespace", "entityPath", new MyFavouriteSerde()); + ``` + + For other output types simply use a GenericInputDescriptor, see + samza-hello-world for examples and details on usage. + + d. Obtain the input and output stream objects by using the input and output descriptors, + + For example, + + ``` + MessageStream<PageViewEvent> inputStream = appDesc.getInputStream(pageViewInputDescriptor) + ``` + + Or, + + + ``` + MessageStream<GenericRecord> outputStream = appDesc.getOutputStream(outputDescriptor) + ``` + + The type of the MessageStream will depend on the type used in the respective input or output descriptor. + + e. Update your fluent expressions to use the input and output streams from d. + +3. In Samza 1.0, the framework creates a computation graph of each + application, based on the application's fluent API logic. + Therefore, all *operators* and *user-defined functions (UDFs)* + used in an app's *describe()* method need to be specified in + static classes and methods. + +4. Add *task.systems* to your application configuration. This + config should be a union of all input and output systems used by + your application. + + For example, an application that reads or writes from *kafka*, + *wikipedia* should use: + + ``` + <property name="task.systems" value="kafka, wikipedia"/> + ``` + Note that, the system names used here *should* **match the system names** + used when creating the *SystemDescriptor* above in 2. + +5. *Remove* the *task.inputs* configs from your job's configuration. + These configs are *not needed* in Samza 1.0. + +6. If you are implementing InitableFunction, then you will need to + make a change, since it now only has a single Context argument. + + a. Samza 1.0's context API necessitates some minimal changes to the init method. + The signature of the init method now has a single Context argument. + + So, to obtain the, TaskModel and details (e.g., TaskName) in your task's init, + use context.getTaskContext().getTaskModel() + + For the MetricsRegistry for the task, use context.getTaskContext().getTaskMetricsRegistry() + + For the Application's configuration, use context.getJobContext().getConfig() + + For KV stores, use context.getTaskContext().getStore(storeName) + + For Table objects, use context.getTaskContext().getTable(tableId) + + For scheduling a timer-callback from your task, use context.getTaskContext(). + getCallbackScheduler().scheduleCallback(key, timestamp, callback). + + For the ContainerModel, use context.getContainerContext().getContainerModel() + + For the MetricsRegistry for the container, use context.getContainerContext().getContainerMetricsRegistry() + + b. If you are using ContextManager, then you will need to use + ApplicationContainerContextFactory and/or + ApplicationTaskContextFactory instead. You can access your + application-defined context(s) in InitiableFunction.init + through the Context\#getApplicationContainerContext and/or + Context\#getApplicationTaskContext. + + + +**Upgrade Steps for Samza SQL** + +No changes are required for upgrading your application Samza 1.0.
