NicoK commented on a change in pull request #9210: [FLINK-12746][docs] Getting 
Started - DataStream Example Walkthrough

 File path: docs/getting-started/walkthroughs/
 @@ -0,0 +1,897 @@
+title: "DataStream API"
+nav-id: datastreamwalkthrough
+nav-title: 'DataStream API'
+nav-parent_id: walkthroughs
+nav-pos: 2
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+Apache Flink offers a DataStream API for building robust, stateful streaming applications.
+It provides fine-grained control over state and time, which allows for the 
implementation of complex event-driven systems.
+* This will be replaced by the TOC
+## What Are You Building? 
+Credit card fraud is a growing concern in the digital age.
+Criminals steal credit card numbers by running scams or hacking into insecure systems.
+Stolen numbers are tested by making one or more small purchases, often for a 
dollar or less.
+If that works, they then make more significant purchases to get items they can 
sell or keep for themselves.
+In this tutorial, you will build a fraud detection system for alerting on 
suspicious credit card transactions.
+Using a simple set of rules, you will see how Flink allows you to implement 
advanced business logic and act in real time.
+## Prerequisites
+This walkthrough assumes that you have some familiarity with Java or Scala, 
but you should be able to follow along even if you are coming from a different 
programming language.
+## Help, I’m Stuck! 
+If you get stuck, check out the [community support 
+In particular, Apache Flink's [user mailing 
list]( is consistently 
ranked as one of the most active of any Apache project and a great way to get 
help quickly.
+## How To Follow Along
+If you want to follow along, you will require a computer with:
+* Java 8 
+* Maven 
+A provided Flink Maven Archetype will create a skeleton project with all the 
necessary dependencies quickly:
+<div class="codetabs" markdown="1">
+<div data-lang="java" markdown="1">
+{% highlight bash %}
+$ mvn archetype:generate \
+    -DarchetypeGroupId=org.apache.flink \
+    -DarchetypeArtifactId=flink-walkthrough-datastream-java \{% unless 
site.is_stable %}
 \{% endunless %}
+    -DarchetypeVersion={{ site.version }} \
+    -DgroupId=frauddetection \
+    -DartifactId=frauddetection \
+    -Dversion=0.1 \
+    -Dpackage=spendreport \
+    -DinteractiveMode=false
+{% endhighlight %}
+<div data-lang="scala" markdown="1">
+{% highlight bash %}
+$ mvn archetype:generate \
+    -DarchetypeGroupId=org.apache.flink \
+    -DarchetypeArtifactId=flink-walkthrough-datastream-scala \{% unless 
site.is_stable %}
 \{% endunless %}
+    -DarchetypeVersion={{ site.version }} \
+    -DgroupId=frauddetection \
+    -DartifactId=frauddetection \
+    -Dversion=0.1 \
+    -Dpackage=spendreport \
+    -DinteractiveMode=false
+{% endhighlight %}
+{% unless site.is_stable %}
+<p style="border-radius: 5px; padding: 5px" class="bg-danger">
+    <b>Note</b>: For Maven 3.0 or higher, it is no longer possible to specify 
the repository (-DarchetypeCatalog) via the command line. If you wish to use the 
snapshot repository, you need to add a repository entry to your settings.xml. 
For details about this change, please refer to <a 
 official document</a>
+{% endunless %}
+You can edit the `groupId`, `artifactId` and `package` if you like. With the 
above parameters,
+Maven will create a project with all the dependencies to complete this tutorial.
+After importing the project into your editor, you will see a file with the 
following code which you can run directly inside your IDE.
+<div class="codetabs" markdown="1">
+<div data-lang="java" markdown="1">
+{% highlight java %}
+package spendreport;
+import org.apache.flink.api.common.state.ValueState;
+import org.apache.flink.api.common.state.ValueStateDescriptor;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
+import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
+import org.apache.flink.util.Collector;
+import org.apache.flink.walkthrough.common.entity.Alert;
+import org.apache.flink.walkthrough.common.entity.Transaction;
+import org.apache.flink.walkthrough.common.sink.AlertSink;
+import org.apache.flink.walkthrough.common.source.TransactionSource;
+public class FraudDetectionJob {
+    public static void main(String[] args) throws Exception {
+        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+        DataStream<Transaction> transactions = env
+            .addSource(new TransactionSource())
+            .name("transactions");
+        DataStream<Alert> alerts = transactions
+            .keyBy(Transaction::getAccountId)
+            .process(new FraudDetector())
+            .name("fraud-detector");
+        alerts
+            .addSink(new AlertSink())
+            .name("send-alerts");
+        env.execute("Fraud Detection");
+    }
+}
+{% endhighlight %}
+{% highlight java %}
+public class FraudDetector extends KeyedProcessFunction<Long, Transaction, 
Alert> {
+    public static final double SMALL_AMOUNT = 0.10;
+    public static final double LARGE_AMOUNT = 500.00;
+    public static final long ONE_DAY = 24 * 60 * 60 * 1000;
+    @Override
+    public void processElement(
+        Transaction transaction,
+        Context context,
+        Collector<Alert> collector) throws Exception {
+        Alert alert = new Alert();
+        alert.setId(transaction.getAccountId());
+        collector.collect(alert);
+    }
+}
+{% endhighlight %}
+<div data-lang="scala" markdown="1">
+#### FraudDetectionJob.scala
+{% highlight scala %}
+package spendreport
+import org.apache.flink.api.common.state.ValueState
+import org.apache.flink.api.common.state.ValueStateDescriptor
+import org.apache.flink.api.common.typeinfo.Types
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.streaming.api.functions.KeyedProcessFunction
+import org.apache.flink.streaming.api.scala._
+import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction
+import org.apache.flink.util.Collector
+import org.apache.flink.walkthrough.common.entity.Alert
+import org.apache.flink.walkthrough.common.entity.Transaction
+import org.apache.flink.walkthrough.common.sink.AlertSink
+import org.apache.flink.walkthrough.common.source.TransactionSource
+object FraudDetectionJob {
+    def main(args: Array[String]): Unit = {
+        val env = StreamExecutionEnvironment.getExecutionEnvironment
+        val transactions = env
+            .addSource(new TransactionSource)
+            .name("transactions")
+        val alerts = transactions
+            .keyBy(transaction => transaction.getAccountId)
+            .process(new FraudDetector)
+            .name("fraud-detector")
+        alerts
+            .addSink(new AlertSink)
+            .name("send-alerts")
+        env.execute("Fraud Detection")
+    }
+}
+{% endhighlight %}
+#### FraudDetector.scala
+{% highlight scala %}
+object FraudDetector {
+    val SMALL_AMOUNT = 0.10
+    val LARGE_AMOUNT = 500.00
+    val ONE_DAY = 24 * 60 * 60 * 1000L
+}
+class FraudDetector extends KeyedProcessFunction[Long, Transaction, Alert] {
+    override def processElement(
+        transaction: Transaction,
+        context: KeyedProcessFunction[Long, Transaction, Alert]#Context,
+        collector: Collector[Alert]): Unit = {
+        val alert = new Alert
+        alert.setId(transaction.getAccountId)
+        collector.collect(alert)
+    }
+}
+{% endhighlight %}
+## Breaking Down The Code
+#### The Execution Environment
+The first line sets up your `StreamExecutionEnvironment`.
+The execution environment is how you can set properties for your Job and 
create your sources.
+<div class="codetabs" markdown="1">
+<div data-lang="java" markdown="1">
+{% highlight java %}
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+{% endhighlight %}
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+val env = StreamExecutionEnvironment.getExecutionEnvironment
+{% endhighlight %}
+#### Creating A Source
+Sources ingest data from external systems, such as Apache Kafka, RabbitMQ, or 
Apache Pulsar, into Flink Jobs.
+This walkthrough uses a source that generates an infinite stream of credit 
card transactions for you to process.
+Each transaction contains an account ID (`accountId`), timestamp (`timestamp`) 
of when the transaction occurred, and US$ amount (`amount`).
+The `name` attached to the source is just for debugging purposes, so if 
something goes wrong, you will know where the error originated.
+<div class="codetabs" markdown="1">
+<div data-lang="java" markdown="1">
+{% highlight java %}
+DataStream<Transaction> transactions = env
+    .addSource(new TransactionSource())
+    .name("transactions");
+{% endhighlight %}
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+val transactions = env
+    .addSource(new TransactionSource)
+    .name("transactions")
+{% endhighlight %}
+#### Partitioning Events & Detecting Fraud
+The stream contains transactions from a large number of users; however, fraud 
occurs on a per-account basis. To detect fraud, you must ensure that the same 
instance of the fraud detector processes every event for a given account.
+Streams can be partitioned using `DataStream#keyBy` to ensure that the same 
physical operator instance processes all records for a particular key.
+<div class="codetabs" markdown="1">
+<div data-lang="java" markdown="1">
+{% highlight java %}
+DataStream<Alert> alerts = transactions
+    .keyBy(Transaction::getAccountId)
+    .process(new FraudDetector())
+    .name("fraud-detector");
+{% endhighlight %}
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+val alerts = transactions
+    .keyBy(transaction => transaction.getAccountId)
+    .process(new FraudDetector)
+    .name("fraud-detector")
+{% endhighlight %}
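As a rough illustration of why keying matters, the sketch below routes account IDs to a fixed number of parallel instances in plain Java, without Flink. The class `KeyRoutingSketch`, the method `instanceFor`, and the hash-modulo scheme are assumptions made for this example; Flink's actual key distribution is handled internally and differs in detail. The only property the sketch is meant to show is that equal keys always land on the same instance.

```java
// Simplified, hypothetical stand-in for keyed partitioning; not Flink's real routing.
class KeyRoutingSketch {

    // Maps a key to one of `parallelism` instances.
    // The result depends only on the key, so the same key always gets the same index.
    static int instanceFor(long accountId, int parallelism) {
        return Math.floorMod(Long.hashCode(accountId), parallelism);
    }

    public static void main(String[] args) {
        long[] accountIds = {1L, 2L, 1L, 3L, 2L, 1L};
        for (long id : accountIds) {
            System.out.println("account " + id + " -> instance " + instanceFor(id, 4));
        }
    }
}
```

Because the mapping depends only on the key, any per-account state kept by one instance never has to be shared with another.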
+#### Outputting Results
+Sinks connect Flink Jobs to external systems, such as Apache Kafka, 
Cassandra, or AWS Kinesis, to output events.
+The `AlertSink` logs each alert with log level **INFO**, instead of writing to 
persistent storage, so you can easily see your results.
+<div class="codetabs" markdown="1">
+<div data-lang="java" markdown="1">
+{% highlight java %}
+alerts.addSink(new AlertSink());
+{% endhighlight %}
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+alerts.addSink(new AlertSink)
+{% endhighlight %}
+#### Executing The Job
+Flink applications are built lazily and shipped to the cluster for execution 
only once fully formed.
+Call `StreamExecutionEnvironment#execute` and give the Job a name to begin 
its execution.
+<div class="codetabs" markdown="1">
+<div data-lang="java" markdown="1">
+{% highlight java %}
+env.execute("Fraud Detection");
+{% endhighlight %}
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+env.execute("Fraud Detection")
+{% endhighlight %}
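The idea of building a job lazily can be pictured with a small plain-Java sketch. `LazyPipelineSketch` and its methods are hypothetical names invented for this illustration and are not part of Flink; the sketch only assumes that declaring a step records it, and that nothing runs until `execute` is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Hypothetical stand-in for a lazily built job; not a Flink class.
class LazyPipelineSketch {
    private final List<IntUnaryOperator> steps = new ArrayList<>();
    private int stepsRun = 0;

    // Records a step without running it.
    LazyPipelineSketch step(IntUnaryOperator fn) {
        steps.add(fn);
        return this;
    }

    // Runs every recorded step in order; nothing executes before this call.
    int execute(int input) {
        int value = input;
        for (IntUnaryOperator fn : steps) {
            value = fn.applyAsInt(value);
            stepsRun++;
        }
        return value;
    }

    // Number of steps that have actually been executed so far.
    int stepsRun() {
        return stepsRun;
    }

    public static void main(String[] args) {
        LazyPipelineSketch pipeline = new LazyPipelineSketch()
            .step(x -> x * 2)
            .step(x -> x + 1);
        System.out.println("steps run before execute: " + pipeline.stepsRun());
        System.out.println("result: " + pipeline.execute(5));
        System.out.println("steps run after execute: " + pipeline.stepsRun());
    }
}
```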
+#### The Fraud Detector
+The logic for the fraud detector is encapsulated within a `KeyedProcessFunction`.
+This first version outputs an alert on every transaction, which some may say 
is overly conservative.
+It also includes several constants that you may find helpful as you work 
through your implementation.
+<div class="codetabs" markdown="1">
+<div data-lang="java" markdown="1">
+{% highlight java %}
+public class FraudDetector extends KeyedProcessFunction<Long, Transaction, 
Alert> {
+    public static final double SMALL_AMOUNT = 0.10;
+    public static final double LARGE_AMOUNT = 500.00;
+    public static final long ONE_DAY = 24 * 60 * 60 * 1000;
+    @Override
+    public void processElement(
+        Transaction transaction,
+        Context context,
+        Collector<Alert> collector) throws Exception {
+        Alert alert = new Alert();
+        alert.setId(transaction.getAccountId());
+        collector.collect(alert);
+    }
+}
+{% endhighlight %}
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+object FraudDetector {
+    val SMALL_AMOUNT = 0.10
+    val LARGE_AMOUNT = 500.00
+    val ONE_DAY = 24 * 60 * 60 * 1000L
+}
+class FraudDetector extends KeyedProcessFunction[Long, Transaction, Alert] {
+    override def processElement(
+        transaction: Transaction,
+        context: KeyedProcessFunction[Long, Transaction, Alert]#Context,
+        collector: Collector[Alert]): Unit = {
+        val alert = new Alert
+        alert.setId(transaction.getAccountId)
+        collector.collect(alert)
+    }
+}
+{% endhighlight %}
+## Writing An Initial Application 
+For the initial implementation, the fraud detector should output an alert for 
any account that makes a small transaction immediately followed by a large one, 
where small is anything less than $0.10 and large is anything more than $500.
+Imagine your fraud detector processes the following stream of transactions for 
a particular account.
+<p class="text-center">
+    <img alt="Transactions" width="80%" src="{{ site.baseurl 
+Transactions 3 and 4 should be marked as fraudulent because they form a small 
transaction, $0.09, immediately followed by a large one, $510.
+By contrast, transactions 7, 8, and 9 are not fraud because the small amount 
of $0.02 is not immediately followed by the large one; instead, there is an 
intermediate transaction that breaks the pattern.
+To do this, the fraud detector must _remember_ information across events; a 
large transaction is only fraudulent if the previous one was small.
+Remembering information across events requires [state]({{ site.baseurl 
}}/concepts/glossary.html#managed-state) and so you will implement your fraud 
detector using a [KeyedProcessFunction]({{ site.baseurl 
}}/dev/stream/operators/process_function.html), which provides fine-grained 
control over state and time.
+The most straightforward implementation would be to set a flag whenever a 
small transaction is processed.
+This way, when a large transaction comes through, you can check if the flag is 
set for that account.
+If it is, the transaction is considered fraudulent and the detector outputs an alert.
+This flag is what you want to store in Flink state.
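Before moving the flag into Flink state, the rule itself can be sketched in plain Java. The class `SmallThenLargeSketch`, the method `detect`, and the sample amounts other than $0.09, $510, and $0.02 are made up for this illustration; a `HashSet` stands in for the per-account flag and offers none of the fault tolerance that Flink state provides.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java stand-in for the flag rule; the set plays the role of per-account state.
class SmallThenLargeSketch {
    static final double SMALL_AMOUNT = 0.10;   // "small" threshold as stated in the text
    static final double LARGE_AMOUNT = 500.00; // "large" threshold as stated in the text

    // Returns the account ID of every transaction that should raise an alert, in order.
    static List<Long> detect(long[] accountIds, double[] amounts) {
        Set<Long> lastWasSmall = new HashSet<>();
        List<Long> alerts = new ArrayList<>();
        for (int i = 0; i < accountIds.length; i++) {
            long id = accountIds[i];
            // remove() reports whether the flag was set and clears it in one step
            boolean flagged = lastWasSmall.remove(id);
            if (flagged && amounts[i] > LARGE_AMOUNT) {
                alerts.add(id);
            }
            if (amounts[i] < SMALL_AMOUNT) {
                lastWasSmall.add(id);
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        // Account 1: small then large. Account 2: small, intermediate, then large.
        long[] ids = {1L, 2L, 1L, 2L, 2L};
        double[] amounts = {0.09, 0.02, 510.00, 30.01, 701.83};
        System.out.println(detect(ids, amounts));
    }
}
```

With this input only account 1 is reported, because the intermediate transaction on account 2 clears its flag before the large amount arrives.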
+The most basic type of state in Flink is [ValueState]({{ site.baseurl 
}}/dev/stream/state/state.html#using-managed-keyed-state), a data type that 
provides _fault tolerant_, _managed_, _per key_ state.
+`ValueState` is created using a `ValueStateDescriptor` which contains metadata 
about how it should be managed.
+<div class="codetabs" markdown="1">
+<div data-lang="java" markdown="1">
+{% highlight java %}
+public class FraudDetector extends KeyedProcessFunction<Long, Transaction, 
Alert> {
+    public static final double SMALL_AMOUNT = 0.10;
+    public static final double LARGE_AMOUNT = 500.00;
+    public static final long ONE_DAY = 24 * 60 * 60 * 1000;
+    private transient ValueState<Boolean> flagState;
+    @Override
+    public void open(Configuration parameters) throws Exception {
+        ValueStateDescriptor<Boolean> flagDescriptor = new ValueStateDescriptor<>(
+            "flag",
+            Types.BOOLEAN);
+        flagState = getRuntimeContext().getState(flagDescriptor);
+    }
+{% endhighlight %}
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+class FraudDetector extends KeyedProcessFunction[Long, Transaction, Alert] {
+    @transient private var flagState: ValueState[java.lang.Boolean] = _
+    @throws[Exception]
+    override def open(parameters: Configuration): Unit = {
+        val flagDescriptor = new ValueStateDescriptor("flag", Types.BOOLEAN)
+        flagState = getRuntimeContext.getState(flagDescriptor)
+    }
+{% endhighlight %}
+`ValueState` is a wrapper class, similar to `AtomicReference` in the Java 
standard library.
+It provides three methods for interacting with its contents: `update` sets the 
state, `value` gets the current value, and `clear` deletes its contents.
+Otherwise, fault tolerance is managed automatically under the hood, and so you 
can interact with it like any standard variable.
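To make the `AtomicReference` comparison concrete, here is a minimal plain-Java sketch of the same three interactions. The class `ValueStateAnalogy` and its methods are hypothetical, and the mapping of `update`, `value`, and `clear` onto `set`, `get`, and `set(null)` is only an analogy: an `AtomicReference` is neither fault tolerant nor scoped to a key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Analogy only: AtomicReference stands in for ValueState in this sketch.
class ValueStateAnalogy {

    // Reads the reference the way value() reads state; null means nothing is stored.
    static String describe(AtomicReference<Boolean> flag) {
        Boolean current = flag.get();
        return current == null ? "empty" : current.toString();
    }

    // Walks through set, read, and clear, recording what each read observes.
    static List<String> lifecycle() {
        List<String> observed = new ArrayList<>();
        AtomicReference<Boolean> flag = new AtomicReference<>(); // starts out holding null
        observed.add(describe(flag));
        flag.set(true);  // plays the role of update(true)
        observed.add(describe(flag));
        flag.set(null);  // plays the role of clear()
        observed.add(describe(flag));
        return observed;
    }

    public static void main(String[] args) {
        System.out.println(lifecycle());
    }
}
```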
+Below, you can see an example of how you can use a flag state to track 
potentially fraudulent transactions.
+<div class="codetabs" markdown="1">
+<div data-lang="java" markdown="1">
+{% highlight java %}
+    @Override
+    public void processElement(
+        Transaction transaction,
+        Context context,
+        Collector<Alert> collector) throws Exception {
+        // Get the current state for the current key
+        Boolean lastTransactionWasSmall = flagState.value();
+        if (lastTransactionWasSmall != null) {
+            if (transaction.getAmount() > LARGE_AMOUNT) {
+                // Output an alert downstream
+                Alert alert = new Alert();
+                alert.setId(transaction.getAccountId());
+                collector.collect(alert);
+            }
+        }
+        // clean up our state
+        flagState.clear();
+        if (transaction.getAmount() < SMALL_AMOUNT) {
+            // set the flag to true
+            flagState.update(true);
+        }
+    }
+{% endhighlight %}
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+    override def processElement(
+        transaction: Transaction,
+        context: KeyedProcessFunction[Long, Transaction, Alert]#Context,
+        collector: Collector[Alert]): Unit = {
+        // Get the current state for the current key
+        val lastTransactionWasSmall = flagState.value
+        if (lastTransactionWasSmall != null) {
+            if (transaction.getAmount > FraudDetector.LARGE_AMOUNT) {
+                // Output an alert downstream
+                val alert = new Alert
+                alert.setId(transaction.getAccountId)
+                collector.collect(alert)
+            }
+        }
+        // clean up our state
+        flagState.clear()
+        if (transaction.getAmount < FraudDetector.SMALL_AMOUNT) {
+            // set the flag to true
+            flagState.update(true)
+        }
+    }
+{% endhighlight %}
+For every transaction, the fraud detector checks the state of the flag for 
that account.
+Remember, `ValueState` is always scoped to the current key.
+If the flag is non-null, then the last transaction seen for that key was 
small, and so if the amount for this transaction is large, then the detector 
outputs a fraud alert.
 Review comment:
   Maybe start here to explain what `null` means here and how we use it.
