[
https://issues.apache.org/jira/browse/FLINK-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185070#comment-15185070
]
ASF GitHub Bot commented on FLINK-3591:
---------------------------------------
Github user vasia commented on a diff in the pull request:
https://github.com/apache/flink/pull/1774#discussion_r55374388
--- Diff: docs/quickstart/run_example_quickstart.md ---
@@ -27,116 +27,360 @@ under the License.
* This will be replaced by the TOC
{:toc}
-This guide walks you through the steps of executing an example program
([K-Means clustering](http://en.wikipedia.org/wiki/K-means_clustering)) on
Flink.
-On the way, you will see the a visualization of the program, the optimized
execution plan, and track the progress of its execution.
+In this guide we will start from scratch and fo from setting up a Flink
project and running
+a streaming analysis program on a Flink cluster.
+
+Wikipedia provides an IRC channel where all edits to the wiki are logged.
We are going to
+read this channel in Flink and count the number of bytes that each user
edits within
+a given window of time. This is easy enough to implement in a few minutes
using Flink but it will
+give you a good foundation from which to start building more complex
analysis programs on your own.
+
+## Setting up a Maven Project
+
+We are going to use a Flink Maven Archetype for creating our project
stucture. Please
+see [Java API Quickstart]({{ site.baseurl
}}/quickstart/java_api_quickstart.html) for more details
+about this. For our purposes, the command to run is this:
+
+{% highlight bash %}
+$ mvn archetype:generate\
+ -DarchetypeGroupId=org.apache.flink\
+ -DarchetypeArtifactId=flink-quickstart-java\
+ -DarchetypeVersion=1.0.0\
+ -DgroupId=wiki-edits\
+ -DartifactId=wiki-edits\
+ -Dversion=0.1\
+ -Dpackage=wikiedits\
+ -DinteractiveMode=false\
+{% endhighlight %}
+
+You can edit the `groupId`, `artifactId` and `package` if you like. With
the above parameters,
+maven will create a project structure that looks like this:
+
+{% highlight bash %}
+$ tree wiki-edits
+wiki-edits/
+├── pom.xml
+└── src
+ └── main
+ ├── java
+ │ └── wikiedits
+ │ ├── Job.java
+ │ ├── SocketTextStreamWordCount.java
+ │ └── WordCount.java
+ └── resources
+ └── log4j.properties
+{% endhighlight %}
+
+There is our `pom.xml` file that already has the Flink dependencies added
in the root directory and
+several example Flink programs in `src/main/java`. We can delete the
example programs, since
+we are going to start from scratch:
+
+{% highlight bash %}
+$ rm wiki-edits/src/main/java/wikiedits/*.java
+{% endhighlight %}
+
+As a last step we need to add the Flink wikipedia connector as a
dependency so that we can
+use it in our program. Edit the `dependencies` section so that it looks
like this:
+
+{% highlight xml %}
+<dependencies>
+ <dependency>
+ <groupId>org.apache.flink</groupId>
+ <artifactId>flink-java</artifactId>
+ <version>${flink.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.flink</groupId>
+ <artifactId>flink-streaming-java_2.10</artifactId>
+ <version>${flink.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.flink</groupId>
+ <artifactId>flink-clients_2.10</artifactId>
+ <version>${flink.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.flink</groupId>
+ <artifactId>flink-connector-wikiedits_2.10</artifactId>
+ <version>${flink.version}</version>
+ </dependency>
+</dependencies>
+{% endhighlight %}
+
+Notice the `flink-connector-wikiedits_2.10` dependency that was added.
+
+## Writing a Flink Program
+
+It's coding time. Fire up your favorite IDE and import the Maven project
or open a text editor and
+create the file `src/main/wikiedits/WikipediaAnalysis.java`:
+
+{% highlight java %}
+package wikiedits;
+
+public class WikipediaAnalysis {
+
+ public static void main(String[] args) throws Exception {
+
+ }
+}
+{% endhighlight %}
+
+I admit it's very bare bones now but we will fill it as we go. Note, that
I'll not give
+import statements here since IDEs can add them automatically. At the end
of this section I'll show
+the complete code with import statements if you simply want to skip ahead
and enter that in your
+editor.
+
+The first step in a Flink program is to create a
`StreamExecutionEnvironment`
+(or `ExecutionEnvironment` if you are writing a batch job). This can be
used to set execution
+parameters and create sources for reading from external systems. So let's
go ahead, add
+this to the main method:
+
+{% highlight java %}
+StreamExecutionEnvironment see =
StreamExecutionEnvironment.getExecutionEnvironment();
+{% endhighlight %}
+
+Next, we will create a source that reads from the Wikipedia IRC log:
+
+{% highlight java %}
+DataStream<WikipediaEditEvent> edits = see.addSource(new
WikipediaEditsSource());
+{% endhighlight %}
+
+This creates a `DataStream` of `WikipediaEditEvent` elements that we can
further process. For
+the purposes of this example we are interrested in determining the number
of added or removed
+bytes that each user causes in a certain time window, let's say five
seconds. For this we first
+have to specify that we want to key the stream on the user name, that is
to say that operations
+on this should take the key into account. In our case the summation of
edited bytes in the windows
+should be per unique user. For keying a Stream we have to provide a
`KeySelector`, like this:
+
+{% highlight java %}
+KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
+ .keyBy(new KeySelector<WikipediaEditEvent, String>() {
+ @Override
+ public String getKey(WikipediaEditEvent event) {
+ return event.getUser();
+ }
+ });
+{% endhighlight %}
+
+This gives us the same Stream of `WikipediaEditEvent` that has a `String`
key that is the user.
+We can now specify that we want to have windows imposed on this stream and
compute some
+result based on elements in these windows. We need to specify a window
here because we are
+dealing with an infinite stream of events. If you want to compute an
aggregation on such an
+infinite stream you never know when you are finished. That's where windows
come into play,
+they specify a time slice in which we should perform our computation. In
our example we will say
--- End diff --
Then how about a single sentence instead of 3?
"We need to specify a window here because we are
dealing with an infinite stream of events. If you want to compute an
aggregation on such an
infinite stream you never know when you are finished. That's where windows
come into play,
they specify a time slice in which we should perform our computation" ==> I
would basically keep the last sentence only.
> Replace Quickstart K-Means Example by Streaming Example
> -------------------------------------------------------
>
> Key: FLINK-3591
> URL: https://issues.apache.org/jira/browse/FLINK-3591
> Project: Flink
> Issue Type: Improvement
> Components: Documentation
> Reporter: Aljoscha Krettek
> Assignee: Aljoscha Krettek
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)