[GitHub] flink pull request #4696: [FLINK-7632] [document] Overhaul on Cassandra conn...

mcfongtw Fri, 22 Sep 2017 17:38:14 -0700

Github user mcfongtw commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4696#discussion_r140620313
  
    --- Diff: docs/dev/connectors/cassandra.md ---
    @@ -78,76 +96,195 @@ Note that enabling this feature will have an adverse impact on latency.
     
     <p style="border-radius: 5px; padding: 5px" class="bg-danger"><b>Note</b>: 
The write-ahead log functionality is currently experimental. In many cases it 
is sufficient to use the connector without enabling it. Please report problems 
to the development mailing list.</p>
     
    +### Checkpointing and Fault Tolerance
    +With checkpointing enabled, the Cassandra sink guarantees at-least-once delivery of action requests to the C* instance.
     
    -#### Example
    +<p style="border-radius: 5px; padding: 5px" class="bg-danger"><b>Note</b>: However, the current Cassandra sink implementation does not flush pending mutations before a checkpoint is triggered. Thus, some in-flight mutations might not be replayed when the job recovers.</p>
    +
    +For more details, see the [checkpointing docs]({{ site.baseurl }}/dev/stream/state/checkpointing.html) and the [fault tolerance guarantee docs]({{ site.baseurl }}/dev/connectors/guarantees.html).
    +
    +To enable the fault tolerance guarantee, checkpointing of the topology must be enabled on the execution environment:
     
     <div class="codetabs" markdown="1">
     <div data-lang="java" markdown="1">
     {% highlight java %}
    -CassandraSink.addSink(input)
    -  .setQuery("INSERT INTO example.values (id, counter) values (?, ?);")
    -  .setClusterBuilder(new ClusterBuilder() {
    -    @Override
    -    public Cluster buildCluster(Cluster.Builder builder) {
    -      return builder.addContactPoint("127.0.0.1").build();
    -    }
    -  })
    -  .build();
    +final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    +env.enableCheckpointing(5000); // checkpoint every 5000 msecs
     {% endhighlight %}
     </div>
     <div data-lang="scala" markdown="1">
     {% highlight scala %}
    -CassandraSink.addSink(input)
    -  .setQuery("INSERT INTO example.values (id, counter) values (?, ?);")
    -  .setClusterBuilder(new ClusterBuilder() {
    -    override def buildCluster(builder: Cluster.Builder): Cluster = {
    -      builder.addContactPoint("127.0.0.1").build()
    -    }
    -  })
    -  .build()
    +val env = StreamExecutionEnvironment.getExecutionEnvironment()
    +env.enableCheckpointing(5000) // checkpoint every 5000 msecs
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +## Examples
    +
    +The Cassandra sinks currently support both Java Tuple and POJO data types, and Flink automatically detects which type of input is used. For general use of these streaming data types, please refer to [Supported Data Types]({{ site.baseurl }}/dev/api_concepts.html). We show two implementations based on [SocketWindowWordCount](https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/socket/SocketWindowWordCount.java), for Java Tuple and POJO data types respectively.
    +
    +All of these examples assume that the keyspace `example` and the table `wordcount` have already been created.
    +
    +<div class="codetabs" markdown="1">
    +<div data-lang="CQL" markdown="1">
    +{% highlight sql %}
    +CREATE KEYSPACE IF NOT EXISTS example
    +    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
    +CREATE TABLE IF NOT EXISTS example.wordcount (
    +    word text,
    +    count bigint,
    +    PRIMARY KEY(word)
    +    );
    +{% endhighlight %}
    +</div>
    +</div>
    +
    +### Cassandra Sink Example for Streaming Java Tuple Data Type
    +When storing results of the Java Tuple data type to a Cassandra sink, a CQL upsert statement must be set (via `setQuery("stmt")`) to persist each record to the database. The upsert query is cached as a `PreparedStatement`, and the Cassandra connector internally binds the elements of each Tuple as parameters of that statement.
    +
    +For details about `PreparedStatement` and `BoundStatement`, please visit the [DataStax Java Driver manual](https://docs.datastax.com/en/developer/java-driver/2.1/manual/statements/prepared/).
    +
    +Please note that if the upsert query is not set, an `IllegalArgumentException` is thrown with the error message `Query must not be null or empty.`
    +
    +<div class="codetabs" markdown="1">
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +// get the execution environment
    +final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    +
    +// get input data by connecting to the socket
    +DataStream<String> text = env.socketTextStream(hostname, port, "\n");
    +
    +// parse the data, group it, window it, and aggregate the counts
    +DataStream<Tuple2<String, Long>> result = text
    +
    +        .flatMap(new FlatMapFunction<String, Tuple2<String, Long>>() {
    +            @Override
    +            public void flatMap(String value, Collector<Tuple2<String, Long>> out) {
    +                // normalize and split the line
    +                String[] words = value.toLowerCase().split("\\W+");
    +
    +                // emit the pairs
    +                for (String word : words) {
    +                    // do not accept empty words, since word is defined as the primary key in the C* table
    +                    if (!word.isEmpty()) {
    +                        out.collect(new Tuple2<String, Long>(word, 1L));
    +                    }
    +                }
    +            }
    +        })
    +
    +        .keyBy(0)
    +        .timeWindow(Time.seconds(5))
    +        .sum(1)
    +        ;
    +
    +CassandraSink.addSink(result)
    +        .setQuery("INSERT INTO example.wordcount(word, count) values (?, ?);")
    +        .setHost("127.0.0.1")
    +        .build();
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
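    +// `input` is the Scala DataStream to persist; the sink expects the underlying Java stream
    +CassandraSink.addSink(input.javaStream)
    +    .setQuery("INSERT INTO example.wordcount(word, count) values (?, ?);")
    +    .setClusterBuilder(new ClusterBuilder() {
    +        // Scala uses the `override` modifier rather than the Java `@Override` annotation
    +        override def buildCluster(builder: Cluster.Builder): Cluster = {
    +            builder.addContactPoint("127.0.0.1").build()
    +        }
    +    })
    +    .build()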
     {% endhighlight %}
     </div>
     </div>
     
    -The Cassandra sinks support both tuples and POJO's that use DataStax 
annotations.
    -Flink automatically detects which type of input is used.
     
    -Example for such a Pojo:
    +### Cassandra Sink Example for Streaming POJO Data Type
    +This example streams a POJO data type and stores the same POJO entity back to Cassandra. The POJO implementation needs to be annotated as described in the [DataStax Java Driver Manual](http://docs.datastax.com/en/developer/java-driver/2.1/manual/object_mapper/creating/), because the Cassandra connector internally maps each field of this entity to an associated column of the designated table using the `com.datastax.driver.mapping.Mapper` class of the DataStax Java Driver.
    +
    +Please note that if an upsert query is set for a POJO stream, an `IllegalArgumentException` is thrown with the error message `Specifying a query is not allowed when using a Pojo-Stream as input.`
    +
    +For the CQL data types available for table columns, please refer to the [CQL documentation](https://docs.datastax.com/en/cql/3.1/cql/cql_reference/cql_data_types_c.html).
    --- End diff --
    
    Will rephrase. Thank you.
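
A side note on the empty-word check in the Java Tuple example: `String.split("\\W+")` keeps a leading empty token when a line starts with a separator, so without the `isEmpty()` guard an empty `word` could reach the sink as a primary key. The following is a minimal, Flink-free sketch of just that tokenization step; the class and method names (`WordTokenizerSketch`, `nonEmptyWords`) are hypothetical and not part of the connector or the docs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mirrors the normalize-and-split step of the flatMap in the diff.
final class WordTokenizerSketch {

    // Lower-cases the line, splits on runs of non-word characters, and drops empty tokens.
    static List<String> nonEmptyWords(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            // Skip empty tokens: `word` is the primary key of example.wordcount.
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // The raw split of a line with a leading separator contains an empty first token.
        System.out.println(" Hello, world".toLowerCase().split("\\W+").length);
        // The filtered result contains only the actual words.
        System.out.println(nonEmptyWords(" Hello, world"));
    }
}
```

In the actual job the same guard sits inside `flatMap`, so the filtering happens before `keyBy` and the window aggregation.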
