Mike Percy has posted comments on this change.

Change subject: kudu flume sink blog post
......................................................................
Patch Set 1: (27 comments)

Sorry for the delay, I was out of town for a while. Because the date already passed, let's shoot for a publish date of 2016-07-08 or 2016-07-12.

http://gerrit.cloudera.org:8080/#/c/3510/1/_posts/2016-07-06-flume.md
File _posts/2016-07-06-flume.md:

Line 3: title: "An Introduction to Kudu Flume Sink"
How about "Introduction to the Kudu Flume Sink"?

Line 10: There are many different ways of looking at Kudu. One way is to look at it as a tool which can be used to build system which are closer to _real-time_ processing of big data but without using _streaming_ software.
typo: s/system/systems/

Line 12: Traditionally in the Hadoop ecosystem we've dealt with various _batch processing_ technologies such as Map/Reduce and the many libraries and tools built on top of it in various languages (Apache Pig, Apache Hive, Apache Oozie and many other things). The main problem with this approach is that it needs to process the whole data set in batches, again and again, as soon as new data gets added. Things get really complicated when a few such tasks need to get chained together, or when the same data set needs to be processed in various ways by different jobs, while all compete for the shared cluster resources. The whole _orchestration_ becomes a nightmare over time. The opposite of this approach is _stream processing_: process the data as soon as it arrives, not in batches. Streaming systems such as Spark Streaming, Storm, Kafka Streams, and many others make that possible. But writing streaming services is not trivial. The streaming systems are becoming more and more capable and support more complex constructs, but sometimes you just long for a good old database which you can simply store data inside and then query and write business logic on top. No slowness and primitiveness of batch processing, and no complexity of streaming. Something in between. That's what Kudu is, from this point of view.
It's a scalable database, you can store big amounts of data in it with very impressive ingestion rates, enrich, delete and update that data, and generate views and reports. You can pretend it's a good old SQL database, but with scalability built in. The ability to use a real database instead of a bunch of files is quite empowering and leads to reduced complexity. Let's be honest, a bunch of files is not a database, and databases are popular for a reason: they enable us to write business logic on top of them with ease.
nit: Please wrap this line and all the others (except for code examples) to 100 chars.

PS1, Line 14: Impala, Cloudera's
s/Impala, Cloudera's query engine/Apache Impala (incubating), a SQL query engine/
(Cloudera doesn't own Impala anymore; it's been donated to the ASF)

PS1, Line 14: quite similar to Accumulo
How about: quite similar to Accumulo (but, at the time of writing, without Accumulo's security features)

PS1, Line 18: As you can
nit: add a comma. "As you can see,"

PS1, Line 18: website
Link the web site: http://flume.apache.org

PS1, Line 24: sources
For the sources sentence, consider replacing this with: Several sources are provided out of the box with Flume, including a REST-based API, Avro- and Thrift-based APIs, a JMS connector, and sources that implement various ways to read files from disk.

PS1, Line 24: JDBC, Kafka and File
Let's not mention the JDBC channel, since it's not recommended for use. How about: "other options such as Kafka- and File-based channels are also provided."

PS1, Line 24: Sinks
For the sinks part, how about saying: Flume also ships with many sinks, including sinks to write data to HDFS, HBase, Hive, and Kafka, as well as to other Flume agents.
PS1, Line 26: cloudera
Please use the following repository when linking: github.com/apache/incubator-kudu

Line 28: Configuring Kudu Flume Sink
nit: Configuring the Kudu Flume Sink

Line 31:
> Do you need to use the back-ticks to turn this into a codeblock or is the i
It looks like the indentation is sufficient.

PS1, Line 51: `lib` directory
Instead of the Flume lib directory, the jar file should be copied into the directory $FLUME_HOME/plugins.d/kudu-sink/lib. Details here: https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins

PS1, Line 51: show case
nit: showcase

PS1, Line 53: The minimum configuration for KuduSink
At a minimum, the KuduSink

PS1, Line 53: Kudu Flume Sink
nit: The Kudu Flume Sink

Line 59: Parameter Name | Default | Description
> Consider formatting this as a real table or putting it into a codeblock so
This renders as a table for me when viewed through Jekyll. It would be nice if the font size was a little smaller, but that's not something that can be controlled through the Markdown.

Line 70: public class SimpleKuduEventProducer implements KuduEventProducer {
> Again, make sure this ends up in a codeblock.
This renders as code for me.

PS1, Line 135: holds
which will hold

PS1, Line 135: payload
Use double-quotes for "payload", not single-quotes.

PS1, Line 135: vmstat
from the `vmstat` command.

PS1, Line 135: KuduSink
the KuduSink

PS1, Line 135: KuduSink
"by the KuduSink."

PS1, Line 141: are
lets us

PS1, Line 141: seconds
per second

PS1, Line 141: create data lakes
create a data lake

--
To view, visit http://gerrit.cloudera.org:8080/3510
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I810146ab24c88bc6cc562d81746b9bf5303396ed
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: gh-pages
Gerrit-Owner: Ara Ebrahimi <ara.ebrah...@argyledata.com>
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Misty Stanley-Jones <mi...@apache.org>
Gerrit-HasComments: Yes
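P.S. To make the Line 51 plugins.d comment concrete, here is a rough sketch of the install layout and a minimal agent configuration. This is only an illustration: FLUME_HOME, the jar file name, the agent/channel/sink names, the master address, and the KuduSink class's package are placeholder assumptions on my part, not taken from the patch (the package name in particular varies by Kudu release).

```shell
# Sketch of the Flume third-party plugin layout for the Kudu sink.
# All paths and names below are illustrative placeholders.
FLUME_HOME="${FLUME_HOME:-$(mktemp -d)}"
mkdir -p "$FLUME_HOME/plugins.d/kudu-sink/lib" "$FLUME_HOME/conf"

# A real deployment would copy the built sink jar here; an empty
# placeholder file keeps the sketch self-contained.
touch kudu-flume-sink.jar
cp kudu-flume-sink.jar "$FLUME_HOME/plugins.d/kudu-sink/lib/"
rm kudu-flume-sink.jar

# Minimal agent configuration wiring a memory channel to the Kudu sink
# (the sink class name is an assumption; check your Kudu release).
cat > "$FLUME_HOME/conf/flume.conf" <<'EOF'
agent1.channels = channel1
agent1.sinks = sink1
agent1.channels.channel1.type = memory
agent1.sinks.sink1.type = org.apache.kudu.flume.sink.KuduSink
agent1.sinks.sink1.masterAddresses = kudu-master-host:7051
agent1.sinks.sink1.tableName = stats
agent1.sinks.sink1.channel = channel1
EOF

ls "$FLUME_HOME/plugins.d/kudu-sink/lib"   # prints: kudu-flume-sink.jar
```

With that layout in place, Flume picks the jar up automatically at startup, which is why we should steer readers away from dropping it into the top-level lib directory.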