Mike Percy has posted comments on this change.

Change subject: kudu flume sink blog post
......................................................................


Patch Set 1:

(27 comments)

Sorry for the delay, I was out of town for a while.

Because the date has already passed, let's shoot for a publish date of 2016-07-08 
or 2016-07-12.

http://gerrit.cloudera.org:8080/#/c/3510/1/_posts/2016-07-06-flume.md
File _posts/2016-07-06-flume.md:

Line 3: title: "An Introduction to Kudu Flume Sink"
How about "Introduction to the Kudu Flume Sink"?


Line 10: There are many different ways of looking at Kudu. One way is to look 
at it as a tool which can be used to build system which are closer to 
_real-time_ processing of big data but without using _streaming_ software.
typo: s/system/systems/


Line 12: Traditionally in the Hadoop ecosystem we've dealt with various _batch 
processing_ technologies such as Map/Reduce and the many libraries and tools 
built on top of it in various languages (Apache Pig, Apache Hive, Apache Oozie 
and many other things). The main problem with this approach is that it needs to 
process the whole data set in batches, again and again, as soon as new data 
gets added. Things get really complicated when a few such tasks need to get 
chained together, or when the same data set needs to be processed in various 
ways by different jobs, while all compete for the shared cluster resources. The 
whole _orchestration_ becomes a nightmare over time. The opposite of this 
approach is _stream processing_: process the data as soon as it arrives, not in 
batches. Streaming systems such as Spark Streaming, Storm, Kafka Streams, and 
many others make that possible. But writing streaming services is not trivial. 
The streaming systems are becoming more and more capable and support more 
complex constructs, but sometimes you just long for a good 
old database which you can simply store data inside and then query and write 
business logic on top. No slowness and primitiveness of batch processing, and 
no complexity of streaming. Something in between. That's what Kudu is, from 
this point of view. It's a scalable database, you can store big amounts of data 
in it with very impressive ingestion rates, enrich, delete and update that 
data, and generate views and reports. You can pretend it's a good old SQL 
database, but with scalability built in. The ability to use a real database 
instead of a bunch of files is quite empowering and leads to reduced 
complexity. Let's be honest, a bunch of files is not a database, and databases 
are popular for a reason: they enable us to write business logic on top of them 
with ease.
nit: Please wrap this line and all the others (except for code examples) to 100 
chars.


PS1, Line 14: Impala, Cloudera's
s/Impala, Cloudera's query engine/Apache Impala (incubating), a SQL query 
engine/

(Cloudera doesn't own Impala anymore; it's been donated to the ASF)


PS1, Line 14: quite similar to Accumulo
how about: quite similar to Accumulo (but, at the time of writing, without 
Accumulo's security features)


PS1, Line 18: As you can
nit: add a comma. "As you can see,"


PS1, Line 18: website
link the web site: http://flume.apache.org


PS1, Line 24: sources
For the sources sentence, consider replacing this with:

Several sources are provided out of the box with Flume, including a REST-based 
API, Avro and Thrift based APIs, a JMS connector, and sources that implement 
various ways to read files from disk.


PS1, Line 24: JDBC, Kafka and File
Let's not mention the JDBC channel, since it's not recommended for use.

How about: "other options such as Kafka- and File-based channels are also 
provided."


PS1, Line 24: Sinks
For the sinks part, how about saying:

Flume also ships with many sinks, including sinks to write data to HDFS, HBase, 
Hive, Kafka, as well as to other Flume agents.


PS1, Line 26: cloudera
Please use the following repository when linking: 
github.com/apache/incubator-kudu


Line 28: Configuring Kudu Flume Sink
Nit: Configuring the Kudu Flume Sink


Line 31: 
> Do you need to use the back-ticks to turn this into a codeblock or is the i
It looks like the indentation is sufficient.


PS1, Line 51: `lib` directory
Instead of the Flume lib directory, the jar file should be copied into the 
directory $FLUME_HOME/plugins.d/kudu-sink/lib. Details here: 
https://flume.apache.org/FlumeUserGuide.html#installing-third-party-plugins
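
To make this concrete, something roughly like the following could go in the post (the jar 
name and FLUME_HOME location are illustrative, not from the actual build):

```shell
# Sketch of installing the Kudu sink as a Flume third-party plugin.
# FLUME_HOME and the jar file name are assumptions; adjust for your install.
FLUME_HOME="${FLUME_HOME:-/opt/flume}"
mkdir -p "$FLUME_HOME/plugins.d/kudu-sink/lib"
cp kudu-flume-sink.jar "$FLUME_HOME/plugins.d/kudu-sink/lib/"
```

Flume picks up anything under plugins.d automatically at startup, which keeps 
third-party jars separate from Flume's own lib directory.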


PS1, Line 51: show case
nit: showcase


PS1, Line 53: The minimum configuration for KuduSink
At a minimum, the KuduSink


PS1, Line 53: Kudu Flume Sink
nit: The Kudu Flume Sink


Line 59: Parameter Name      | Default                                       | 
Description
> Consider formatting this as a real table or putting it into a codeblock so 
This renders as a table for me when viewed through Jekyll. It would be nice if 
the font size was a little smaller but that's not something that can be 
controlled through the Markdown.


Line 70:     public class SimpleKuduEventProducer implements KuduEventProducer {
> Again, make sure this ends up in a codeblock.
This renders as code for me.


PS1, Line 135: holds
which will hold


PS1, Line 135: payload
Use double-quotes for "payload", not single-quotes


PS1, Line 135: vmstat
from the `vmstat` command.


PS1, Line 135: KuduSink
the KuduSink


PS1, Line 135: KuduSink
"by the KuduSink."


PS1, Line 141: are
lets us


PS1, Line 141: seconds
per second


PS1, Line 141: create data lakes
create a data lake


-- 
To view, visit http://gerrit.cloudera.org:8080/3510
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I810146ab24c88bc6cc562d81746b9bf5303396ed
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: gh-pages
Gerrit-Owner: Ara Ebrahimi <ara.ebrah...@argyledata.com>
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Misty Stanley-Jones <mi...@apache.org>
Gerrit-HasComments: Yes
