[
https://issues.apache.org/jira/browse/HBASE-14158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953522#comment-14953522
]
Hadoop QA commented on HBASE-14158:
-----------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12766127/HBASE-14158.7.patch
against master branch at commit 587f5bc11f9d5d37557baf36c7df110af860a95c.
ATTACHMENT ID: 12766127
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:green}+0 tests included{color}. The patch appears to be a
documentation patch that doesn't require tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.6.1 2.7.0
2.7.1)
{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.
{color:green}+1 protoc{color}. The applied patch does not increase the
total number of protoc compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any
warning messages.
{color:green}+1 checkstyle{color}. The applied patch does not increase the
total number of checkstyle errors.
{color:green}+1 findbugs{color}. The patch does not introduce any new
Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.
{color:red}-1 lineLengths{color}. The patch introduces the following lines
longer than 100:
+Apache Spark is a software framework that is used to process data in
memory in a distributed manner, and is replacing MapReduce in many use cases.
http://spark.apache.org/[Apache Spark]
Spark itself is out of scope of this document; please refer to the Spark site
for more information on the Spark project and subprojects. This document will
focus on 4 main interaction points between Spark and HBase. Those interaction
points are:
1. Basic Spark: The ability to have an HBase Connection at any point in your
Spark DAG.
2. Spark Streaming: The ability to have an HBase Connection at any point in your
Spark Streaming application.
3. Spark Bulk Load: The ability to write directly to HBase HFiles for bulk
insertion into HBase.
4. SparkSQL/DataFrames: The ability to write SparkSQL that draws on tables that
are represented in HBase.
The following sections will walk through examples of all the interaction
points just listed above.
+Here we will talk about Spark HBase integration at the lowest and simplest
levels. All the other interaction points are built upon the concepts that will
be described here.
At the root of all Spark and HBase integration is the HBaseContext. The
HBaseContext takes in HBase configurations and pushes them to the Spark
executors. This allows us to have an HBase Connection per Spark Executor in a
static location.
Just for reference, Spark Executors can be on the same nodes as the Region
Servers or on different nodes; there is no dependence on co-location. Think of
every Spark Executor as a multi-threaded client application.
This allows any Spark Tasks running on the executors to access the shared
Connection object.
Here is a simple example of how the HBaseContext can be used. In this example
we are doing a foreachPartition on an RDD in Scala.
----
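// NOTE: the patch's example code was elided from this report. What follows is
// a minimal sketch, assuming the hbase-spark module's
// HBaseContext.foreachPartition(rdd, (iterator, connection) => ...) signature;
// the table name, column layout, and record format are hypothetical.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext

val sc = new SparkContext("local", "HBaseContextExample")
val config = HBaseConfiguration.create()

// Construct the HBaseContext; this pushes the configuration to the executors.
val hbaseContext = new HBaseContext(sc, config)

// Hypothetical records of (rowKey, Seq((family, qualifier, value))).
val rdd = sc.parallelize(Seq(
  (Bytes.toBytes("1"), Seq((Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"))))))

hbaseContext.foreachPartition(rdd, (it, connection) => {
  // One shared Connection per executor; buffer mutations for the partition.
  val mutator = connection.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach { case (rowKey, values) =>
    val put = new Put(rowKey)
    values.foreach(v => put.addColumn(v._1, v._2, v._3))
    mutator.mutate(put)
  }
  mutator.flush()
  mutator.close()
})
----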
+If Java is preferred instead of Scala, it will look a little different but
is still very possible, as we can see with this example.
+----
All functionality between Spark and HBase will be supported both in Scala and
in Java, with the exception of SparkSQL, which will support any language that is
supported by Spark. For the remainder of this documentation we will focus on
Scala examples for now.
Now the examples above illustrate how to do a foreachPartition with a
connection. There are a number of other base Spark functions that are
supported out of the box:
1. BulkPut: For massively parallel sending of puts to HBase
2. BulkDelete: For massively parallel sending of deletes to HBase
3. BulkGet: For massively parallel sending of gets to HBase to create a new RDD
4. MapPartition: To do a Spark Map function with a Connection object to allow
full access to HBase
5. HBaseRDD: To simplify a distributed scan to create an RDD
Examples of all these functionalities can be found in the HBase-Spark Module.
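As a quick illustration of the first of these, here is a minimal sketch of
bulkPut, reusing the sc, hbaseContext, and rdd from the earlier sketch and
assuming the module exposes an HBaseContext.bulkPut(rdd, tableName, recordToPut)
function:
----
hbaseContext.bulkPut[(Array[Byte], Seq[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd,
  TableName.valueOf("t1"),
  (putRecord) => {
    // Convert each hypothetical (rowKey, Seq((family, qualifier, value)))
    // record into a Put; the module sends the Puts in parallel.
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) =>
      put.addColumn(putValue._1, putValue._2, putValue._3))
    put
  })
----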
== Spark Streaming
+http://spark.apache.org/streaming/[Spark Streaming] is a micro batching stream
processing framework built on top of Spark. HBase and Spark Streaming make
great companions, in that HBase can provide the following benefits alongside
Spark Streaming:
1. A place to grab reference data or profile data on the fly
2. A place to store counts or aggregates in a way that supports Spark Streaming's
promise of only-once processing.
The HBase-Spark module's integration points with Spark Streaming are very
similar to its normal Spark integration points, in that the following commands
are possible straight off a Spark Streaming DStream:
1. BulkPut: For massively parallel sending of puts to HBase
2. BulkDelete: For massively parallel sending of deletes to HBase
3. BulkGet: For massively parallel sending of gets to HBase to create a new RDD
4. ForeachPartition: To do a Spark Foreach function with a Connection object to
allow full access to HBase
5. MapPartitions: To do a Spark Map function with a Connection object to allow
full access to HBase
Below is an example of bulkPut with DStreams; as you will see, it is very close
in feel to the RDD bulkPut.
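(The patch's code block is not reproduced in this report; the sketch below is a
stand-in. It assumes the module exposes an
HBaseContext.streamBulkPut(dstream, tableName, recordToPut) function, and uses a
hypothetical queue-backed DStream of the same record layout as the earlier
sketches.)
----
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val ssc = new StreamingContext(sc, Seconds(1))
// A queue-backed DStream stands in for a real streaming source here.
val dStream = ssc.queueStream(mutable.Queue(rdd))

hbaseContext.streamBulkPut[(Array[Byte], Seq[(Array[Byte], Array[Byte], Array[Byte])])](
  dStream,
  TableName.valueOf("t1"),
  (putRecord) => {
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) =>
      put.addColumn(putValue._1, putValue._2, putValue._3))
    put
  })

ssc.start()
ssc.awaitTerminationOrTimeout(10000)
----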
+1. The hbaseContext that carries the configuration broadcast information that
links us to the HBase Connections in the executors
+Spark bulk load follows the MapReduce implementation of bulk load very
closely. In short, a partitioner partitions based on region splits, and the row
keys are sent to the reducers in order so that HFiles can be written out. In
Spark terms, the bulk load is focused around a
repartitionAndSortWithinPartitions followed by a foreachPartition.
The only major difference between the Spark implementation and the MapReduce
implementation is that the column qualifier is included in the shuffle ordering
process. This was done because the MapReduce bulk load implementation would
have memory issues when loading rows with a large number of columns, as the
sorting of those columns was done in the memory of the reducer JVM. Now that
the ordering is done in the Spark shuffle, there should no longer be a limit to
the number of columns in a row for bulk loading.
Below is simple code showing what bulk loading with Spark looks like.
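(Again the patch's code block is not reproduced here; the following minimal
sketch assumes the module's implicit
rdd.hbaseBulkLoad(hbaseContext, tableName, flatMap, stagingDir) function and
KeyFamilyQualifier class, with a hypothetical staging path.)
----
import org.apache.hadoop.hbase.spark.HBaseRDDFunctions._
import org.apache.hadoop.hbase.spark.KeyFamilyQualifier

val stagingFolder = "/tmp/hbase-staging"  // hypothetical HDFS staging path

rdd.hbaseBulkLoad(hbaseContext,
  TableName.valueOf("t1"),
  t => {
    // Flatten each record into (KeyFamilyQualifier, cellValue) pairs so the
    // shuffle can partition on the row key and sort on all three fields.
    val rowKey = t._1
    t._2.map { case (family, qualifier, value) =>
      (new KeyFamilyQualifier(rowKey, family, qualifier), value)
    }.iterator
  },
  stagingFolder)
----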
+2. A function that will convert a record in the RDD to a tuple key-value pair,
with the tuple key being a KeyFamilyQualifier object and the value being the
cell value. The KeyFamilyQualifier object will hold the RowKey, Column Family,
and Column Qualifier. The shuffle will partition on the RowKey but will sort
by all three values.
+Then, following the Spark bulk load command, we need to use HBase's
LoadIncrementalHFiles object to load the newly created HFiles into HBase.
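A minimal sketch of that step, assuming the standard
LoadIncrementalHFiles.doBulkLoad(path, admin, table, regionLocator) API and the
hypothetical staging path from above:
----
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

val conn = ConnectionFactory.createConnection(config)
val tableName = TableName.valueOf("t1")
val table = conn.getTable(tableName)
try {
  // Move the HFiles written by the Spark bulk load into the live table.
  val load = new LoadIncrementalHFiles(config)
  load.doBulkLoad(new Path(stagingFolder),
    conn.getAdmin, table, conn.getRegionLocator(tableName))
} finally {
  table.close()
  conn.close()
}
----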
+Now there are advanced options for bulk load with Spark. We can set the
following attributes with additional parameter options on hbaseBulkLoad.
{color:green}+1 site{color}. The mvn post-site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .
{color:red}-1 core zombie tests{color}. There are 1 zombie test(s):
at
org.apache.hadoop.hdfs.qjournal.client.TestQJMWithFaults.testRecoverAfterDoubleFailures(TestQJMWithFaults.java:147)
Test results:
https://builds.apache.org/job/PreCommit-HBASE-Build/15967//testReport/
Release Findbugs (version 2.0.3) warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/15967//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors:
https://builds.apache.org/job/PreCommit-HBASE-Build/15967//artifact/patchprocess/checkstyle-aggregate.html
Console output:
https://builds.apache.org/job/PreCommit-HBASE-Build/15967//console
This message is automatically generated.
> Add documentation for Initial Release for HBase-Spark Module integration
> -------------------------------------------------------------------------
>
> Key: HBASE-14158
> URL: https://issues.apache.org/jira/browse/HBASE-14158
> Project: HBase
> Issue Type: Improvement
> Components: documentation, spark
> Reporter: Ted Malaska
> Assignee: Ted Malaska
> Fix For: 2.0.0
>
> Attachments: HBASE-14158.1.patch, HBASE-14158.2.patch,
> HBASE-14158.5.patch, HBASE-14158.5.patch, HBASE-14158.6.patch,
> HBASE-14158.7.patch
>
>
> Add documentation for Initial Release for HBase-Spark Module integration
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)