[ 
https://issues.apache.org/jira/browse/HBASE-14158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953522#comment-14953522
 ] 

Hadoop QA commented on HBASE-14158:
-----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12766127/HBASE-14158.7.patch
  against master branch at commit 587f5bc11f9d5d37557baf36c7df110af860a95c.
  ATTACHMENT ID: 12766127

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:green}+0 tests included{color}.  The patch appears to be a 
documentation patch that doesn't require tests.

    {color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.6.1 2.7.0 
2.7.1)

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

    {color:green}+1 checkstyle{color}.  The applied patch does not increase the total number of checkstyle errors.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
http://spark.apache.org/[Apache Spark] is a software framework that is used to process data in memory in a distributed manner, and is replacing MapReduce in many use cases.

Spark itself is out of scope for this document; please refer to the Spark site for more information on the Spark project and its subprojects. This document will focus on four main interaction points between Spark and HBase. Those interaction points are:

1. Basic Spark: The ability to have an HBase Connection at any point in your Spark DAG.
2. Spark Streaming: The ability to have an HBase Connection at any point in your Spark Streaming application.
3. Spark Bulk Load: The ability to write directly to HBase HFiles for bulk insertion into HBase.
4. SparkSQL/DataFrames: The ability to write SparkSQL that draws on tables that are represented in HBase.

The following sections will walk through examples of all the interaction points just listed above.

== Basic Spark

Here we will talk about Spark HBase integration at the lowest and simplest levels. All the other interaction points are built upon the concepts described here.

At the root of all Spark and HBase integration is the HBaseContext.  The 
HBaseContext takes in HBase configurations and pushes them to the Spark 
executors.  This allows us to have an HBase Connection per Spark Executor in a 
static location.

Just for reference, Spark Executors can be on the same nodes as the Region Servers or on different nodes; there is no dependence on co-location. Think of every Spark Executor as a multi-threaded client application.

This allows any Spark Tasks running on the executors to access the shared 
Connection object.

Here is a simple example of how the HBaseContext can be used. In this example we are doing a foreachPartition on an RDD in Scala.

----
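// NOTE: a minimal sketch, not the patch's own example. It assumes the
// hbase-spark module's HBaseContext API; the table name "t1" and column
// family "c" are placeholders.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext

val sc = new SparkContext("local", "test")
val config = HBaseConfiguration.create()

// The HBaseContext pushes the configuration to the executors, giving
// each executor a shared HBase Connection.
val hbaseContext = new HBaseContext(sc, config)

val rdd = sc.parallelize(Array(
  (Bytes.toBytes("1"), Bytes.toBytes("value1")),
  (Bytes.toBytes("2"), Bytes.toBytes("value2"))))

// Every partition receives its iterator of records plus the open
// Connection, so we can batch Puts through a BufferedMutator.
hbaseContext.foreachPartition(rdd, (it, conn) => {
  val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach(r => {
    val put = new Put(r._1)
    put.addColumn(Bytes.toBytes("c"), Bytes.toBytes("q"), r._2)
    bufferedMutator.mutate(put)
  })
  bufferedMutator.flush()
  bufferedMutator.close()
})
----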
If Java is preferred instead of Scala, the code will look a little different but is still entirely possible. All functionality between Spark and HBase will be supported both in Scala and in Java, with the exception of SparkSQL, which will support any language that is supported by Spark. For the remainder of this documentation we will focus on Scala examples.

Now, the examples above illustrate how to do a foreachPartition with a Connection. There are a number of other Spark base functions that are supported out of the box:

1. BulkPut: For massively parallel sending of puts to HBase
2. BulkDelete: For massively parallel sending of deletes to HBase
3. BulkGet: For massively parallel sending of gets to HBase to create a new RDD
4. MapPartition: To do a Spark Map function with a Connection object to allow full access to HBase
5. HBaseRDD: To simplify a distributed scan to create an RDD

Examples of all these functionalities can be found in the HBase-Spark module; a bulkPut example follows below.
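As one illustration, here is a hedged sketch of a bulkPut, assuming the module's bulkPut function taking an RDD, a TableName, and a record-to-Put function; the table name "t1", column family "c", and the record layout are placeholders:

----
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext

val sc = new SparkContext("local", "test")
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

// Placeholder records of (rowKey, value).
val rdd = sc.parallelize(Array(
  (Bytes.toBytes("1"), Bytes.toBytes("foo")),
  (Bytes.toBytes("2"), Bytes.toBytes("bar"))))

// bulkPut converts each record to a Put and sends the Puts to HBase
// in parallel, one shared Connection per executor.
hbaseContext.bulkPut[(Array[Byte], Array[Byte])](rdd,
  TableName.valueOf("t1"),
  r => {
    val put = new Put(r._1)
    put.addColumn(Bytes.toBytes("c"), Bytes.toBytes("q"), r._2)
    put
  })
----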

== Spark Streaming
http://spark.apache.org/streaming/[Spark Streaming] is a micro-batching stream processing framework built on top of Spark. HBase and Spark Streaming make great companions, in that HBase can provide the following benefits alongside Spark Streaming:

1. A place to grab reference data or profile data on the fly.
2. A place to store counts or aggregates in a way that supports Spark Streaming's promise of exactly-once processing.

The HBase-Spark module's integration points with Spark Streaming are very similar to its normal Spark integration points, in that the following commands are possible straight off a Spark Streaming DStream:

1. BulkPut: For massively parallel sending of puts to HBase
2. BulkDelete: For massively parallel sending of deletes to HBase
3. BulkGet: For massively parallel sending of gets to HBase to create a new RDD
4. ForeachPartition: To do a Spark Foreach function with a Connection object to allow full access to HBase
5. MapPartitions: To do a Spark Map function with a Connection object to allow full access to HBase

Below is an example of bulkPut with DStreams; as you will see, it is very close in feel to the RDD bulkPut.
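What follows is a hedged sketch rather than the patch's own example, assuming the module's streamBulkPut function; the socket source, table name "t1", and column family "c" are placeholders. The call takes the hbaseContext (see the note below), the target table name, and a function that converts a record in the DStream into a Put.

----
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("local[2]", "test")
val ssc = new StreamingContext(sc, Seconds(1))
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

// A placeholder DStream; every line becomes a (rowKey, value) record.
val stream = ssc.socketTextStream("localhost", 9999)
  .map(line => (Bytes.toBytes(line), Bytes.toBytes(line)))

// streamBulkPut applies a bulkPut to each micro batch of the DStream.
hbaseContext.streamBulkPut[(Array[Byte], Array[Byte])](stream,
  TableName.valueOf("t1"),
  r => {
    val put = new Put(r._1)
    put.addColumn(Bytes.toBytes("c"), Bytes.toBytes("q"), r._2)
    put
  })

ssc.start()
ssc.awaitTerminationOrTimeout(60000)
----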
1. The hbaseContext carries the configuration broadcast information that links us to the HBase Connections in the executors.

== Bulk Load

Spark bulk load follows the MapReduce implementation of bulk load very closely. In short, a partitioner partitions based on region splits, and the row keys are sent to the reducers in order so that HFiles can be written out directly. In Spark terms, the bulk load is implemented around a repartitionAndSortWithinPartitions followed by a foreachPartition.

The only major difference between the Spark implementation and the MapReduce implementation is that the column qualifier is included in the shuffle ordering process. This was done because the MapReduce bulk load implementation would have memory issues when loading rows with a large number of columns, as the sorting of those columns was done in the memory of the reducer JVM. Now that the ordering is done in the Spark shuffle, there should no longer be a limit on the number of columns in a row for bulk loading.

Below is a simple example of what bulk loading with Spark looks like.
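What follows is a hedged sketch rather than the patch's own example: it assumes the module's hbaseBulkLoad function (available on RDDs via the HBaseRDDFunctions implicits) together with its KeyFamilyQualifier class; the table name "t1", column family "c", and staging path are placeholders.

----
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.spark.{HBaseContext, KeyFamilyQualifier}
import org.apache.hadoop.hbase.spark.HBaseRDDFunctions._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext

val sc = new SparkContext("local", "test")
val config = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(sc, config)

// Placeholder records of (rowKey, (family, qualifier, value)).
val rdd = sc.parallelize(Array(
  (Bytes.toBytes("1"), (Bytes.toBytes("c"), Bytes.toBytes("a"), Bytes.toBytes("foo1"))),
  (Bytes.toBytes("2"), (Bytes.toBytes("c"), Bytes.toBytes("b"), Bytes.toBytes("foo2")))))

// The function below emits (KeyFamilyQualifier, value) pairs; the
// shuffle partitions on the row key but sorts on all three parts.
rdd.hbaseBulkLoad(hbaseContext,
  TableName.valueOf("t1"),
  t => {
    val (family, qualifier, value) = t._2
    Seq((new KeyFamilyQualifier(t._1, family, qualifier), value)).iterator
  },
  "/tmp/hbase-staging") // placeholder path where the HFiles are written
----

The hbaseBulkLoad call takes the table name, the record-conversion function (described in the note below), and a temporary path where the HFiles are written.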
2. A function that will convert a record in the RDD to a tuple key-value pair, with the tuple key being a KeyFamilyQualifier object and the value being the cell value. The KeyFamilyQualifier object will hold the RowKey, Column Family, and Column Qualifier. The shuffle will partition on the RowKey but will sort by all three values.
Then, following the Spark bulk load command, we need to use HBase's LoadIncrementalHFiles object to load the newly created HFiles into HBase.
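A hedged sketch of that load step, assuming the LoadIncrementalHFiles class from org.apache.hadoop.hbase.mapreduce and the same placeholder table and staging path as in the sketch above:

----
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

val config = HBaseConfiguration.create()
val conn = ConnectionFactory.createConnection(config)
val table = conn.getTable(TableName.valueOf("t1"))
try {
  // Move the newly written HFiles into the regions of the target table.
  val load = new LoadIncrementalHFiles(config)
  load.doBulkLoad(new Path("/tmp/hbase-staging"),
    conn.getAdmin, table,
    conn.getRegionLocator(TableName.valueOf("t1")))
} finally {
  table.close()
  conn.close()
}
----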
Now, there are also advanced options for bulk load with Spark. We can set the following attributes with additional parameter options on hbaseBulkLoad.
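As an illustration, continuing from the bulk load sketch above (rdd and hbaseContext are reused from there), here is a hedged example assuming a per-family FamilyHFileWriteOptions map can be passed as an additional hbaseBulkLoad parameter; the option values shown for compression, bloom type, block size, and data block encoding are placeholders, not a definitive list:

----
import java.util
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.spark.{FamilyHFileWriteOptions, KeyFamilyQualifier}
import org.apache.hadoop.hbase.util.Bytes

// Per-column-family HFile write settings: compression codec, bloom
// filter type, block size, and data block encoding (example values).
val familyOptions = new util.HashMap[Array[Byte], FamilyHFileWriteOptions]
familyOptions.put(Bytes.toBytes("c"),
  new FamilyHFileWriteOptions("GZ", "ROW", 65536, "PREFIX"))

// The options map rides along as an additional parameter.
rdd.hbaseBulkLoad(hbaseContext,
  TableName.valueOf("t1"),
  t => {
    val (family, qualifier, value) = t._2
    Seq((new KeyFamilyQualifier(t._1, family, qualifier), value)).iterator
  },
  "/tmp/hbase-staging",
  familyOptions)
----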

  {color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

    {color:green}+1 core tests{color}.  The patch passed unit tests in .

     {color:red}-1 core zombie tests{color}.  There are 1 zombie test(s):       
at 
org.apache.hadoop.hdfs.qjournal.client.TestQJMWithFaults.testRecoverAfterDoubleFailures(TestQJMWithFaults.java:147)

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/15967//testReport/
Release Findbugs (version 2.0.3)        warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/15967//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/15967//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/15967//console

This message is automatically generated.

> Add documentation for Initial Release for HBase-Spark Module integration 
> -------------------------------------------------------------------------
>
>                 Key: HBASE-14158
>                 URL: https://issues.apache.org/jira/browse/HBASE-14158
>             Project: HBase
>          Issue Type: Improvement
>          Components: documentation, spark
>            Reporter: Ted Malaska
>            Assignee: Ted Malaska
>             Fix For: 2.0.0
>
>         Attachments: HBASE-14158.1.patch, HBASE-14158.2.patch, 
> HBASE-14158.5.patch, HBASE-14158.5.patch, HBASE-14158.6.patch, 
> HBASE-14158.7.patch
>
>
> Add documentation for Initial Release for HBase-Spark Module integration 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
