Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Thanks Michal,

Just to share what I'm working on in a related area: a long time ago I
built SparkOnHBase and put it into Cloudera Labs, as described here:
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

I have also recently been working on getting this into HBase core.  It will
hopefully be in HBase core within the next couple of weeks.

https://issues.apache.org/jira/browse/HBASE-13992

Then I'm planning on adding DataFrame and bulk load support through

https://issues.apache.org/jira/browse/HBASE-14149
https://issues.apache.org/jira/browse/HBASE-14150

Also, if you are interested, this is running today at at least half a dozen
companies with Spark Streaming.  Here is one blog post about a successful
implementation:

http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/

Here is an additional example blog post I put together:

http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

Let me know if you have any questions, and let me know if you want to
connect and join efforts.

Ted Malaska

On Tue, Jul 28, 2015 at 11:59 AM, Michal Haris michal.ha...@visualdna.com
wrote:

 Hi all, for the last couple of months I've been working on large graph
 analytics and along the way have written from scratch an HBase-Spark
 integration, as none of the ones out there worked either in terms of scale
 or in the way they integrated with the RDD interface. This week I have
 generalised it into an (almost) Spark module, which works with the latest
 Spark and the new HBase API, so... sharing!
 https://github.com/michal-harish/spark-on-hbase


 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com | t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR



Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Oops, yes, I'm still messing with the repo on a daily basis. Fixed.



Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Hi Ted, yes, the Cloudera blog and your code were my starting point, but I
needed something more Spark-centric rather than HBase-centric. Basically I'm
doing a lot of ad-hoc transformations with RDDs based on HBase tables and
then mutating them after a series of iterative (BSP-like) steps.



Re: Generalised Spark-HBase integration

2015-07-28 Thread Jules Damji

Brilliant! Will check it out.

Cheers
Jules

--
The Best Ideas Are Simple
Jules Damji
Developer Relations & Community Outreach
jda...@hortonworks.com
http://hortonworks.com



Re: Generalised Spark-HBase integration

2015-07-28 Thread Michal Haris
Cool, will revisit. Is your latest code publicly visible somewhere?



Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Yu
I got a compilation error:

[INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling
[INFO] Compiling 18 source files to /home/hbase/s-on-hbase/target/classes
at 1438099569598
[ERROR]
/home/hbase/s-on-hbase/src/main/scala/org/apache/spark/hbase/examples/simple/HBaseTableSimple.scala:36:
error: type mismatch;
[INFO]  found   : Int
[INFO]  required: Short
[INFO]   while (scanner.advance) numCells += 1
[INFO]^
[ERROR] one error found

FYI
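
For anyone hitting the same message: in Scala, `numCells += 1` on a `Short` variable is rewritten to `numCells = numCells + 1`, and `Short + Int` has type `Int`, which is what the compiler reports above. The sketch below shows two possible ways around it. It is only an illustration: `FakeScanner` and `CellCount` are made-up names standing in for the real code in HBaseTableSimple.scala, and this is not necessarily how the repository was fixed.

```scala
// Hypothetical stand-in for the cell scanner in the failing example:
// advance() returns true while there are more cells to visit.
final class FakeScanner(total: Int) {
  private var seen = 0
  def advance(): Boolean = { seen += 1; seen <= total }
}

object CellCount {
  // Option 1: count with an Int, so `+= 1` type-checks directly.
  def countAsInt(scanner: FakeScanner): Int = {
    var numCells = 0
    while (scanner.advance()) numCells += 1
    numCells
  }

  // Option 2: keep the Short, but narrow the Int result explicitly.
  // `numCells += 1` would not compile here, because Short + Int is an Int.
  def countAsShort(scanner: FakeScanner): Short = {
    var numCells: Short = 0
    while (scanner.advance()) numCells = (numCells + 1).toShort
    numCells
  }
}
```

Option 1 is usually the simpler choice unless the counter really has to be a `Short`; option 2 wraps around silently once the count exceeds `Short.MaxValue`.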




Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Yup, you should be able to do that with the APIs that are going into HBase.

Let me know if you need to chat about the problem and how to implement it
with the HBase APIs.

We have tried to cover every possible way to use HBase with Spark.  Let us
know if we missed anything; if we did, we will add it.




Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Stuff that people are using is here.

https://github.com/cloudera-labs/SparkOnHBase

The stuff going into HBase is here:
https://issues.apache.org/jira/browse/HBASE-13992

If you want to add things to the HBase ticket, let's do it in another JIRA,
like these:

https://issues.apache.org/jira/browse/HBASE-14149
https://issues.apache.org/jira/browse/HBASE-14150

This first JIRA is mainly about getting the Spark dependencies and a separate
module set up, so we can start filing additional JIRAs to add more
functionality.

The goal is to have the following in HBase by end of summer:

RDD and DStream Functions
1. BulkPut
2. BulkGet
3. BulkDelete
4. Foreach with connection
5. Map with connection
6. Distributed Scan
7. BulkLoad

DataFrame Functions
1. BulkPut
2. BulkGet
6. Distributed Scan
7. BulkLoad

If you think there should be more, let me know.

Ted Malaska





Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Sorry, this is more correct:

RDD and DStream Functions
1. BulkPut
2. BulkGet
3. BulkDelete
4. Foreach with connection
5. Map with connection
6. Distributed Scan
7. BulkLoad

DataFrame Functions
1. BulkPut
2. BulkGet
3. Foreach with connection
4. Map with connection
5. Distributed Scan
6. BulkLoad
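
As a rough illustration of what "Foreach with connection" is usually taken to mean: a connection is opened once per partition, handed to the user's function together with that partition's records, and closed afterwards, instead of being opened per record. The sketch below mimics that with plain Scala collections so it runs without Spark or HBase. `FakeConnection`, `ForeachWithConnection` and `foreachPartitionWithConnection` are hypothetical names for illustration only; they are not the SparkOnHBase or HBase API.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for an HBase connection; it only records what happens.
final class FakeConnection(val id: Int) {
  val puts: ArrayBuffer[String] = ArrayBuffer.empty[String]
  var closed = false
  def put(row: String): Unit = { puts += row; () }
  def close(): Unit = { closed = true }
}

object ForeachWithConnection {
  // Runs `f` once per partition, passing that partition's records and a
  // connection that is opened before and closed after the partition is
  // processed. Plain Seqs play the role of RDD partitions here.
  def foreachPartitionWithConnection[T](
      partitions: Seq[Seq[T]],
      open: Int => FakeConnection
  )(f: (Iterator[T], FakeConnection) => Unit): Seq[FakeConnection] =
    partitions.zipWithIndex.map { case (part, idx) =>
      val conn = open(idx)
      try { f(part.iterator, conn) } finally { conn.close() }
      conn // returned only so the caller can inspect what was written
    }
}
```

A call such as `ForeachWithConnection.foreachPartitionWithConnection(Seq(Seq("a", "b"), Seq("c")), (i: Int) => new FakeConnection(i)) { (rows, conn) => rows.foreach(conn.put) }` opens two connections, one per partition. A "BulkPut"-style helper can be seen as a special case where the per-record function only builds the write.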

