Re: Generalised Spark-HBase integration
Thanks Michal,

Just to share what I'm working on in a related area. A long time ago I built SparkOnHBase and put it into Cloudera Labs:
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

Recently I have also been working on getting this into HBase core. It will hopefully be in HBase core within the next couple of weeks:
https://issues.apache.org/jira/browse/HBASE-13992

Then I'm planning on adding DataFrame and bulk load support through:
https://issues.apache.org/jira/browse/HBASE-14149
https://issues.apache.org/jira/browse/HBASE-14150

Also, if you are interested, this is running today at at least half a dozen companies with Spark Streaming. Here is one blog post about a successful implementation:
http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/

Here is an additional example blog post that I put together:
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

Let me know if you have any questions, and let me know if you want to connect to join efforts.

Ted Malaska

On Tue, Jul 28, 2015 at 11:59 AM, Michal Haris michal.ha...@visualdna.com wrote:

Hi all,

For the last couple of months I've been working on large graph analytics, and along the way I have written an HBase-Spark integration from scratch, as none of the ones out there worked either in terms of scale or in the way they integrated with the RDD interface. This week I have generalised it into an (almost) Spark module, which works with the latest Spark and the new HBase API, so... sharing!:
https://github.com/michal-harish/spark-on-hbase

--
Michal Haris
Technical Architect
direct line: +44 (0) 207 749 0229
www.visualdna.com | t: +44 (0) 207 734 7033
31 Old Nichol Street London E2 7HR
Re: Generalised Spark-HBase integration
Oops, yes, I'm still messing with the repo on a daily basis... fixed.

On 28 July 2015 at 17:11, Ted Yu yuzhih...@gmail.com wrote:

I got a compilation error:

[INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling
[INFO] Compiling 18 source files to /home/hbase/s-on-hbase/target/classes at 1438099569598
[ERROR] /home/hbase/s-on-hbase/src/main/scala/org/apache/spark/hbase/examples/simple/HBaseTableSimple.scala:36: error: type mismatch;
[INFO]  found   : Int
[INFO]  required: Short
[INFO]     while (scanner.advance) numCells += 1
[INFO]                                      ^
[ERROR] one error found

FYI

--
Michal Haris
Re: Generalised Spark-HBase integration
Hi Ted,

Yes, the Cloudera blog and your code were my starting point, but I needed something more Spark-centric rather than centred on HBase. Basically I am doing a lot of ad-hoc transformations with RDDs that are based on HBase tables and then mutating them after a series of iterative (BSP-like) steps.

On 28 July 2015 at 17:06, Ted Malaska ted.mala...@cloudera.com wrote:

Thanks Michal,

Just to share what I'm working on in a related area. A long time ago I built SparkOnHBase and put it into Cloudera Labs:
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

Recently I have also been working on getting this into HBase core. It will hopefully be in HBase core within the next couple of weeks:
https://issues.apache.org/jira/browse/HBASE-13992

Then I'm planning on adding DataFrame and bulk load support through:
https://issues.apache.org/jira/browse/HBASE-14149
https://issues.apache.org/jira/browse/HBASE-14150

Also, if you are interested, this is running today at at least half a dozen companies with Spark Streaming. Here is one blog post about a successful implementation:
http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/

Here is an additional example blog post that I put together:
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

Let me know if you have any questions, and let me know if you want to connect to join efforts.

Ted Malaska

--
Michal Haris
Re: Generalised Spark-HBase integration
Brilliant! Will check it out.

Cheers
Jules

--
The Best Ideas Are Simple
Jules Damji
Developer Relations Community Outreach
jda...@hortonworks.com
http://hortonworks.com

On 7/28/15, 8:59 AM, Michal Haris michal.ha...@visualdna.com wrote:

Hi all,

For the last couple of months I've been working on large graph analytics, and along the way I have written an HBase-Spark integration from scratch, as none of the ones out there worked either in terms of scale or in the way they integrated with the RDD interface. This week I have generalised it into an (almost) Spark module, which works with the latest Spark and the new HBase API, so... sharing!:
https://github.com/michal-harish/spark-on-hbase
Re: Generalised Spark-HBase integration
Cool, will revisit. Is your latest code publicly visible somewhere?

On 28 July 2015 at 17:14, Ted Malaska ted.mala...@cloudera.com wrote:

Yup, you should be able to do that with the APIs that are going into HBase. Let me know if you need to chat about the problem and how to implement it with the HBase APIs. We have tried to cover every possible way to use HBase with Spark. Let us know if we missed anything; if we did, we will add it.

--
Michal Haris
Re: Generalised Spark-HBase integration
I got a compilation error:

[INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling
[INFO] Compiling 18 source files to /home/hbase/s-on-hbase/target/classes at 1438099569598
[ERROR] /home/hbase/s-on-hbase/src/main/scala/org/apache/spark/hbase/examples/simple/HBaseTableSimple.scala:36: error: type mismatch;
[INFO]  found   : Int
[INFO]  required: Short
[INFO]     while (scanner.advance) numCells += 1
[INFO]                                      ^
[ERROR] one error found

FYI

On Tue, Jul 28, 2015 at 8:59 AM, Michal Haris michal.ha...@visualdna.com wrote:

Hi all,

For the last couple of months I've been working on large graph analytics, and along the way I have written an HBase-Spark integration from scratch, as none of the ones out there worked either in terms of scale or in the way they integrated with the RDD interface. This week I have generalised it into an (almost) Spark module, which works with the latest Spark and the new HBase API, so... sharing!:
https://github.com/michal-harish/spark-on-hbase
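For anyone hitting the same type mismatch as in the compiler output above: the thread does not show how `numCells` is declared, but the message suggests it is a `Short`. In Scala, `numCells += 1` expands to `numCells = numCells + 1`, and `Short + Int` yields an `Int`, which cannot be assigned back to a `Short` without an explicit conversion. The sketch below shows two possible fixes. `FakeScanner` and both method names are hypothetical stand-ins, not code from the spark-on-hbase repository, and this is not necessarily the fix that was committed.

```scala
object CellCount {
  // Hypothetical stand-in for the scanner in the error message:
  // advance() returns true while there are cells left to read.
  final class FakeScanner(cells: Int) {
    private var remaining = cells
    def advance(): Boolean =
      if (remaining > 0) { remaining -= 1; true } else false
  }

  // Fix 1: count in an Int, so `+= 1` type-checks as written.
  def countCellsInt(scanner: FakeScanner): Int = {
    var numCells = 0
    while (scanner.advance()) numCells += 1
    numCells
  }

  // Fix 2: keep the Short and narrow the Int result explicitly.
  // Note that toShort wraps around silently above Short.MaxValue (32767).
  def countCellsShort(scanner: FakeScanner): Short = {
    var numCells: Short = 0
    while (scanner.advance()) numCells = (numCells + 1).toShort
    numCells
  }
}
```

Counting in an `Int` is probably the safer choice unless the count has to be stored in a two-byte field, since the `Short` variant can overflow for wide rows.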
Re: Generalised Spark-HBase integration
Yup, you should be able to do that with the APIs that are going into HBase. Let me know if you need to chat about the problem and how to implement it with the HBase APIs. We have tried to cover every possible way to use HBase with Spark. Let us know if we missed anything; if we did, we will add it.

On Tue, Jul 28, 2015 at 12:12 PM, Michal Haris michal.ha...@visualdna.com wrote:

Hi Ted,

Yes, the Cloudera blog and your code were my starting point, but I needed something more Spark-centric rather than centred on HBase. Basically I am doing a lot of ad-hoc transformations with RDDs that are based on HBase tables and then mutating them after a series of iterative (BSP-like) steps.
Re: Generalised Spark-HBase integration
The stuff that people are using is here:
https://github.com/cloudera-labs/SparkOnHBase

The stuff going into HBase is here:
https://issues.apache.org/jira/browse/HBASE-13992

If you want to add things to the HBase ticket, let's do it in another JIRA, like these:
https://issues.apache.org/jira/browse/HBASE-14149
https://issues.apache.org/jira/browse/HBASE-14150

This first JIRA is mainly about getting the Spark dependencies and a separate module set up, so that we can start creating further JIRAs to add functionality.

The goal is to have the following in HBase by end of summer:

RDD and DStream Functions
1. BulkPut
2. BulkGet
3. BulkDelete
4. Foreach with connection
5. Map with connection
6. Distributed Scan
7. BulkLoad

DataFrame Functions
1. BulkPut
2. BulkGet
6. Distributed Scan
7. BulkLoad

If you think there should be more, let me know.

Ted Malaska

On Tue, Jul 28, 2015 at 12:17 PM, Michal Haris michal.ha...@visualdna.com wrote:

Cool, will revisit. Is your latest code publicly visible somewhere?
Re: Generalised Spark-HBase integration
Sorry, this is more correct:

RDD and DStream Functions
1. BulkPut
2. BulkGet
3. BulkDelete
4. Foreach with connection
5. Map with connection
6. Distributed Scan
7. BulkLoad

DataFrame Functions
1. BulkPut
2. BulkGet
3. Foreach with connection
4. Map with connection
5. Distributed Scan
6. BulkLoad

On Tue, Jul 28, 2015 at 12:23 PM, Ted Malaska ted.mala...@cloudera.com wrote:

The stuff that people are using is here:
https://github.com/cloudera-labs/SparkOnHBase

The stuff going into HBase is here:
https://issues.apache.org/jira/browse/HBASE-13992

If you want to add things to the HBase ticket, let's do it in another JIRA, like these:
https://issues.apache.org/jira/browse/HBASE-14149
https://issues.apache.org/jira/browse/HBASE-14150

This first JIRA is mainly about getting the Spark dependencies and a separate module set up, so that we can start creating further JIRAs to add functionality.

The goal is to have the following in HBase by end of summer:

RDD and DStream Functions
1. BulkPut
2. BulkGet
3. BulkDelete
4. Foreach with connection
5. Map with connection
6. Distributed Scan
7. BulkLoad

DataFrame Functions
1. BulkPut
2. BulkGet
6. Distributed Scan
7. BulkLoad

If you think there should be more, let me know.

Ted Malaska
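To make the "Foreach with connection" and "Map with connection" entries in the lists above a little more concrete, here is a rough sketch of the general idea as I understand it: open one connection per partition, hand it to a user function along with that partition's records, and close it afterwards. This is plain Scala over in-memory sequences, with no Spark or HBase dependency, so that it can be run on its own. `FakeConnection`, `foreachPartitionWithConnection` and `mapPartitionsWithConnection` are hypothetical names chosen for illustration; they are not the SparkOnHBase or HBase API.

```scala
object WithConnectionSketch {
  // Hypothetical stand-in for an HBase connection; not the real client API.
  final class FakeConnection(val id: Int) {
    var closed = false
    def close(): Unit = { closed = true }
  }

  // Number of connections opened so far, kept only for demonstration.
  var opened = 0
  private def openConnection(): FakeConnection = {
    opened += 1
    new FakeConnection(opened)
  }

  // "Foreach with connection": one connection per partition,
  // shared by every record in that partition.
  def foreachPartitionWithConnection[T](partitions: Seq[Seq[T]])(
      f: (Iterator[T], FakeConnection) => Unit): Unit =
    partitions.foreach { part =>
      val conn = openConnection()
      try f(part.iterator, conn) finally conn.close()
    }

  // "Map with connection": same idea, but each partition yields results.
  // toList forces the lazy iterator before the connection is closed.
  def mapPartitionsWithConnection[T, R](partitions: Seq[Seq[T]])(
      f: (Iterator[T], FakeConnection) => Iterator[R]): Seq[Seq[R]] =
    partitions.map { part =>
      val conn = openConnection()
      try f(part.iterator, conn).toList finally conn.close()
    }
}
```

A possible usage, again only as an illustration:

```scala
val parts = Seq(Seq("row1", "row2"), Seq("row3"))
val tagged = WithConnectionSketch.mapPartitionsWithConnection(parts) {
  (rows, conn) => rows.map(r => s"$r via connection ${conn.id}")
}
```

The point of the per-partition shape is that connection setup is paid once per partition rather than once per record.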