Re: RE: RE: RE: Package Release Announcement: Spark SQL on HBase Astro

2015-08-13 Thread Ted Malaska
Cool, it seems like the designs are very close.

Here is my latest blog post on my work with HBase and Spark.  Let me know if you
have any questions.  There should be two more blog posts next month covering
bulk load through Spark (HBASE-14150), which is committed, and Spark SQL
(HBASE-14181), which should be done next week.

http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/

On Wed, Aug 12, 2015 at 12:18 AM, Yan Zhou.sc yan.zhou...@huawei.com
wrote:

 We are using MR-based bulk loading on Spark.



 For filter pushdown, Astro does partition pruning and scan range pruning, and
 uses Gets as much as possible.
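
 To make the partition-pruning idea concrete, here is a minimal sketch in plain Python. It is not Astro's code; the function name `prune_regions` and the convention that an empty string means an open bound are assumptions made for illustration. Given sorted region start keys and a row-key scan range, it returns only the regions that can contain matching rows.

```python
from bisect import bisect_left, bisect_right

def prune_regions(region_start_keys, scan_start, scan_stop):
    """Return indexes of regions whose key range overlaps [scan_start, scan_stop).

    region_start_keys is sorted and begins with "" (open lower bound);
    region i covers [start_i, start_{i+1}). An empty scan_stop means
    "no upper bound" (an assumption that loosely mirrors HBase scans).
    """
    # Last region whose start key is <= scan_start.
    first = max(bisect_right(region_start_keys, scan_start) - 1, 0)
    if scan_stop == "":
        last = len(region_start_keys) - 1
    else:
        # The stop key is exclusive, so a region starting exactly at it is skipped.
        last = bisect_left(region_start_keys, scan_stop) - 1
    return list(range(first, last + 1))

# Four regions: ["", g), [g, n), [n, t), [t, end)
print(prune_regions(["", "g", "n", "t"], "h", "p"))
```

 With these hypothetical boundaries, a scan over ["h", "p") only needs to touch the second and third regions; the other two are pruned before any data is read.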



 Thanks,





 *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
 *Sent:* August 12, 2015 9:14
 *To:* Yan Zhou.sc
 *Cc:* dev@spark.apache.org; Bing Xiao (Bing); Ted Yu; user
 *Subject:* RE: RE: RE: Package Release Announcement: Spark SQL on HBase Astro



 There are a number of ways to bulk load.

 There is bulk put, partition bulk put, MR bulk load, and now HBASE-14150,
 which is Spark shuffle bulk load.

 Let me know if I have missed a bulk loading option.  All of these are possible
 with the new hbase-spark module.

 As for the filter pushdown discussion in the previous email, you will note in
 HBASE-14181 that the filter pushdown will also limit the scan range, or drop
 the scan altogether in favor of Gets.

 Ted Malaska

 On Aug 11, 2015 9:06 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 No, the Astro bulk loader does not use its own shuffle. But its map/reduce-side
 processing is somewhat different from that of HBase’s bulk loader, which I
 believe is used by many HBase apps.



 *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
 *Sent:* Wednesday, August 12, 2015 8:56 AM
 *To:* Yan Zhou.sc
 *Cc:* dev@spark.apache.org; Ted Yu; Bing Xiao (Bing); user
 *Subject:* RE: RE: RE: Package Release Announcement: Spark SQL on HBase
 Astro



 The bulk load code is HBASE-14150 if you are interested.  Let me know how it can be
 made faster.

 It's just a Spark shuffle and writing HFiles.  Unless Astro wrote its
 own shuffle, the times should be very close.


RE: RE: RE: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Malaska
The bulk load code is HBASE-14150 if you are interested.  Let me know how it can be
made faster.

It's just a Spark shuffle and writing HFiles.  Unless Astro wrote its own
shuffle, the times should be very close.
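
As a rough illustration of what "a shuffle and writing HFiles" amounts to, here is a minimal plain-Python sketch. It is not the HBASE-14150 code; `shuffle_for_bulk_load` and its arguments are hypothetical names, and in-memory lists stand in for both the Spark shuffle and the HFile writers. The idea is to route every cell to the region that owns its row key and then sort each region's cells, since an HFile stores its entries in key order.

```python
from bisect import bisect_right
from collections import defaultdict

def shuffle_for_bulk_load(cells, region_start_keys):
    """Group (row_key, value) cells by target region and sort each group.

    region_start_keys is sorted and begins with "" so every key maps to a
    region. Each returned list stands in for the contents of one HFile,
    which has to be written in row-key order.
    """
    partitions = defaultdict(list)
    for row_key, value in cells:
        # Index of the last region whose start key is <= row_key.
        region = bisect_right(region_start_keys, row_key) - 1
        partitions[region].append((row_key, value))
    return {region: sorted(rows) for region, rows in sorted(partitions.items())}

cells = [("m1", "a"), ("b2", "b"), ("z9", "c"), ("a1", "d")]
for region, rows in shuffle_for_bulk_load(cells, ["", "h", "q"]).items():
    print(region, rows)
```

In a real job the partitioning and the per-partition sort would be done by the shuffle itself, which is why two implementations built on the same shuffle would be expected to perform similarly.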
On Aug 11, 2015 8:49 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 Ted,



 Thanks for pointing out more details of HBase-14181. I am afraid I may
 still need to learn more before I can make very accurate and pointed
 comments.



 As for filter pushdown, Astro has a powerful approach that breaks down
 arbitrarily complex logical expressions comprising AND/OR/IN/NOT to generate
 partition-specific predicates to be pushed down to HBase. This may not be a
 significant performance improvement if the filter logic is simple and/or the
 processing is IO-bound, but it could be for online ad-hoc analysis.
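
 A minimal sketch of what partition-specific predicates could look like, written in plain Python. This is not Astro's implementation; the tuple-based expression format and the names `eval_cmp` and `partial_eval` are assumptions for illustration. Each comparison on the row key is evaluated against a partition's key range [lo, hi): if it is true or false for every key in the partition it is replaced by a constant, and the AND/OR/NOT tree is simplified around it. An IN list can be expressed as an OR of "=" comparisons.

```python
def eval_cmp(op, value, lo, hi):
    """Evaluate `rowkey <op> value` over a partition covering [lo, hi).

    Returns True/False when the answer is the same for every key in the
    partition, or None when it depends on the individual row.
    hi=None means the partition has no upper bound.
    """
    if op == "<":
        if hi is not None and hi <= value:
            return True       # every key < hi <= value
        if lo >= value:
            return False      # every key >= lo >= value
    elif op == ">=":
        if lo >= value:
            return True
        if hi is not None and hi <= value:
            return False
    elif op == "=":
        if value < lo or (hi is not None and value >= hi):
            return False      # value lies outside the partition
    return None

def partial_eval(expr, lo, hi):
    """Simplify a predicate tree for one partition.

    expr is True/False or a tuple: ("cmp", op, value), ("not", e),
    ("and", e1, e2), ("or", e1, e2). The result is True (no filter needed),
    False (the partition can be skipped) or a smaller residual tree.
    """
    if isinstance(expr, bool):
        return expr
    kind = expr[0]
    if kind == "cmp":
        known = eval_cmp(expr[1], expr[2], lo, hi)
        return expr if known is None else known
    if kind == "not":
        inner = partial_eval(expr[1], lo, hi)
        return (not inner) if isinstance(inner, bool) else ("not", inner)
    if kind not in ("and", "or"):
        raise ValueError(f"unknown node: {kind}")
    left = partial_eval(expr[1], lo, hi)
    right = partial_eval(expr[2], lo, hi)
    absorbing = kind == "or"          # True absorbs OR, False absorbs AND
    if left is absorbing or right is absorbing:
        return absorbing
    if isinstance(left, bool):        # neutral constant: keep the other side
        return right
    if isinstance(right, bool):
        return left
    return (kind, left, right)

# rowkey >= "g" AND (rowkey < "k" OR rowkey = "x")
pred = ("and", ("cmp", ">=", "g"), ("or", ("cmp", "<", "k"), ("cmp", "=", "x")))
for lo, hi in [("a", "g"), ("g", "k"), ("k", "t"), ("t", None)]:
    print((lo, hi), partial_eval(pred, lo, hi))
```

 For the four hypothetical partitions above, the first and third are skipped entirely, the second needs no filter at all, and only the last keeps a residual predicate to push down.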



 For UDFs, Astro supports them both inside and outside of HBase custom filters.



 For secondary indexes, Astro does not support them now. With the probable
 support by HBase in the future (thanks to Ted Yu’s comments a while ago), we
 could add this support along with its specific optimizations.



 For bulk load, Astro has a much faster way to load the tabular data, we
 believe.



 Right now, Astro’s filter pushdown is through HBase built-in filters and
 custom filters.



 As for HBASE-14181, I see some overlap with Astro. Both depend on
 Spark SQL, both support Spark DataFrame as an access interface, and
 both support predicate pushdown.

 Astro is not designed for MR (or Spark’s equivalent) access though.



 If HBASE-14181 is aiming for access to HBase data through a subset of
 DataFrame functionality such as filter, projection, and other map-side ops,
 would it be feasible to decouple it from Spark?

 My understanding is that HBASE-14181 does not run the Spark execution engine
 at all, but will make use of Spark DataFrame semantics and/or logical planning
 to pass a logical (sub-)plan to HBase. If true, it might be desirable to
 directly support DataFrame in HBase.



 Thanks,





 *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
 *Sent:* Wednesday, August 12, 2015 7:28 AM
 *To:* Yan Zhou.sc
 *Cc:* user; dev@spark.apache.org; Bing Xiao (Bing); Ted Yu
 *Subject:* RE: RE: RE: Package Release Announcement: Spark SQL on HBase
 Astro



 Hey Yan,

 I've been the one building out this Spark functionality in HBase, so maybe
 I can help clarify.

 The hbase-spark module is just focused on making Spark integration with
 HBase easy and out of the box for both Spark and Spark Streaming.

 Neither I nor, I believe, the HBase team has any desire to build a SQL engine
 in HBase.  This JIRA comes the closest to that line.  The main thing here is
 filter pushdown logic for basic SQL operations like =, 
 , and .  User-defined functions and secondary indexes are not in my scope.

 Another main goal of the hbase-spark module is to allow a user to do with
 Spark/HBase anything they did with MR/HBase.  Things like bulk load.

 Let me know if you have any questions.

 Ted Malaska

 On Aug 11, 2015 7:13 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 We have not “formally” published any numbers yet. A good reference is the
 slide deck we posted for the meetup in March or, better yet, interested
 parties can run performance comparisons themselves for now.



 As for the current status of Astro, we have been focusing on fixing bugs
 (a UDF-related bug in some coprocessor/custom filter combos) and adding support
 for querying string columns in HBase as integers from Astro.



 Thanks,




RE: RE: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
Finally I can take a look at HBASE-14181 now. Unfortunately there is no design
doc mentioned. Superficially it is very similar to Astro, with the difference
that it is part of the HBase client library, while Astro works as a Spark
package and so will evolve and function more closely with Spark SQL/DataFrame
than with HBase.

In terms of architecture, my take is that it is loosely-coupled query engines on
top of a KV store vs. an array of query engines supported by, and packaged as
part of, a KV store.

Functionality-wise the two could be close, but Astro also supports Python as a
result of its tight integration with Spark.
It will be interesting to see performance comparisons when HBASE-14181 is ready.

Thanks,


From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, August 11, 2015 3:28 PM
To: Yan Zhou.sc
Cc: Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
Subject: Re: RE: Package Release Announcement: Spark SQL on HBase Astro

HBase will not have a query engine.

It will provide better support to query engines.

Cheers


On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:
Ted,

I’m in China now, and seem to have difficulty accessing Apache JIRA.
Anyway, it appears to me that HBASE-14181
(https://issues.apache.org/jira/browse/HBASE-14181) attempts to support
Spark DataFrame inside HBase.
If true, one question for me is whether HBase is intended to have a built-in
query engine or not. Or will it stick with the current approach as a k-v store
with some built-in processing capabilities in the form of coprocessors, custom
filters, etc., which allows for loosely-coupled query engines built on top of it?

Thanks,

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: August 11, 2015 8:54
To: Bing Xiao (Bing)
Cc: dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
Subject: Re: Package Release Announcement: Spark SQL on HBase Astro

Yan / Bing:
Mind taking a look at HBASE-14181
(https://issues.apache.org/jira/browse/HBASE-14181) 'Add Spark DataFrame
DataSource to HBase-Spark Module'?

Thanks

On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote:
We are happy to announce the availability of the Spark SQL on HBase 1.0.0 
release.  http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
The main features in this package, dubbed “Astro”, include:

• Systematic and powerful handling of data pruning and intelligent 
scan, based on partial evaluation technique

• HBase pushdown capabilities like custom filters and coprocessor to 
support ultra low latency processing

• SQL, Data Frame support

• More SQL capabilities made possible (Secondary index, bloom filter, 
Primary Key, Bulk load, Update)

• Joins with data from other sources

• Python/Java/Scala support

• Support latest Spark 1.4.0 release


The tests by the Huawei team and community contributors covered the following
areas: bulk load; projection pruning; partition pruning; partial evaluation;
code generation; coprocessor; custom filtering; DML; complex filtering on keys
and non-keys; join/union with non-HBase data; Data Frame; multi-column family
test.  We will post the test results, including performance tests, in the
middle of August.
You are very welcome to try out or deploy the package, and to help improve the
integration tests with various combinations of settings, extensive Data
Frame tests, complex join/union tests and extensive performance tests.  Please
use the “Issues” and “Pull Requests” links on the package homepage if you want
to report bugs, improvements or feature requests.
Special thanks to project owner and technical leader Yan Zhou, the Huawei global
team, community contributors and Databricks.  Databricks has provided
great assistance from the design to the release.
“Astro”, the Spark SQL on HBase package, will be useful for ultra-low-latency
query and analytics of large-scale data sets in vertical enterprises. We will
continue to work with the community to develop new features and improve the code
base.  Your comments and suggestions are greatly appreciated.

Yan Zhou / Bing Xiao
Huawei Big Data team




Re: RE: RE: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Yu
Yan:
Where can I find performance numbers for Astro (it's close to the middle of
August)?

Cheers









RE: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
Ted,

I’m in China now, and seem to have difficulty accessing Apache JIRA.
Anyway, it appears to me that HBASE-14181
(https://issues.apache.org/jira/browse/HBASE-14181) attempts to support
Spark DataFrame inside HBase.
If true, one question for me is whether HBase is intended to have a built-in
query engine or not. Or will it stick with the current approach as a k-v store
with some built-in processing capabilities in the form of coprocessors, custom
filters, etc., which allows for loosely-coupled query engines built on top of it?

Thanks,





RE: RE: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
OK. Then the question will be how to define the boundary between a query engine
and built-in processing. If, for instance, the Spark DataFrame functionality
involving shuffling is to be supported inside HBase,
in my opinion, it’d be hard not to tag it as a query engine. If, on the other
hand, only map-side ops from DataFrame are to be supported inside HBase, then
Astro’s coprocessor already has those capabilities.

Again, I still have no full knowledge of HBASE-14181 beyond your description
in email. So my opinion above might be skewed as a result.

Regards,

Yan





Re: RE: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Yu
HBase will not have a query engine.

It will provide better support to query engines. 

Cheers





RE: RE: RE: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Malaska
There are a number of ways to bulk load.

There is bulk put, partition bulk put, MR bulk load, and now HBASE-14150,
which is Spark shuffle bulk load.

Let me know if I have missed a bulk loading option.  All of these are possible
with the new hbase-spark module.

As for the filter pushdown discussion in the previous email, you will note in
HBASE-14181 that the filter pushdown will also limit the scan range, or drop
the scan altogether in favor of Gets.
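
To illustrate the scan-range versus Gets decision, here is a minimal plain-Python sketch. It is not the HBASE-14181 code; the function name `plan_rowkey_access`, the operator set ("=", ">=", "<") and the tuple return format are all assumptions for illustration. ANDed predicates on the row key either collapse into a point lookup (a Get) or narrow the start and stop keys of a single scan.

```python
def plan_rowkey_access(predicates):
    """Turn ANDed row-key predicates into point Gets or one scan range.

    predicates: list of (op, value) pairs with op in {"=", ">=", "<"}.
    Returns ("gets", [keys]) or ("scan", start, stop), where "" means an
    open bound and stop is exclusive.
    """
    start, stop = "", ""
    equals = None
    for op, value in predicates:
        if op == "=":
            if equals is not None and equals != value:
                return ("gets", [])           # contradictory equalities
            equals = value
        elif op == ">=":
            start = max(start, value)         # tighten the lower bound
        elif op == "<":
            stop = value if stop == "" else min(stop, value)
        else:
            raise ValueError(f"unsupported operator: {op}")
    if equals is not None:
        # An equality makes the scan unnecessary: issue a Get if the key
        # also satisfies the range predicates, otherwise nothing at all.
        in_range = equals >= start and (stop == "" or equals < stop)
        return ("gets", [equals] if in_range else [])
    return ("scan", start, stop)

print(plan_rowkey_access([("=", "row42")]))
print(plan_rowkey_access([(">=", "a"), ("<", "m")]))
```

The first call needs no scan at all, while the second reads only the keys between the two bounds instead of the whole table.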

Ted Malaska
On Aug 11, 2015 9:06 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 No, Astro bulkloader does not use its own shuffle. But map/reduce-side
 processing is somewhat different from HBase’s bulk loader that are used by
 many HBase apps I believe.




RE: RE: RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
We are using MR-based bulk loading on Spark.

For filter pushdown, Astro does partition pruning and scan range pruning, and
uses Gets as much as possible.
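
As a rough illustration of the idea above (not Astro's actual code), a row-key predicate can be reduced to point Gets and bounded Scans, and a Scan can then be checked against region boundaries so that non-overlapping regions are skipped. The function names `plan_rowkey_access` and `prune_regions` and the tuple-based predicate encoding are hypothetical, chosen only for this sketch.

```python
from typing import List, Optional, Tuple

# Predicate forms used in this sketch (hypothetical encoding):
#   ("eq", key)            rowkey = key
#   ("in", [k1, k2, ...])  rowkey IN (...)
#   ("range", lo, hi)      lo <= rowkey < hi; None means unbounded

Range = Tuple[Optional[str], Optional[str]]


def plan_rowkey_access(pred) -> Tuple[List[str], List[Range]]:
    """Return (gets, scans): point lookups and [start, stop) scan ranges."""
    kind = pred[0]
    if kind == "eq":
        return [pred[1]], []
    if kind == "in":
        return sorted(set(pred[1])), []
    if kind == "range":
        return [], [(pred[1], pred[2])]
    # Unknown predicate: fall back to a full table scan.
    return [], [(None, None)]


def prune_regions(scan: Range, regions: List[Range]) -> List[Range]:
    """Keep only regions whose [start, end) key range overlaps the scan."""
    lo, hi = scan
    kept = []
    for start, end in regions:
        starts_before_scan_ends = hi is None or start is None or start < hi
        ends_after_scan_starts = lo is None or end is None or end > lo
        if starts_before_scan_ends and ends_after_scan_starts:
            kept.append((start, end))
    return kept


# Three regions split at "g" and "n"; None marks an open end.
regions = [(None, "g"), ("g", "n"), ("n", None)]
gets, scans = plan_rowkey_access(("in", ["k2", "a1", "k2"]))
pruned = prune_regions(("h", "k"), regions)
```

Here the IN list becomes two de-duplicated Gets with no Scan at all, and the range `["h", "k")` touches only the middle region.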

Thanks,


From: Ted Malaska [mailto:ted.mala...@cloudera.com]
Sent: August 12, 2015 9:14
To: Yan Zhou.sc
Cc: dev@spark.apache.org; Bing Xiao (Bing); Ted Yu; user
Subject: RE: RE: RE: Package Release Annoucement: Spark SQL on HBase Astro


There are a number of ways to bulk load.

There is bulk put, partition bulk put, MR bulk load, and now HBASE-14150, which
is Spark shuffle bulk load.

Let me know if I have missed a bulk loading option.  All of these are possible
with the new hbase-spark module.

As for the filter pushdown discussion in the earlier email: you will note in
14181 that the filter pushdown will also limit the scan range, or drop the scan
altogether in favor of gets.

Ted Malaska

RE: RE: RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
Ted,

Thanks for pointing out more details of HBase-14181. I am afraid I may still 
need to learn more before I can make very accurate and pointed comments.

As for filter pushdown, Astro has a powerful approach: it breaks down
arbitrarily complex logical expressions composed of AND/OR/IN/NOT to generate
partition-specific predicates that are pushed down to HBase. This may not
improve performance significantly if the filter logic is simple and/or the
processing is IO-bound, but it could for online ad-hoc analysis.
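
To make the paragraph above concrete, here is a minimal sketch of partially evaluating a predicate against one partition's row-key range. It is only an illustration under stated assumptions, not Astro's implementation: the expression encoding, the `partial_eval` name, integer row keys, and the single `lt` comparison are all hypothetical simplifications.

```python
TRUE, FALSE = ("const", True), ("const", False)


def partial_eval(expr, lo, hi):
    """Simplify expr for a partition whose integer row keys lie in [lo, hi).

    Returns TRUE, FALSE, or a residual expression that still has to be
    checked per row (for example by a filter pushed down to the store).
    """
    op = expr[0]
    if op == "const":
        return expr
    if op == "lt":                      # ("lt", column, value)
        _, col, v = expr
        if col != "key":
            return expr                 # non-key column: cannot decide here
        if hi <= v:
            return TRUE                 # every key in the partition is < v
        if lo >= v:
            return FALSE                # no key in the partition is < v
        return expr
    if op == "in":                      # ("in", column, values)
        _, col, values = expr
        if col != "key":
            return expr
        inside = [v for v in values if lo <= v < hi]
        return ("in", col, inside) if inside else FALSE
    if op == "not":                     # ("not", expr)
        inner = partial_eval(expr[1], lo, hi)
        if inner == TRUE:
            return FALSE
        if inner == FALSE:
            return TRUE
        return ("not", inner)
    if op in ("and", "or"):             # ("and" | "or", left, right)
        left = partial_eval(expr[1], lo, hi)
        right = partial_eval(expr[2], lo, hi)
        absorbing, neutral = (FALSE, TRUE) if op == "and" else (TRUE, FALSE)
        if absorbing in (left, right):
            return absorbing
        if left == neutral:
            return right
        if right == neutral:
            return left
        return (op, left, right)
    raise ValueError(f"unsupported operator: {op}")


pred = ("and",
        ("lt", "key", 100),
        ("or", ("in", "key", [5, 250]), ("lt", "price", 10)))

first = partial_eval(pred, 0, 50)       # residual filter for keys in [0, 50)
second = partial_eval(pred, 100, 200)   # keys in [100, 200) can be skipped
```

A partition whose result is `FALSE` can be pruned entirely, one whose result is `TRUE` needs no filter, and anything else is the smaller, partition-specific predicate left to push down.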

For UDFs, Astro supports them both inside and outside of the HBase custom
filter.

For secondary indexes, Astro does not support them now. With probable support
by HBase in the future (thanks to Ted Yu’s comments a while ago), we could add
this support along with its specific optimizations.

For bulk load, Astro has a much faster way to load the tabular data, we believe.

Right now, Astro’s filter pushdown is through HBase built-in filters and custom 
filter.

As for HBase-14181, I see some overlap with Astro. Both depend on Spark SQL,
both support the Spark DataFrame as an access interface, and both support
predicate pushdown.
Astro is not designed for MR (or Spark’s equivalent) access, though.

If HBase-14181 is aiming for access to HBase data through a subset of
DataFrame functionality such as filter, projection, and other map-side ops,
would it be feasible to decouple it from Spark?
My understanding is that 14181 does not run the Spark execution engine at all,
but will make use of Spark DataFrame semantics and/or logical planning to pass
a logical (sub-)plan to HBase. If true, it might be desirable to support
DataFrames directly in HBase.

Thanks,



RE: RE: RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
No, the Astro bulk loader does not use its own shuffle. But its map/reduce-side
processing is somewhat different from that of HBase’s bulk loader, which I
believe is used by many HBase apps.

From: Ted Malaska [mailto:ted.mala...@cloudera.com]
Sent: Wednesday, August 12, 2015 8:56 AM
To: Yan Zhou.sc
Cc: dev@spark.apache.org; Ted Yu; Bing Xiao (Bing); user
Subject: RE: RE: RE: Package Release Annoucement: Spark SQL on HBase Astro


The bulk load code is in HBASE-14150 if you are interested.  Let me know how it
can be made faster.

It's just a Spark shuffle and writing HFiles.   Unless Astro wrote its own
shuffle, the times should be very close.
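
The "shuffle, then write HFiles" shape described above can be sketched in plain Python, with a dictionary standing in for the shuffle. This is an assumption-laden illustration rather than the HBASE-14150 code: `shuffle_for_bulk_load` and its inputs are hypothetical names, and real bulk loading writes HFiles instead of returning lists.

```python
from bisect import bisect_right
from collections import defaultdict


def shuffle_for_bulk_load(records, split_keys):
    """Group (rowkey, value) pairs by target region and sort each group.

    split_keys are the sorted region start keys, excluding the first region's
    empty start key, so N split keys describe N + 1 regions.
    """
    by_region = defaultdict(list)
    for rowkey, value in records:
        region = bisect_right(split_keys, rowkey)  # index of the owning region
        by_region[region].append((rowkey, value))
    # Each region's file must be written in ascending row-key order.
    return {region: sorted(rows) for region, rows in sorted(by_region.items())}


records = [("m3", "c"), ("a9", "a"), ("z1", "d"), ("a1", "b"), ("g0", "e")]
files = shuffle_for_bulk_load(records, split_keys=["g", "n"])
```

Routing by region boundary and sorting within each group is the bulk of the work, which is consistent with the point that two loaders built on the same shuffle should take similar time.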

RE: RE: RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Malaska
Hey Yan,

I've been the one building out this Spark functionality in HBase, so maybe I
can help clarify.

The hbase-spark module is just focused on making Spark integration with
HBase easy and out of the box for both Spark and Spark Streaming.

Neither I nor, I believe, the HBase team has any desire to build a SQL engine
in HBase.  This JIRA comes the closest to that line.  The main thing here is
filter pushdown logic for basic SQL operations like =, <, and >.  User-defined
functions and secondary indexes are not in my scope.

Another main goal of the hbase-spark module is to allow a user to do with
Spark/HBase anything they did with MR/HBase.  Things like bulk load.

Let me know if you have any questions.

Ted Malaska

RE: RE: RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-08-11 Thread Yan Zhou.sc
We have not “formally” published any numbers yet. A good reference is a slide
deck we posted for the meetup in March; better yet, interested parties can run
performance comparisons themselves for now.

As for the status quo of Astro, we have been focusing on fixing bugs (a
UDF-related bug in some coprocessor/custom filter combos) and adding support
for querying string columns in HBase as integers from Astro.
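
For readers unfamiliar with why "string columns as integers" needs explicit support, the sketch below shows the general issue: HBase cells are raw bytes, so the same number decodes differently depending on how it was written. This is a generic illustration, not Astro's code; `cell_as_int` and the two encoding labels are hypothetical, and the 4-byte big-endian layout is an assumption about the binary case.

```python
import struct


def cell_as_int(raw: bytes, encoding: str) -> int:
    """Decode a cell value (raw bytes) as an integer.

    encoding="string": the bytes are ASCII digits, e.g. b"42".
    encoding="binary": the bytes are a 4-byte big-endian signed int
                       (assumed layout for this sketch).
    """
    if encoding == "string":
        return int(raw.decode("ascii"))
    if encoding == "binary":
        return struct.unpack(">i", raw)[0]
    raise ValueError(f"unknown encoding: {encoding}")


from_string = cell_as_int(b"42", "string")
from_binary = cell_as_int(b"\x00\x00\x00\x2a", "binary")

# Byte-wise ordering of string-encoded numbers differs from numeric ordering,
# which matters when a numeric range predicate is turned into a byte range.
byte_order = sorted([b"9", b"10"])
```

Both calls yield the same integer, but the byte-wise sort puts `b"10"` before `b"9"`, so a numeric comparison on string-encoded values cannot simply be mapped onto a byte-range scan.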

Thanks,

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, August 12, 2015 7:02 AM
To: Yan Zhou.sc
Cc: Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
Subject: Re: RE: RE: Package Release Annoucement: Spark SQL on HBase Astro

Yan:
Where can I find performance numbers for Astro (it's close to the middle of
August)?

Cheers

On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:
Finally I can take a look at HBASE-14181 now. Unfortunately, no design doc is
mentioned. Superficially it is very similar to Astro, the difference being that
it is part of the HBase client library, while Astro works as a Spark package
and so will evolve and function more closely with Spark SQL/DataFrame than with
HBase.

In terms of architecture, my take is that it is loosely coupled query engines
on top of a KV store vs. an array of query engines supported by, and packaged
as part of, a KV store.

Functionality-wise the two could be close, but Astro also supports Python as a
result of tight integration with Spark.
It will be interesting to see performance comparisons when HBase-14181 is ready.

Thanks,


From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, August 11, 2015 3:28 PM
To: Yan Zhou.sc
Cc: Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
Subject: Re: RE: Package Release Annoucement: Spark SQL on HBase Astro

HBase will not have a query engine.

It will provide better support to query engines.

Cheers

On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:
Ted,

I’m in China now, and seem to have difficulty accessing Apache Jira. Anyway,
it appears to me that HBASE-14181
(https://issues.apache.org/jira/browse/HBASE-14181) attempts to support Spark
DataFrame inside HBase.
If true, one question I have is whether HBase is intended to have a built-in
query engine or not, or whether it will stick with the current approach as a
k-v store with some built-in processing capabilities in the form of
coprocessors, custom filters, etc., which allows for loosely coupled query
engines built on top of it.

Thanks,

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: August 11, 2015 8:54
To: Bing Xiao (Bing)
Cc: dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
Subject: Re: Package Release Annoucement: Spark SQL on HBase Astro

Yan / Bing:
Mind taking a look at HBASE-14181
(https://issues.apache.org/jira/browse/HBASE-14181), 'Add Spark DataFrame
DataSource to HBase-Spark Module'?

Thanks


Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-08-10 Thread Ted Yu
Yan / Bing:
Mind taking a look at HBASE-14181
https://issues.apache.org/jira/browse/HBASE-14181 'Add Spark DataFrame
DataSource to HBase-Spark Module' ?

Thanks

On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com
wrote:

 We are happy to announce the availability of the Spark SQL on HBase 1.0.0
 release.
 http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

 The main features in this package, dubbed “Astro”, include:

 · Systematic and powerful handling of data pruning and
 intelligent scan, based on partial evaluation technique

 · HBase pushdown capabilities like custom filters and coprocessor
 to support ultra low latency processing

 · SQL, Data Frame support

 · More SQL capabilities made possible (Secondary index, bloom
 filter, Primary Key, Bulk load, Update)

 · Joins with data from other sources

 · Python/Java/Scala support

 · Support latest Spark 1.4.0 release



 The tests by the Huawei team and community contributors covered these areas:
 bulk load; projection pruning; partition pruning; partial evaluation; code
 generation; coprocessor; custom filtering; DML; complex filtering on keys
 and non-keys; join/union with non-HBase data; Data Frame; multi-column
 family test.  We will post the test results, including performance tests, in
 the middle of August.

 You are very welcome to try out or deploy the package, and to help improve
 the integration tests with various combinations of the settings, extensive
 Data Frame tests, complex join/union tests and extensive performance tests.
 Please use the “Issues” and “Pull Requests” links on the package homepage if
 you want to report bugs or request improvements or features.

 Special thanks to project owner and technical leader Yan Zhou, Huawei
 global team, community contributors and Databricks.   Databricks has been
 providing great assistance from the design to the release.

 “Astro”, the Spark SQL on HBase package, will be useful for ultra-low-latency
 query and analytics of large-scale data sets in vertical enterprises. We will
 continue to work with the community to develop new features and improve the
 code base.  Your comments and suggestions are greatly appreciated.



 Yan Zhou / Bing Xiao

 Huawei Big Data team





Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-08-03 Thread Ted Yu
When I tried to compile against hbase 1.1.1, I got:

[ERROR]
/home/hbase/ssoh/src/main/scala/org/apache/spark/sql/hbase/SparkSqlRegionObserver.scala:124:
overloaded method next needs result type
[ERROR]   override def next(result: java.util.List[Cell], limit: Int) =
next(result)

Is there a plan to support HBase 1.x?

Thanks


RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-08-03 Thread Yan Zhou.sc
HBase 1.0 should work fine, although we have not completed full testing yet.
Support for 1.1 should take only minimal effort to add.

Thanks,

Yan





Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-28 Thread Debasish Das
That's awesome Yan. I was considering Phoenix for SQL calls to HBase since
Cassandra supports CQL but HBase QL support was lacking. I will get back to
you as I start using it on our loads.

I am assuming the latencies won't be much different from accessing HBase
through tsdb asynchbase as that's one more option I am looking into.






RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-27 Thread Debasish Das
Hi Yan,

Is it possible to access the HBase table through the Spark SQL JDBC layer?

Thanks.
Deb





RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-27 Thread Yan Zhou.sc
HBase in this case is no different from any other Spark SQL data source, so
yes, you should be able to access HBase data through Astro from Spark SQL’s JDBC
interface.

Graphically, the access path is as follows:

Spark SQL JDBC Interface -> Spark SQL Parser/Analyzer/Optimizer -> Astro
Optimizer -> HBase Scans/Gets -> … -> HBase Region server
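
From the client side, the first hop in that path is ordinary JDBC. The following is a minimal sketch, assuming a Spark Thrift JDBC/ODBC server is running, the Hive JDBC driver is on the classpath, and an Astro-backed table is already visible to that server. The object name AstroJdbcSketch and its helpers are hypothetical, and how the table gets registered is not covered here.

```scala
import java.sql.{Connection, DriverManager, ResultSet}
import scala.collection.mutable.ArrayBuffer

object AstroJdbcSketch {
  // Spark's Thrift JDBC/ODBC server speaks the HiveServer2 protocol,
  // which is why the URL scheme is "jdbc:hive2".
  def thriftUrl(host: String, port: Int, database: String): String =
    s"jdbc:hive2://$host:$port/$database"

  // Runs one query and returns the first column of every row as a string.
  // Needs a reachable server; nothing in this object starts one.
  def firstColumn(url: String, sql: String): Seq[String] = {
    val conn: Connection = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      try {
        val rs: ResultSet = stmt.executeQuery(sql)
        val out = ArrayBuffer.empty[String]
        while (rs.next()) out += rs.getString(1)
        out.toList
      } finally stmt.close()
    } finally conn.close()
  }
}
```

A call would look like AstroJdbcSketch.firstColumn(AstroJdbcSketch.thriftUrl("localhost", 10000, "default"), "SELECT ..."), with the query written against whatever table name the deployment actually uses.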


Regards,

Yan




RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-22 Thread Yan Zhou.sc
Yes, but not all SQL-standard INSERT variants are supported.




Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-22 Thread Debasish Das
Does it also support insert operations?





Package Release Annoucement: Spark SQL on HBase Astro

2015-07-22 Thread Bing Xiao (Bing)
We are happy to announce the availability of the Spark SQL on HBase 1.0.0
release:
http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

The main features in this package, dubbed “Astro”, include:

* Systematic and powerful handling of data pruning and intelligent
scan, based on a partial evaluation technique

* HBase pushdown capabilities, such as custom filters and coprocessors, to
support ultra-low-latency processing

* SQL, Data Frame support

* More SQL capabilities made possible (Secondary index, bloom filter, 
Primary Key, Bulk load, Update)

* Joins with data from other sources

* Python/Java/Scala support

* Support latest Spark 1.4.0 release
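
To make the "data pruning" item concrete, here is a minimal sketch of partition pruning under simplifying assumptions: each HBase region is modelled as a half-open row-key range, keys are plain strings rather than byte arrays, and the query predicate has already been reduced to a single key range. KeyRange and Pruning are hypothetical names; this illustrates the general idea only and is not Astro's implementation or its partial evaluation technique.

```scala
// Hypothetical model: keys in [start, end); None means unbounded on that side.
final case class KeyRange(start: Option[String], end: Option[String])

object Pruning {
  // Two half-open ranges overlap when each one starts before the other ends.
  def overlaps(a: KeyRange, b: KeyRange): Boolean = {
    val aStartsBeforeBEnds = (a.start, b.end) match {
      case (Some(s), Some(e)) => s < e
      case _                  => true // an unbounded side never rules overlap out
    }
    val bStartsBeforeAEnds = (b.start, a.end) match {
      case (Some(s), Some(e)) => s < e
      case _                  => true
    }
    aStartsBeforeBEnds && bStartsBeforeAEnds
  }

  // Keeps only the regions that can contain a key matching the predicate range,
  // so the remaining regions never need to be scanned.
  def prune(regions: Seq[KeyRange], predicate: KeyRange): Seq[KeyRange] =
    regions.filter(r => overlaps(r, predicate))
}
```

For three regions split at "g" and "p", a predicate covering keys from "h" up to "k" leaves only the middle region to scan.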


The tests by the Huawei team and community contributors covered the following
areas: bulk load; projection pruning; partition pruning; partial evaluation;
code generation; coprocessor; custom filtering; DML; complex filtering on keys
and non-keys; join/union with non-HBase data; Data Frame; and multi-column
family tests. We will post the test results, including performance tests, in
the middle of August.

You are very welcome to try out or deploy the package, and to help improve the
integration tests with various combinations of settings, extensive Data Frame
tests, complex join/union tests, and extensive performance tests. Please use
the “Issues” and “Pull Requests” links on the package homepage to report bugs
or to request improvements or features.

Special thanks to project owner and technical leader Yan Zhou, the Huawei
global team, community contributors, and Databricks. Databricks has provided
great assistance from design through release.

“Astro”, the Spark SQL on HBase package, will be useful for ultra-low-latency
queries and analytics on large-scale data sets in vertical enterprises. We will
continue to work with the community to develop new features and improve the
code base. Your comments and suggestions are greatly appreciated.

Yan Zhou / Bing Xiao
Huawei Big Data team