[ANNOUNCE] Apache Phoenix 4.14 released

2018-06-11 Thread James Taylor
The Apache Phoenix team is pleased to announce the immediate availability
of the 4.14.0 release. Apache Phoenix enables SQL-based OLTP and
operational analytics for Apache Hadoop using Apache HBase as its backing
store and providing integration with other projects in the Apache ecosystem
such as Spark, Hive, Pig, Flume, and MapReduce.

Highlights of the release include:

* Support for HBase 1.4
* Support for CDH 5.11.2, 5.12.2, 5.13.2, and 5.14.2
* Support for GRANT and REVOKE commands (example below)
* Secondary index improvements
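
As a quick, illustrative example of the new GRANT and REVOKE support (the
table, schema, and user names below are made up; see the Phoenix grammar
documentation for the exact syntax and permission letters):

GRANT 'RX' ON WEB_STAT TO 'analyst1';
GRANT 'RWXCA' ON SCHEMA REPORTING TO GROUP 'admins';
REVOKE ON WEB_STAT FROM 'analyst1';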

For more details, visit our blog here [1] and download source and binaries
here [2].

Thanks,
James (on behalf of the Apache Phoenix team)

[1] https://blogs.apache.org/phoenix/entry/announcing-phoenix-4-14-released
[2] http://phoenix.apache.org/download.html


Re: [ANNOUNCE] Apache Phoenix 4.13.2 for CDH 5.11.2 released

2018-01-20 Thread James Taylor
On Sat, Jan 20, 2018 at 12:29 PM Pedro Boado  wrote:

> The Apache Phoenix team is pleased to announce the immediate availability
> of the 4.13.2 release for CDH 5.11.2. Apache Phoenix enables SQL-based OLTP
> and operational analytics for Apache Hadoop using Apache HBase as its
> backing store and providing integration with other projects in the Apache
> ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. This release is
> compatible with CDH 5.11.2.
>
> Highlights of the release include:
>
> * First-time support for CDH, added in response to community requests
> * More than 10 fixes over release 4.13.1-HBase-1.2
>
> Source and binary downloads are available here [1]. The 'parcels' folder can
> be used directly as a parcel repository from Cloudera Manager (follow [2] to
> one of the Apache mirrors).
>
> Thanks,
> Pedro (on behalf of the Apache Phoenix team)
>
> [1] http://phoenix.apache.org/download.html
> [2]
> http://www.apache.org/dyn/closer.lua/phoenix/apache-phoenix-4.13.2-cdh5.11.2/parcels/
> 
>


Re: [ANNOUNCE] Apache Phoenix 4.13 released

2017-11-19 Thread James Taylor
Hi Kumar,
I started a discussion [1][2] on the dev list to find an RM for the HBase
1.2 (and HBase 1.1) branch, but no one initially stepped up, so there were
no plans for a release. Subsequently we've heard from a few folks that they
needed it, and Pedro Boado volunteered to do a CDH-compatible release
(see PHOENIX-4372), which requires an up-to-date HBase 1.2-based release.

So I've volunteered to do one more Phoenix 4.13.1 release for HBase 1.2 and
1.1. I'm hoping you, Pedro, and others who need 1.2-based releases can
volunteer to be the RM and do further releases.

One thing is clear, though - folks need to be on the dev and user lists so
they can take part in DISCUSS threads.

Thanks,
James

[1]
https://lists.apache.org/thread.html/5b8b44acb1d3608770309767c3cddecbc6484c29452fe6750d8e1516@%3Cdev.phoenix.apache.org%3E
[2]
https://lists.apache.org/thread.html/70cffa798d5f21ef87b02e07aeca8c7982b0b30251411b7be17fadf9@%3Cdev.phoenix.apache.org%3E

On Sun, Nov 19, 2017 at 12:23 PM, Kumar Palaniappan <
kpalaniap...@marinsoftware.com> wrote:

> Are there any plans to release Phoenix 4.13 compatible with HBase 1.2?
>
> On Sat, Nov 11, 2017 at 5:57 PM, James Taylor <jamestay...@apache.org>
> wrote:
>
>> The Apache Phoenix team is pleased to announce the immediate availability
>> of the 4.13.0 release. Apache Phoenix enables SQL-based OLTP and
>> operational analytics for Apache Hadoop using Apache HBase as its backing
>> store and providing integration with other projects in the Apache ecosystem
>> such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are
>> compatible with HBase 0.98 and 1.3.
>>
>> Highlights of the release include:
>>
>> * Critical bug fix to prevent snapshot creation of SYSTEM.CATALOG when
>> connecting [1]
>> * Numerous bug fixes around handling of row deletion [2]
>> * Improvements to statistics collection [3]
>> * New COLLATION_KEY built-in function for linguistic sort [4]
>>
>> Source and binary downloads are available here [5].
>>
>> [1] https://issues.apache.org/jira/browse/PHOENIX-4335
>> [2] https://issues.apache.org/jira/issues/?jql=labels%20%3D%20rowDeletion
>> [3] https://issues.apache.org/jira/issues/?jql=labels%20%3D%20statsCollection
>> [4] https://phoenix.apache.org/language/functions.html#collation_key
>> [5] http://phoenix.apache.org/download.html
>>
>
>


[ANNOUNCE] Apache Phoenix 4.13 released

2017-11-11 Thread James Taylor
The Apache Phoenix team is pleased to announce the immediate availability
of the 4.13.0 release. Apache Phoenix enables SQL-based OLTP and
operational analytics for Apache Hadoop using Apache HBase as its backing
store and providing integration with other projects in the Apache ecosystem
such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are
compatible with HBase 0.98 and 1.3.

Highlights of the release include:

* Critical bug fix to prevent snapshot creation of SYSTEM.CATALOG when
connecting [1]
* Numerous bug fixes around handling of row deletion [2]
* Improvements to statistics collection [3]
* New COLLATION_KEY built-in function for linguistic sort [4]
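
As a small, hypothetical example of the new COLLATION_KEY function (the
table, column, and locale below are illustrative; see [4] for the full
signature):

SELECT NAME FROM CONTACTS
ORDER BY COLLATION_KEY(NAME, 'zh');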

Source and binary downloads are available here [5].

[1] https://issues.apache.org/jira/browse/PHOENIX-4335
[2] https://issues.apache.org/jira/issues/?jql=labels%20%3D%20rowDeletion
[3]
https://issues.apache.org/jira/issues/?jql=labels%20%3D%20statsCollection
[4] https://phoenix.apache.org/language/functions.html#collation_key
[5] http://phoenix.apache.org/download.html


[ANNOUNCE] Apache Phoenix 4.12 released

2017-10-11 Thread James Taylor
The Apache Phoenix team is pleased to announce the immediate availability
of the 4.12.0 release [1]. Apache Phoenix enables SQL-based OLTP and
operational analytics for Apache Hadoop using Apache HBase as its backing
store and providing integration with other projects in the Apache ecosystem
such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are
compatible with HBase 0.98/1.1/1.2/1.3.

Highlights of the release include:

* Improved scalability and reliability of global mutable secondary indexes
[2]
* Implemented index scrutiny tool to validate consistency between index and
data table [3]
* Introduced support for table sampling [4]
* Added support for APPROX_COUNT_DISTINCT aggregate function [5] (sketch below)
* Stabilized unit test runs
* Fixed 100+ issues [6]
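
A brief, hypothetical sketch combining table sampling and
APPROX_COUNT_DISTINCT (table and column names are made up; see [4] and [5]
for the exact semantics and the sampling-rate units):

-- approximate distinct-user count computed over a sample of the table
SELECT APPROX_COUNT_DISTINCT(USER_ID)
FROM WEB_EVENTS TABLESAMPLE(10);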

Source and binary downloads are available here [1].

[1] http://phoenix.apache.org/download.html
[2]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20PHOENIX%20AND%20status%3DResolved%20AND%20fixVersion%3D%224.12.0%22%20and%20labels%3D%22secondary_index%22
[3] http://phoenix.apache.org/secondary_indexing.html#Index_Scrutiny_Tool
[4] https://phoenix.apache.org/tablesample.html
[5] https://phoenix.apache.org/language/functions.html#approx_count_distinct
[6]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315120&version=12340844


[ANNOUNCE] Apache Phoenix 4.11 released

2017-07-07 Thread James Taylor
The Apache Phoenix team is pleased to announce the immediate availability
of the 4.11.0 release. Apache Phoenix enables SQL-based OLTP and
operational analytics for Apache Hadoop using Apache HBase as its backing
store and providing integration with other projects in the Apache ecosystem
such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are
compatible with HBase 0.98/1.1/1.2/1.3.

Highlights of the release include:

* Support for HBase 1.3.1 and above
* Local index hardening and performance improvements [1]
* Atomic update of data and local index rows (HBase 1.3 only) [2]
* Use of snapshots for MR-based queries and async index building [3][4]
* Support for forward moving cursors [5] (sketch below)
* Chunk commit data from client based on byte size and row count [6]
* Reduce load on region server hosting SYSTEM.CATALOG [7]
* 50+ bug fixes [8]
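
A rough sketch of the forward-moving cursor flow (statement and table names
are illustrative, and the grammar shown is only as I understand it; see [5]
for the authoritative syntax):

DECLARE CURSOR RECENT_EVENTS FOR SELECT * FROM EVENTS ORDER BY EVENT_TIME;
OPEN RECENT_EVENTS;
FETCH NEXT 100 ROWS FROM RECENT_EVENTS;
CLOSE RECENT_EVENTS;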

Source and binary downloads are available here [9].

Thanks,
The Apache Phoenix Team

[1]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20PHOENIX%20and%20fixVersion%3D4.11.0%20and%20labels%3DlocalIndex
[2] https://issues.apache.org/jira/browse/PHOENIX-3827
[3] https://issues.apache.org/jira/browse/PHOENIX-3744
[4] https://issues.apache.org/jira/browse/PHOENIX-3812
[5] https://issues.apache.org/jira/browse/PHOENIX-3572
[6] https://issues.apache.org/jira/browse/PHOENIX-3784
[7] https://issues.apache.org/jira/browse/PHOENIX-3819
[8]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315120&version=12339764
[9] http://phoenix.apache.org/download.html


[ANNOUNCE] Apache Phoenix 4.10 released

2017-03-23 Thread James Taylor
The Apache Phoenix team is pleased to announce the immediate availability
of the 4.10.0 release. Apache Phoenix enables SQL-based OLTP and
operational analytics for Hadoop using Apache HBase as its backing store
and providing integration with other projects in the ecosystem such as
Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are compatible
with HBase 0.98/1.1/1.2.

Highlights of the release include:

* Reduce on disk footprint through column encoding and optimized storage
format for write-once data [1] (see the sketch after this list)
* Support Apache Spark 2.0 in Phoenix/Spark integration [2]
* Consume Apache Kafka messages through Phoenix [3]
* Improve UPSERT SELECT performance by distributing execution across
cluster [4]
* Improve Hive integration [5]
* 40+ bug fixes [6]
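
A hedged sketch of the new storage options (the table definition is made up;
the IMMUTABLE_ROWS and COLUMN_ENCODED_BYTES properties, and the immutable
storage scheme choices, are described in [1]):

CREATE TABLE METRICS (
    HOST VARCHAR NOT NULL,
    METRIC_TIME DATE NOT NULL,
    METRIC_VALUE DOUBLE,
    CONSTRAINT PK PRIMARY KEY (HOST, METRIC_TIME))
    IMMUTABLE_ROWS = true, COLUMN_ENCODED_BYTES = 1;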

Source and binary downloads are available here [7].

Thanks,
The Apache Phoenix Team

[1] https://blogs.apache.org/phoenix/entry/column-mapping-and-immutable-data
[2] https://phoenix.apache.org/phoenix_spark.html
[3] https://phoenix.apache.org/kafka.html
[4] https://issues.apache.org/jira/browse/PHOENIX-3271
[5] https://issues.apache.org/jira/browse/PHOENIX-3346
[6]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315120&version=12338126
[7] http://phoenix.apache.org/download.html


[ANNOUNCE] PhoenixCon 2017 is a go!

2017-03-15 Thread James Taylor
I'm excited to announce that the 2nd Annual Apache Phoenix conference,
PhoenixCon 2017 will take place the day after HBaseCon in San Francisco on
Tuesday, June 13th from 10:30am-6pm. For more details, including to RSVP
and submit a talk proposal, click here:
https://www.eventbrite.com/e/phoenixcon-2017-tickets-32872245772

Hope you can make it.

James


[ANNOUNCE] Apache Phoenix 4.9 released

2016-12-01 Thread James Taylor
Apache Phoenix enables OLTP and operational analytics for Apache Hadoop
through SQL support using Apache HBase as its backing store and providing
integration with other projects in the ecosystem such as Apache Spark,
Apache Hive, Apache Pig, Apache Flume, and Apache MapReduce.

We're pleased to announce our 4.9.0 release which includes:
- Atomic UPSERT through new ON DUPLICATE KEY syntax [1] (example below)
- Support for DEFAULT declaration in DDL statements [2]
- Specify guidepost width per table [3]
- Over 40 bugs fixed [4]
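
A small, hypothetical sketch of the first two items (table and column names
are made up; see [1] and [2] for the full syntax):

CREATE TABLE PAGE_COUNTS (
    URL VARCHAR PRIMARY KEY,
    HITS BIGINT DEFAULT 0);

-- atomically increment the counter, inserting the row if it does not exist
UPSERT INTO PAGE_COUNTS(URL, HITS) VALUES ('/home', 1)
ON DUPLICATE KEY UPDATE HITS = HITS + 1;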

The release is available in source or binary form here [5].

Thanks,
The Apache Phoenix Team

[1] https://phoenix.apache.org/atomic_upsert.html
[2] https://phoenix.apache.org/language/index.html#column_def
[3] https://phoenix.apache.org/update_statistics.html
[4]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315120&version=12335845
[5] https://phoenix.apache.org/download.html


[ANNOUNCE] PhoenixCon the day after HBaseCon

2016-05-19 Thread James Taylor
The inaugural PhoenixCon will take place 9am-1pm on Wed, May 25th (at
Salesforce @ 1 Market St, SF), the day after HBaseCon. We'll have two
tracks: one for Apache Phoenix use cases and one for Apache Phoenix
internals.

To RSVP and for more details see here[1].

We hope you can make it!

James

[1]
http://www.meetup.com/SF-Bay-Area-Apache-Phoenix-Meetup/events/230545182/


Re: [ANNOUNCE] PhoenixCon 2016 on Wed, May 25th 9am-1pm

2016-04-27 Thread James Taylor
Yes, that sounds great - please let me know when I can add you to the
agenda.

James

On Tuesday, April 26, 2016, Anil Gupta <anilgupt...@gmail.com> wrote:

> Hi James,
> I spoke to my manager and he is fine with the idea of giving the talk.
> Now, he is gonna ask higher management for final approval. I am assuming
> there is still a slot for my talk in the use case section. I should go ahead
> with my approval process. Correct?
>
> Thanks,
> Anil Gupta
> Sent from my iPhone
>
> > On Apr 26, 2016, at 5:56 PM, James Taylor <jamestay...@apache.org> wrote:
> >
> > We invite you to attend the inaugural PhoenixCon on Wed, May 25th 9am-1pm
> > (the day after HBaseCon) hosted by Salesforce.com in San Francisco. There
> > will be two tracks: one for use cases and one for internals. Drop me a
> note
> > if you're interested in giving a talk. To RSVP and for more details, see
> > here[1].
> >
> > Thanks,
> > James
> >
> > [1]
> http://www.meetup.com/SF-Bay-Area-Apache-Phoenix-Meetup/events/230545182
>


[ANNOUNCE] PhoenixCon 2016 on Wed, May 25th 9am-1pm

2016-04-26 Thread James Taylor
We invite you to attend the inaugural PhoenixCon on Wed, May 25th 9am-1pm
(the day after HBaseCon) hosted by Salesforce.com in San Francisco. There
will be two tracks: one for use cases and one for internals. Drop me a note
if you're interested in giving a talk. To RSVP and for more details, see
here[1].

Thanks,
James

[1] http://www.meetup.com/SF-Bay-Area-Apache-Phoenix-Meetup/events/230545182


[ANNOUNCE] Apache Phoenix 4.5 released

2015-08-05 Thread James Taylor
The Apache Phoenix team is pleased to announce the immediate availability
of the 4.5 release with support for HBase 0.98/1.0/1.1.

Together with the 4.4 release, highlights include:

Spark Integration (4.4) [1]
User Defined Functions (4.4) [2]
Query Server with thin driver (4.4) [3]
Pherf tool for performance and functional testing at scale (4.4) [4]
Asynchronous index population through MR based index builder (4.5) [5] (example below)
Collection of client-side metrics aggregated per statement (4.5) [6]
Improvements to modeling through VIEWs (4.5) [7][8]
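
For example, an index can be declared with the ASYNC keyword so that it is
populated later by the MR-based index builder described in [5] rather than
synchronously at creation time (table and index names below are made up):

-- returns immediately; the index is built afterwards by the MR index builder
CREATE INDEX WEBSITE_IDX ON WEBSITE_STATS (WEBSITE_URL) ASYNC;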

More details of the release may be found here [9] and the release may be
downloaded here [10].

Regards,
The Apache Phoenix Team

[1] http://phoenix.apache.org/phoenix_spark.html
[2] http://phoenix.apache.org/udf.html
[3] http://phoenix.apache.org/server.html
[4] http://phoenix.apache.org/pherf.html
[5]
http://phoenix.apache.org/secondary_indexing.html#Asynchronous_Index_Population
[6] https://issues.apache.org/jira/browse/PHOENIX-1819
[7] https://issues.apache.org/jira/browse/PHOENIX-1504
[8] https://issues.apache.org/jira/browse/PHOENIX-978
[9] https://blogs.apache.org/phoenix/entry/announcing_phoenix_4_5_released
[10] http://phoenix.apache.org/download.html


[ANNOUNCE] Apache Phoenix 4.3 released

2015-02-25 Thread James Taylor
The Apache Phoenix team is pleased to announce the immediate
availability of the 4.3 release. Highlights include:

- functional indexes [1] (see the example after this list)
- map-reduce over Phoenix tables [2]
- cross join support [3]
- query hint to force index usage [4]
- set HBase properties through ALTER TABLE
- ISO-8601 date format support on input
- RAND built-in for random number generation
- ANSI SQL date/time literals
- query timeout support in JDBC Statement
- over 90 bug fixes
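
A short illustration of functional indexes and the index hint (adapted from
the documentation in [1] and [4]; table and index names are examples only):

-- index on an expression rather than a plain column
CREATE INDEX UPPER_NAME_IDX ON EMP (UPPER(FIRST_NAME||' '||LAST_NAME));

-- hint that forces the optimizer to use that index
SELECT /*+ INDEX(EMP UPPER_NAME_IDX) */ EMP_ID
FROM EMP
WHERE UPPER(FIRST_NAME||' '||LAST_NAME) = 'JOHN DOE';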

The release is available through maven or may be downloaded here [5].

Regards,
The Apache Phoenix Team

[1] http://phoenix.apache.org/secondary_indexing.html#Functional_Indexes
[2] http://phoenix.apache.org/phoenix_mr.html
[3] http://phoenix.apache.org/joins.html
[4] http://phoenix.apache.org/secondary_indexing.html#Examples
[5] http://phoenix.apache.org/download.html


[ANNOUNCE] Apache Phoenix meetup in SF on Tue, Feb 24th

2015-01-22 Thread James Taylor
I'm excited to announce the first ever Apache Phoenix meetup, hosted
by salesforce.com in San Francisco on Tuesday, February 24th @ 6pm.
More details here:
http://www.meetup.com/San-Francisco-Apache-Phoenix-Meetup/events/220009583/

Please ping me if you're interested in presenting your company's use
case. We'll have live streaming available for remote participants as
well.

Thanks,
James


[ANNOUNCE] Apache Phoenix 4.2.2 and 3.2.2 released

2014-12-10 Thread James Taylor
The Apache Phoenix team is pleased to announce the immediate
availability of the 4.2.2/3.2.2 release. For details of the release,
see our release announcement[1].

The Apache Phoenix team

[1] https://blogs.apache.org/phoenix/entry/announcing_phoenix_4_2_2


Re: Connecting Hbase to Elasticsearch with Phoenix

2014-09-10 Thread James Taylor
+1. Thanks, Alex. I added a blog pointing folks there as well:
https://blogs.apache.org/phoenix/entry/connecting_hbase_to_elasticsearch_through

On Wed, Sep 10, 2014 at 2:12 PM, Andrew Purtell apurt...@apache.org wrote:
 Thanks for writing in with this pointer Alex!

 On Wed, Sep 10, 2014 at 11:11 AM, Alex Kamil alex.ka...@gmail.com wrote:
 I posted step-by-step instructions here on using Apache Hbase/Phoenix with
 Elasticsearch JDBC River.

 This might be useful to Elasticsearch users who want to use Hbase as a
 primary data store, and to Hbase users who wish to enable full-text search
 on their existing tables via Elasticsearch API.

 Alex



 --
 Best regards,

- Andy

 Problems worthy of attack prove their worth by hitting back. - Piet
 Hein (via Tom White)


[ANNOUNCE] Apache Phoenix 3.1 and 4.1 released

2014-09-01 Thread James Taylor
Hello everyone,

On behalf of the Apache Phoenix [1] project, a SQL database on top of
HBase, I'm pleased to announce the immediate availability of our 3.1
and 4.1 releases [2].

These include many bug fixes along with support for nested/derived
tables, tracing, and local indexing. For details of the release,
please see our announcement [3].
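
As a tiny, illustrative example of the derived-table support (table and
column names are made up):

SELECT T.URL, T.TOTAL FROM
    (SELECT URL, COUNT(*) AS TOTAL FROM EVENTS GROUP BY URL) T
WHERE T.TOTAL > 100;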

Regards,
The Apache Phoenix team

[1] http://phoenix.apache.org/
[2] http://phoenix.apache.org/download.html
[3] https://blogs.apache.org/phoenix/entry/announcing_phoenix_3_1_and


Re: Region not assigned

2014-08-14 Thread James Taylor
On the first connection to the cluster after you've installed Phoenix
2.2.3 (having previously used Phoenix 2.2.2), Phoenix will upgrade
your Phoenix tables to use the new coprocessor names
(org.apache.phoenix.*) instead of the old coprocessor names
(com.salesforce.phoenix.*).

Thanks,
James

On Thu, Aug 14, 2014 at 8:38 AM, Ted Yu yuzhih...@gmail.com wrote:
 Adding Phoenix dev@



 On Thu, Aug 14, 2014 at 8:05 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:

 It seems that the region servers are complaining about wrong phoenix
 classes for some reason. We are running 2.2.0 which is the version before
 phoenix was moved to apache.

 But looking at the regionserver logs, they are stuck complaining about
 org.apache.phoenix.coprocessor.MetaDataEndpointImpl, which IS the apache
 version. We might have connected with a newer client - but how could this
 trigger this?


 2014-08-14 17:01:40,052 DEBUG
 org.apache.hadoop.hbase.coprocessor.CoprocessorHost: Loading coprocessor
 class org.apache.phoenix.coprocessor.ServerCachingEndpointImpl with path
 null and priority 1
 2014-08-14 17:01:40,053 WARN
 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost: attribute
 'coprocessor$12' has invalid coprocessor specification
 '|org.apache.phoenix.coprocessor.ServerCachingEndpointImpl|1|'
 2014-08-14 17:01:40,053 WARN
 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost:
 java.io.IOException: No jar path specified for
 org.apache.phoenix.coprocessor.ServerCachingEndpointImpl
 at

 org.apache.hadoop.hbase.coprocessor.CoprocessorHost.load(CoprocessorHost.java:183)
 at

 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.loadTableCoprocessors(RegionCoprocessorHost.java:190)
 at

 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.<init>(RegionCoprocessorHost.java:154)
 at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:474)
 at sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown Source)
 at

 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at
 org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:4084)
 at
 org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4267)
 at

 org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:329)
 at

 org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:100)
 at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

 2014-08-14 17:01:40,053 DEBUG
 org.apache.hadoop.hbase.coprocessor.CoprocessorHost: Loading coprocessor
 class org.apache.phoenix.coprocessor.MetaDataEndpointImpl with path null
 and priority 1
 2014-08-14 17:01:40,053 WARN
 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost: attribute
 'coprocessor$13' has invalid coprocessor specification
 '|org.apache.phoenix.coprocessor.MetaDataEndpointImpl|1|'
 2014-08-14 17:01:40,053 WARN
 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost:
 java.io.IOException: No jar path specified for
 org.apache.phoenix.coprocessor.MetaDataEndpointImpl
 at

 org.apache.hadoop.hbase.coprocessor.CoprocessorHost.load(CoprocessorHost.java:183)
 at

 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.loadTableCoprocessors(RegionCoprocessorHost.java:190)
 at

 org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.<init>(RegionCoprocessorHost.java:154)
 at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:474)
 at sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown Source)
 at

 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at
 org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:4084)
 at
 org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4267)
 at

 org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:329)
 at

 org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:100)
 at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)



 On Thu, Aug 14, 2014 at 4:31 PM, Kristoffer Sjögren sto...@gmail.com
 wrote:

  Hi
 
  We are running hbase 0.94.6 cdh 4.4 and have a problem with one table not
  being assigned to any region. This is the SYSTEM.TABLE in Phoenix so all
  tables are basically non functional at the moment.
 
  When 

Re: Copy some records from Huge hbase table to another table

2014-05-23 Thread James Taylor
Hi Riyaz,
You can do this with a single SQL command using Apache Phoenix, a SQL
engine on top of HBase, and you'll get better performance than if you hand
coded it using the HBase client APIs. Depending on your current schema, you
may be able to run this command with no change to your data. Let's assume
you have an MD5 hash of the website and the date/time in the row key with
your website and counts in key values. That schema could be modeled like
this in Phoenix:

CREATE VIEW WEBSITE_STATS (
WEBSITE_MD5 BINARY(16) NOT NULL,
DATE_COLLECTED UNSIGNED_DATE NOT NULL,
WEBSITE_URL VARCHAR,
HIT_COUNT UNSIGNED_LONG,
CONSTRAINT PK PRIMARY KEY (WEBSITE_MD5, DATE_COLLECTED));

You could issue this create view statement and map directly to your HBase
table. I used the UNSIGNED types above as they match the serialization you
get when you use the HBase Bytes utility methods. Phoenix normalizes column
names by upper casing them, so if your column qualifiers are lower case,
you'd want to put the column names above in double quotes.

Next, you'd create a table to hold the top10 information:

CREATE TABLE WEBSITE_TOP10 (
WEBSITE_URL VARCHAR PRIMARY KEY,
TOTAL_HIT_COUNT BIGINT);

Then you'd run an UPSERT SELECT command like this:

UPSERT INTO WEBSITE_TOP10
SELECT WEBSITE_URL, SUM(HIT_COUNT) FROM WEBSITE_STATS
GROUP BY WEBSITE_URL
ORDER BY SUM(HIT_COUNT) DESC
LIMIT 10;

Phoenix will run the SELECT part of this in parallel on the client and use
a coprocessor on the server side to aggregate over the WEBSITE_URLs
returning the distinct set of urls per region with a final merge sort
happening on the client to compute the total sum. Then the client will hold
on to the top 10 rows it sees and upsert these into the WEBSITE_TOP10 table.

HTH,
James






On Fri, May 23, 2014 at 5:14 PM, Ted Yu yuzhih...@gmail.com wrote:

 Would the new HBase table reside in the same cluster as the original table
 ?

 See this recent thread: http://search-hadoop.com/m/DHED4uBNqJ1

 Cheers


 On Fri, May 23, 2014 at 2:49 PM, Shaikh Ahmed rnsr.sha...@gmail.com
 wrote:

  Hi,
 
  We have one huge HBase table with billions of rows. This table holds the
  information about websites and number of hits on that site in every 15
  minutes.
 
  Every website will have multiple records in data with different number of
  hit count and last updated timestamp.
 
  Now, we want to create another Hbase table which will contain information
  about only those TOP 10 websites which are having more number of hits.
 
  We are seeking help from experts to achieve this requirement.
  How we can filter top 10 websites based on hits count from billions of
  records and copy it into our new HBase table?
 
  I will greatly appreciate kind support from group members.
 
  Thanks in advance.
  Regards,
  Riyaz
 



[ANNOUNCE] Apache Phoenix has graduated as a top level project

2014-05-22 Thread James Taylor
I'm pleased to announce that Apache Phoenix has graduated from the
incubator to become a top level project. Thanks so much for all your help
and support - we couldn't have done it without the fantastic HBase
community! We're looking forward to continued collaboration.
Regards,
The Apache Phoenix team


Re: hbase key design to efficient query on base of 2 or more column

2014-05-19 Thread James Taylor
If you use Phoenix, queries would leverage our Skip Scan:
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html

Assuming a row key made up of a low cardinality first value (like a byte
representing an enum), followed by a high cardinality second value (like a
date/time value) you'd get a large benefit from the skip scan when you're
only looking a small sliver of your time range.
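
For instance, with a schema roughly like the one below (purely illustrative),
a query that pins the leading enum column to a few values and ranges over the
date lets the skip scan jump directly between the matching key ranges instead
of scanning everything in between:

CREATE TABLE EVENTS (
    EVENT_TYPE UNSIGNED_TINYINT NOT NULL,
    EVENT_TIME DATE NOT NULL,
    PAYLOAD VARCHAR,
    CONSTRAINT PK PRIMARY KEY (EVENT_TYPE, EVENT_TIME));

SELECT * FROM EVENTS
WHERE EVENT_TYPE IN (1, 2)
AND EVENT_TIME >= TO_DATE('2014-05-01', 'yyyy-MM-dd')
AND EVENT_TIME < TO_DATE('2014-05-02', 'yyyy-MM-dd');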

Another option would be to create a secondary index over your second+first
column: http://phoenix.incubator.apache.org/secondary_indexing.html

Thanks,
James

On May 19, 2014, at 6:47 AM, Shushant Arora shushantaror...@gmail.com
wrote:

Ok..but what if I have 2 multivalue dimensions on which I have to analyse
no of users. Say Category can have 50 values and another dimension is
country of user(say 100+ values). I need weekly count on category and
country + I need overall distinct user count on category and country.

How to achieve this in Hbase.


On Mon, May 19, 2014 at 3:11 PM, Michael Segel michael_se...@hotmail.com
wrote:

The point is that choosing a field that has a small finite set of values

is not a good candidate for indexing using an inverted table or b-tree etc …


I’d say that you’re actually going to be better off using a scan with a

start and stop row, then doing the counts on the client side.


So as you get back your result set… you process the data. (Either in a M/R

job or single client thread.)


HTH


On May 19, 2014, at 8:48 AM, Shushant Arora shushantaror...@gmail.com

wrote:


I cannot apply server side filter.

2nd requirement is not just get users with supreme category rather

distribution of users category wise.


1.How many of supreme , how many of normal and how many of medium till

date.



On Mon, May 19, 2014 at 12:58 PM, Michael Segel

michael_se...@hotmail.comwrote:


Whoa!


BAD BOY. This isn’t a good idea for secondary index.


You have a row key (primary index) which is time.

The secondary is a filter… with 3 choices.


HINT: Do you really want a secondary index based on a field that only

has

3 choices for a value?


What are they teaching in school these days?


How about applying a server side filter?  ;-)




On May 18, 2014, at 12:33 PM, John Hancock jhancock1...@gmail.com

wrote:


Shushant,


Here's one idea, there might be better ways.


Take a look at phoenix it supports secondary indexing:

http://phoenix.incubator.apache.org/secondary_indexing.html


-John



On Sat, May 17, 2014 at 8:34 AM, Shushant Arora

shushantaror...@gmail.comwrote:


Hi


I have a requirement to query my data base on date and user category.

User category can be Supreme,Normal,Medium.


I want to query how many new users are there in my table from date

range

(2014-01-01) to (2014-05-16) category wise.


Another requirement is to query how many users of Supreme category are

there in my table Broken down wise month in which they came.


What should be my key

1.If i take key as combination of date#category. I cannot query based

on

category?

2.If I take key as category#date I cannot query based on date.



Thanks

Shushant.


Re: Questions on FuzzyRowFilter

2014-05-18 Thread James Taylor
@Mike,

The biggest problem is you're not listening. Please actually read my
response (and you'll understand that what we're calling salting is not a
random seed).

Phoenix already has secondary indexes in two flavors: one optimized for
write-once data and one more general for fully mutable data. Soon we'll
have a third for local indexing.

James


On Sun, May 18, 2014 at 10:27 AM, Michael Segel
michael_se...@hotmail.comwrote:

 @James,

 I know and that’s the biggest problem.
 Salts by definition are random seeds.

 Now I have two new phrases.

 1) We want to remain on a sodium free diet.
 2) Learn to kick the bucket.

 When you have data that is coming in on a time series, is the data mutable
 or not?

 A better approach would be to redesign a second type of storage to handle
 serial data and how the regions are split and managed.
 Or just not use HBase to store the underlying data in the first place and
 just store the index… ;-)
 (Yes, I thought about this too.)

 -Mike

 On May 16, 2014, at 7:50 PM, James Taylor jtay...@salesforce.com wrote:

  Hi Mike,
  I agree with you - the way you've outlined is exactly the way Phoenix has
  implemented it. It's a bit of a problem with terminology, though. We call
  it salting: http://phoenix.incubator.apache.org/salted.html. We hash the
  key, mod the hash with the SALT_BUCKET value you provide, and prepend the
  row key with this single byte value. Maybe you can coin a good term for
  this technique?
 
  FWIW, you don't lose the ability to do a range scan when you salt (or
  hash-the-key and mod by the number of buckets), but you do need to run
 a
  scan for each possible value of your salt byte (0 - SALT_BUCKET-1). Then
  the client does a merge sort among these scans. It performs well.
 
  Thanks,
  James
 
 
  On Fri, May 9, 2014 at 11:57 PM, Michael Segel 
 michael_se...@hotmail.comwrote:
 
  3+ Years on and a bad idea is being propagated again.
 
  Now repeat after me… DO NOT USE A SALT.
 
  Having a low sodium diet, especially for HBase is really good for your
  health and sanity.
 
  The salt is going to be orthogonal to the row key (Key).
  There is no relationship to the specific Key.
 
  Using a salt means you now use the ability to randomly spread the
  distribution of data to avoid HOT SPOTTING.
  However you lose the ability to seek for a specific row.
 
  YOU HASH THE KEY.
 
  The hash whether you use SHA-1 or MD-5 is going to yield the same result
  each and every time you provide the key.
 
  But wait, the generated hash is 160 bits long. We don’t need that!
  Absolutely true if you just want to randomize the key to avoid hot
  spotting. There’s this concept called truncating the hash to the desired
  length.
  So to Adrien’s point, you can truncate it to a single byte which would
 be
  sufficient….
  Now when you want to seek for a specific row, you can find it.
 
  The downside to any solution is that you lose the ability to do a range
  scan.
  BUT BY USING A HASH AND NOT A SALT, YOU DONT LOSE THE ABILITY TO FETCH A
  SINGLE ROW VIA A get() CALL.
 
  rant
  This simple fact has been pointed out several years ago, yet for some
  reason, the use of a salt persists.
  I’ve actually made that part of the HBase course I wrote and use it in
 my
  presentation(s) on HBase.
 
  It amazes me that the committers and regulars who post here still don’t
  grok the fact that if you’re going to ‘SALT’ a row, you might as well
 not
  use HBase and stick with Hive.
  I remember Ed C’s rant about how preferential treatment on Hive patches
  was given to vendors’ committers… that preferential treatment seems to
 also
  be extended speakers at conferences. It wouldn’t be a problem if those
 said
  speakers actually knew the topic… ;-)
 
  Propagation of bad ideas means that you’re leaving a lot of performance
 on
  the table and it can kill or cripple projects.
 
  /rant
 
  Sorry for the rant…
 
  -Mike
 
 
 
 
  On May 3, 2014, at 4:39 PM, Software Dev static.void@gmail.com
  wrote:
 
  Ok so there is no way around the FuzzyRowFilter checking every single
  row in the table correct? If so, what is a valid use case for that
  filter?
 
  Ok so salt to a low enough prefix that makes scanning reasonable. Our
  client for accessing these tables is a Rails (not JRuby) application
  so we are stuck with either the Thrift or Rails client. Can either of
  these perform multiple gets/scans?
 
 
 
  On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet 
 adrien.moge...@gmail.com
  wrote:
  Using 4 random bytes you'll get 2^32 possibilities; thus your data can
  be
  split enough among all the possible regions, but you won't be able to
  easily benefit from distributed scans to gather what you want.
 
  Let say you want to split (time+login) with a salted key and you
 expect
  to
  be able to retrieve events from 20140429 pretty fast. Then I would
 split
  input data among 10 spans, spread over 10 regions and 10 RS (ie:
  `$random
  % 10'). To retrieve ordered data, I would

Re: Questions on FuzzyRowFilter

2014-05-18 Thread James Taylor
The top two hits when you Google for HBase salt are
- Sematext blog describing salting as I described it in my email
- Phoenix blog again describing salting in this same way
I really don't understand what you're arguing about - the mechanism that
you're advocating for is exactly the way both these solutions have
implemented it. I believe we're all in agreement. It seems that you just
aren't happy with the fact that we've called this technique salting.


On Sun, May 18, 2014 at 11:32 AM, Michael Segel
michael_se...@hotmail.comwrote:

 @James…
 You’re not listening. There is a special meaning when you say salt.

 On May 18, 2014, at 7:16 PM, James Taylor jtay...@salesforce.com wrote:

  @Mike,
 
  The biggest problem is you're not listening. Please actually read my
  response (and you'll understand the what we're calling salting is not a
  random seed).
 
  Phoenix already has secondary indexes in two flavors: one optimized for
  write-once data and one more general for fully mutable data. Soon we'll
  have a third for local indexing.
 
  James
 
 
  On Sun, May 18, 2014 at 10:27 AM, Michael Segel
  michael_se...@hotmail.comwrote:
 
  @James,
 
  I know and that’s the biggest problem.
  Salts by definition are random seeds.
 
  Now I have two new phrases.
 
  1) We want to remain on a sodium free diet.
  2) Learn to kick the bucket.
 
  When you have data that is coming in on a time series, is the data
 mutable
  or not?
 
  A better approach would be to redesign a second type of storage to
 handle
  serial data and how the regions are split and managed.
  Or just not use HBase to store the underlying data in the first place
 and
  just store the index… ;-)
  (Yes, I thought about this too.)
 
  -Mike
 
  On May 16, 2014, at 7:50 PM, James Taylor jtay...@salesforce.com
 wrote:
 
  Hi Mike,
  I agree with you - the way you've outlined is exactly the way Phoenix
 has
  implemented it. It's a bit of a problem with terminology, though. We
 call
  it salting: http://phoenix.incubator.apache.org/salted.html. We hash
 the
  key, mod the hash with the SALT_BUCKET value you provide, and prepend
 the
  row key with this single byte value. Maybe you can coin a good term for
  this technique?
 
  FWIW, you don't lose the ability to do a range scan when you salt (or
  hash-the-key and mod by the number of buckets), but you do need to
 run
  a
  scan for each possible value of your salt byte (0 - SALT_BUCKET-1).
 Then
  the client does a merge sort among these scans. It performs well.
 
  Thanks,
  James
 
 
  On Fri, May 9, 2014 at 11:57 PM, Michael Segel 
  michael_se...@hotmail.comwrote:
 
  3+ Years on and a bad idea is being propagated again.
 
  Now repeat after me… DO NOT USE A SALT.
 
  Having a low sodium diet, especially for HBase is really good for your
  health and sanity.
 
  The salt is going to be orthogonal to the row key (Key).
  There is no relationship to the specific Key.
 
  Using a salt means you now use the ability to randomly spread the
  distribution of data to avoid HOT SPOTTING.
  However you lose the ability to seek for a specific row.
 
  YOU HASH THE KEY.
 
  The hash whether you use SHA-1 or MD-5 is going to yield the same
 result
  each and every time you provide the key.
 
  But wait, the generated hash is 160 bits long. We don’t need that!
  Absolutely true if you just want to randomize the key to avoid hot
  spotting. There’s this concept called truncating the hash to the
 desired
  length.
  So to Adrien’s point, you can truncate it to a single byte which would
  be
  sufficient….
  Now when you want to seek for a specific row, you can find it.
 
  The downside to any solution is that you lose the ability to do a
 range
  scan.
  BUT BY USING A HASH AND NOT A SALT, YOU DONT LOSE THE ABILITY TO
 FETCH A
  SINGLE ROW VIA A get() CALL.
 
  rant
  This simple fact has been pointed out several years ago, yet for some
  reason, the use of a salt persists.
  I’ve actually made that part of the HBase course I wrote and use it in
  my
  presentation(s) on HBase.
 
  It amazes me that the committers and regulars who post here still
 don’t
  grok the fact that if you’re going to ‘SALT’ a row, you might as well
  not
  use HBase and stick with Hive.
  I remember Ed C’s rant about how preferential treatment on Hive
 patches
  was given to vendors’ committers… that preferential treatment seems to
  also
  be extended speakers at conferences. It wouldn’t be a problem if those
  said
  speakers actually knew the topic… ;-)
 
  Propagation of bad ideas means that you’re leaving a lot of
 performance
  on
  the table and it can kill or cripple projects.
 
  /rant
 
  Sorry for the rant…
 
  -Mike
 
 
 
 
  On May 3, 2014, at 4:39 PM, Software Dev static.void@gmail.com
  wrote:
 
  Ok so there is no way around the FuzzyRowFilter checking every single
  row in the table correct? If so, what is a valid use case for that
  filter?
 
  Ok so salt to a low enough prefix that makes scanning reasonable. Our

Re: Questions on FuzzyRowFilter

2014-05-18 Thread James Taylor
@Software Dev - if you use Phoenix, queries would leverage our Skip Scan
(which supports a superset of the FuzzyRowFilter perf improvements). Take a
look here:
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html

Assuming a row key made up of a low cardinality first value (like a byte
representing an enum), followed by a high cardinality second value (like a
date/time value) you'd get a large benefit from the skip scan when you're
only looking a small sliver of your time range.

Another option would be to create a secondary index over your date:
http://phoenix.incubator.apache.org/secondary_indexing.html

Thanks,
James


On Sun, May 18, 2014 at 1:56 PM, James Taylor jtay...@salesforce.comwrote:

 The top two hits when you Google  for HBase salt are
 - Sematext blog describing salting as I described it in my email
 - Phoenix blog again describing salting in this same way
 I really don't understand what you're arguing about - the mechanism that
 you're advocating for is exactly the way both these solutions have
 implemented it. I believe we're all in agreement. It seems that you just
 aren't happy with the fact that we've called this technique salting.


 On Sun, May 18, 2014 at 11:32 AM, Michael Segel michael_se...@hotmail.com
  wrote:

 @James…
 You’re not listening. There is a special meaning when you say salt.

 On May 18, 2014, at 7:16 PM, James Taylor jtay...@salesforce.com wrote:

  @Mike,
 
  The biggest problem is you're not listening. Please actually read my
  response (and you'll understand the what we're calling salting is not
 a
  random seed).
 
  Phoenix already has secondary indexes in two flavors: one optimized for
  write-once data and one more general for fully mutable data. Soon we'll
  have a third for local indexing.
 
  James
 
 
  On Sun, May 18, 2014 at 10:27 AM, Michael Segel
  michael_se...@hotmail.comwrote:
 
  @James,
 
  I know and that’s the biggest problem.
  Salts by definition are random seeds.
 
  Now I have two new phrases.
 
  1) We want to remain on a sodium free diet.
  2) Learn to kick the bucket.
 
  When you have data that is coming in on a time series, is the data
 mutable
  or not?
 
  A better approach would be to redesign a second type of storage to
 handle
  serial data and how the regions are split and managed.
  Or just not use HBase to store the underlying data in the first place
 and
  just store the index… ;-)
  (Yes, I thought about this too.)
 
  -Mike
 
  On May 16, 2014, at 7:50 PM, James Taylor jtay...@salesforce.com
 wrote:
 
  Hi Mike,
  I agree with you - the way you've outlined is exactly the way Phoenix
 has
  implemented it. It's a bit of a problem with terminology, though. We
 call
  it salting: http://phoenix.incubator.apache.org/salted.html. We hash
 the
  key, mod the hash with the SALT_BUCKET value you provide, and prepend
 the
  row key with this single byte value. Maybe you can coin a good term
 for
  this technique?
 
  FWIW, you don't lose the ability to do a range scan when you salt (or
  hash-the-key and mod by the number of buckets), but you do need to
 run
  a
  scan for each possible value of your salt byte (0 - SALT_BUCKET-1).
 Then
  the client does a merge sort among these scans. It performs well.
 
  Thanks,
  James
 
 
  On Fri, May 9, 2014 at 11:57 PM, Michael Segel 
  michael_se...@hotmail.comwrote:
 
  3+ Years on and a bad idea is being propagated again.
 
  Now repeat after me… DO NOT USE A SALT.
 
  Having a low sodium diet, especially for HBase is really good for
 your
  health and sanity.
 
  The salt is going to be orthogonal to the row key (Key).
  There is no relationship to the specific Key.
 
  Using a salt means you now use the ability to randomly spread the
  distribution of data to avoid HOT SPOTTING.
  However you lose the ability to seek for a specific row.
 
  YOU HASH THE KEY.
 
  The hash whether you use SHA-1 or MD-5 is going to yield the same
 result
  each and every time you provide the key.
 
  But wait, the generated hash is 160 bits long. We don’t need that!
  Absolutely true if you just want to randomize the key to avoid hot
  spotting. There’s this concept called truncating the hash to the
 desired
  length.
  So to Adrien’s point, you can truncate it to a single byte which
 would
  be
  sufficient….
  Now when you want to seek for a specific row, you can find it.
 
  The downside to any solution is that you lose the ability to do a
 range
  scan.
  BUT BY USING A HASH AND NOT A SALT, YOU DONT LOSE THE ABILITY TO
 FETCH A
  SINGLE ROW VIA A get() CALL.
 
  rant
  This simple fact has been pointed out several years ago, yet for some
  reason, the use of a salt persists.
  I’ve actually made that part of the HBase course I wrote and use it
 in
  my
  presentation(s) on HBase.
 
  It amazes me that the committers and regulars who post here still
 don’t
  grok the fact that if you’re going to ‘SALT’ a row, you might as well
  not
  use HBase and stick with Hive.
  I remember

Re: Prefix salting pattern

2014-05-18 Thread James Taylor
@Software Dev - might be feasible to implement a Thrift client that speaks
Phoenix JDBC. I believe this is similar to what Hive has done.
Thanks,
James


On Sun, May 18, 2014 at 1:19 PM, Mike Axiak m...@axiak.net wrote:

 In our measurements, scanning is improved by performing against n
 range scans rather than 1 (since you are effectively striping the
 reads). This is even better when you don't necessarily care about the
 order of every row, but want every row in a given range (then you can
 just get whatever row is available from a buffer in the client).

 -Mike

 On Sun, May 18, 2014 at 1:07 PM, Michael Segel
 michael_se...@hotmail.com wrote:
  No, you’re missing the point.
  Its not a good idea or design.
 
  Is your data mutable or static?
 
  To your point: every time you want to do a simple get() you have to open
 up n get() statements. On your range scans you will have to do n range
 scans, then join and sort the result sets. The fact that each result set is
 in sort order will help a little, but still not that clean.
 
 
 
  On May 18, 2014, at 4:58 PM, Software Dev static.void@gmail.com
 wrote:
 
  You may be missing the point. The primary reason for the salt prefix
  pattern is to avoid hotspotting when inserting time series data AND at
  the same time provide a way to perform range scans.
 
 http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
 
  NOTE:  Many people worry about hot spotting when they really don’t
 have to do so. Hot spotting that occurs on the initial load of a table is
 OK. It's when you have a sequential row key that you run into problems
 with hot spotting and regions being only half filled.
 
  The data being inserted will be a constant stream of time ordered data
  so yes, hotspotting will be an issue
 
  Adding a random value to give you a bit of randomness now means that
 you can’t do a range scan..
 
  That's not accurate. To perform a range scan you would just need to
  open up N scanners where N is the size of the buckets/random prefixes
  used.
 
  Don’t take the modulo, just truncate to the first byte.  Taking the
 modulo is again a dumb idea, but not as dumb as using a salt.
 
  Well the only reason why I would think using a salt would be
  beneficial is to limit the number of scanners when performing a range
  scan. See above comment. And yes, performing a range scan will be our
  primary read pattern.
 
  On Sun, May 18, 2014 at 2:36 AM, Michael Segel
  michael_se...@hotmail.com wrote:
  I think I should dust off my schema design talk… clearly the talks
 given by some of the vendors don’t really explain things …
  (Hmmm. Strata London?)
 
  See my reply below…. Note I used SHA-1. MD-5 should also give you
 roughly the same results.
 
  On May 18, 2014, at 4:28 AM, Software Dev static.void@gmail.com
 wrote:
 
  I recently came across the pattern of adding a salting prefix to the
  row keys to prevent hotspotting. Still trying to wrap my head around
  it and I have a few questions.
 
 
  If you add a salt, you’re prepending a random number to a row in order
 to avoid hot spotting.  It amazes me that Sematext never went back and
 either removed the blog or fixed it and now the bad idea is getting
 propagated.  Adding a random value to give you a bit of randomness now
 means that you can’t do a range scan, or fetch the specific row with a
 single get()  so you’re going to end up boiling the ocean to get your data.
 You’re better off using hive/spark/shark than hbase.
 
  As James tries to point out, you take the hash of the row so that you
 can easily retrieve the value. But rather than prepend a 160 bit hash, you
 can easily achieve the same thing by just truncating the hash to the first
 byte in order to get enough randomness to avoid hot spotting. Of course,
 the one question you should ask is why don’t you just take the hash as the
 row key and then have a 160 bit row key (40 bytes in length)? Then store
 the actual key as a column in the table.
 
  And then there’s a bigger question… why are you worried about hot
 spotting? Are you adding rows where the row key is sequential?  Or are you
 worried about when you first start loading rows, that you are hot spotting,
 but the underlying row key is random enough that once the first set of rows
 are added, HBase splitting regions will be enough?
 
  - Is there ever a reason to salt to more buckets than there are region
  servers? The only reason why I think that may be beneficial is to
  anticipate future growth???
 
  Doesn’t matter.
  Think about how HBase splits regions.
  Don’t take the modulo, just truncate to the first byte.  Taking the
 modulo is again a dumb idea, but not as dumb as using a salt.
 
  Keep in mind that the first byte of the hash is going to be 0-f in a
 character representation. (4 bits of the 160bit key)  So you have 16 values
 to start with.
  That should be enough.
 
  - Is it beneficial to always hash against a 

Re: Prefix salting pattern

2014-05-17 Thread James Taylor
No, there's nothing wrong with your thinking. That's exactly what Phoenix
does - use the modulo of the hash of the key. It's important that you can
calculate the prefix byte so that you can still do fast point lookups.

Using a modulo that's bigger than the number of region servers can make
sense as well (up to the overall number of cores in your cluster). You
can't change the modulo without rewriting the data, so factoring in future
growth makes sense.
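
In Phoenix that's just a table property; a minimal sketch (table, columns,
and the bucket count are illustrative):

-- Phoenix prepends hash(row key) % 16 as a single leading byte
CREATE TABLE EVENTS (
    EVENT_TIME DATE NOT NULL,
    USER_ID VARCHAR NOT NULL,
    CONSTRAINT PK PRIMARY KEY (EVENT_TIME, USER_ID))
    SALT_BUCKETS = 16;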

Thanks,
James


On Sat, May 17, 2014 at 8:50 PM, Software Dev static.void@gmail.comwrote:

 Well kept reading on this subject and realized my second question may
 not be appropriate since this prefix salting pattern assumes that the
 prefix is random. I thought it was actually based off a hash that
 could be predetermined so you could alwasy, if needed, get to the
 exact row key with one get. Would there be something wrong with doing
 this.. ie, using a modulo of the hash of the key?

 On Sat, May 17, 2014 at 8:28 PM, Software Dev static.void@gmail.com
 wrote:
  I recently came across the pattern of adding a salting prefix to the
  row keys to prevent hotspotting. Still trying to wrap my head around
  it and I have a few questions.
 
  - Is there ever a reason to salt to more buckets than there are region
  servers? The only reason why I think that may be beneficial is to
  anticipate future growth???
 
  - Is it beneficial to always hash against a known number of buckets
  (ie never change the size) that way for any individual row key you can
  always determine the prefix?
 
  - Are there any good use cases of this pattern out in the wild?
 
  Thanks



Re: Questions on FuzzyRowFilter

2014-05-16 Thread James Taylor
Hi Mike,
I agree with you - the way you've outlined is exactly the way Phoenix has
implemented it. It's a bit of a problem with terminology, though. We call
it salting: http://phoenix.incubator.apache.org/salted.html. We hash the
key, mod the hash with the SALT_BUCKET value you provide, and prepend the
row key with this single byte value. Maybe you can coin a good term for
this technique?

FWIW, you don't lose the ability to do a range scan when you salt (or
hash-the-key and mod by the number of buckets), but you do need to run a
scan for each possible value of your salt byte (0 - SALT_BUCKET-1). Then
the client does a merge sort among these scans. It performs well.

Thanks,
James


On Fri, May 9, 2014 at 11:57 PM, Michael Segel michael_se...@hotmail.comwrote:

 3+ Years on and a bad idea is being propagated again.

 Now repeat after me… DO NOT USE A SALT.

 Having a low sodium diet, especially for HBase is really good for your
 health and sanity.

 The salt is going to be orthogonal to the row key (Key).
 There is no relationship to the specific Key.

 Using a salt means you now use the ability to randomly spread the
 distribution of data to avoid HOT SPOTTING.
 However you lose the ability to seek for a specific row.

 YOU HASH THE KEY.

 The hash whether you use SHA-1 or MD-5 is going to yield the same result
 each and every time you provide the key.

 But wait, the generated hash is 160 bits long. We don’t need that!
 Absolutely true if you just want to randomize the key to avoid hot
 spotting. There’s this concept called truncating the hash to the desired
 length.
 So to Adrien’s point, you can truncate it to a single byte which would be
 sufficient….
 Now when you want to seek for a specific row, you can find it.

 The downside to any solution is that you lose the ability to do a range
 scan.
 BUT BY USING A HASH AND NOT A SALT, YOU DONT LOSE THE ABILITY TO FETCH A
 SINGLE ROW VIA A get() CALL.

 rant
 This simple fact has been pointed out several years ago, yet for some
 reason, the use of a salt persists.
 I’ve actually made that part of the HBase course I wrote and use it in my
 presentation(s) on HBase.

 It amazes me that the committers and regulars who post here still don’t
 grok the fact that if you’re going to ‘SALT’ a row, you might as well not
 use HBase and stick with Hive.
 I remember Ed C’s rant about how preferential treatment on Hive patches
 was given to vendors’ committers… that preferential treatment seems to also
 be extended speakers at conferences. It wouldn’t be a problem if those said
 speakers actually knew the topic… ;-)

 Propagation of bad ideas means that you’re leaving a lot of performance on
 the table and it can kill or cripple projects.

 /rant

 Sorry for the rant…

 -Mike




 On May 3, 2014, at 4:39 PM, Software Dev static.void@gmail.com
 wrote:

  Ok so there is no way around the FuzzyRowFilter checking every single
  row in the table correct? If so, what is a valid use case for that
  filter?
 
  Ok so salt to a low enough prefix that makes scanning reasonable. Our
  client for accessing these tables is a Rails (not JRuby) application
  so we are stuck with either the Thrift or Rails client. Can either of
  these perform multiple gets/scans?
 
 
 
  On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet adrien.moge...@gmail.com
 wrote:
  Using 4 random bytes you'll get 2^32 possibilities; thus your data can
 be
  split enough among all the possible regions, but you won't be able to
  easily benefit from distributed scans to gather what you want.
 
  Let say you want to split (time+login) with a salted key and you expect
 to
  be able to retrieve events from 20140429 pretty fast. Then I would split
  input data among 10 spans, spread over 10 regions and 10 RS (ie:
 `$random
  % 10'). To retrieve ordered data, I would parallelize Scans over the 10
  span groups (00-20140429, 01-20140429...) and merge-sort everything
  until I've got all the expected results.
 
  So in term of performances this looks a little bit faster than your
 2^32
  randomization.
 
 
  On Fri, May 2, 2014 at 10:09 PM, Software Dev 
 static.void@gmail.comwrote:
 
  I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
  time series data (20140501, 20140502...).  We can prefix all of the
  keys with 4 random bytes and then just skip these during scanning. Is
  that correct? These *seems* like it will work but Im questioning the
  performance of this even if it does work.
 
  Also, is this available via the rest client, shell and/or thrift
 client?
 
  Also, is there a FuzzyColumn equivalent of this feature?
 
 
 
 
  --
  Adrien Mogenet
  http://www.borntosegfault.com
 




Re: How to implement sorting in HBase scans for a particular column

2014-04-29 Thread James Taylor
Hi Vikram,
I see you sent the Phoenix mailing list back in Dec a question on how to
use Phoenix 2.1.2 with Hadoop 2 for HBase 0.94. Looks like you were having
trouble building Phoenix with the hadoop2 profile. In our 3.0/4.0 we bundle
the phoenix jars pre-built with both hadoop1 and hadoop2, so there's
nothing you need to do.

Did you have any other issues?

Regarding sorting rows, Apache Phoenix handles this for you when you do an
ORDER BY:
CREATE TABLE names(id VARCHAR NOT NULL PRIMARY KEY,
name VARCHAR, age INTEGER);
// populate the table
SELECT * FROM names ORDER BY age;

Thanks,
James


On Tue, Apr 29, 2014 at 5:33 AM, Vikram Singh Chandel 
vikramsinghchan...@gmail.com wrote:

 Yes, we have looked, but way back in November/December 2013, when it was
 having a lot of issues, because of which we decided not to use it. We
 built our solution design on Hbase alone. So we are looking for a better
 solution.

 Thanks


 On Tue, Apr 29, 2014 at 5:46 PM, Ted Yu yuzhih...@gmail.com wrote:

  Have you looked at Apache Phoenix ?
 
  Cheers
 
  On Apr 29, 2014, at 2:13 AM, Vikram Singh Chandel 
  vikramsinghchan...@gmail.com wrote:
 
   Hi
  
   We have a requirement in which we have to get the scan result sorted
 on a
   particular column.
  
   eg. *Get Details of Authors sorted by their Publication Count. Limit
  :1000 *
  
   *Row Key is an MD5 hash of Author Id*
  
   Number of records: 8.2 million rows for 3 years of data (sample dataset;
   the actual data set is 30 years).
  
   We are currently looking into implementing a *comparator* and sorting the
   values. But for this we first have to store all 8.2M records in a
   map/list and then sort, and this approach is neither memory efficient nor
   time efficient.
  
   Is there any solution via which this kind of request can be fulfilled
 in
   real time?
  
  
  
   --
   *Regards*
  
   *VIKRAM SINGH CHANDEL*
  
   Please do not print this email unless it is absolutely necessary. Reduce.
   Reuse. Recycle. Save our planet.
 



 --
 *Regards*

 *VIKRAM SINGH CHANDEL*

 Please do not print this email unless it is absolutely necessary. Reduce.
 Reuse. Recycle. Save our planet.



Re: How to get specified rows and avoid full table scanning?

2014-04-21 Thread James Taylor
Tao,
Just wanted to give you a couple of relevant pointers to Apache Phoenix for
your particular problem:
- Preventing hotspotting by salting your table:
http://phoenix.incubator.apache.org/salted.html
- Pig Integration for your map/reduce job:
http://phoenix.incubator.apache.org/pig_integration.html

What kind of processing will you be doing in your map-reduce job? FWIW,
Phoenix will allow you to run SQL queries directly over your data, so that
might be an alternative for some of the processing you need to do.

Thanks,
James
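
A small sketch of what that can look like end to end (the table, columns, and bucket count below are made up for illustration; see the salting link above for the authoritative syntax):

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DailyRows {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");

    // Salting spreads the monotonically increasing dates across regions.
    conn.createStatement().execute(
        "CREATE TABLE IF NOT EXISTS events (" +
        " event_date DATE NOT NULL," +
        " event_id VARCHAR NOT NULL," +
        " payload VARCHAR" +
        " CONSTRAINT pk PRIMARY KEY (event_date, event_id))" +
        " SALT_BUCKETS=16");

    // Pull several days of rows; Phoenix issues the per-bucket scans and
    // merge-sorts the results behind this one query.
    PreparedStatement stmt = conn.prepareStatement(
        "SELECT event_id, payload FROM events WHERE event_date >= ? AND event_date < ?");
    stmt.setDate(1, Date.valueOf("2014-04-14"));
    stmt.setDate(2, Date.valueOf("2014-04-21"));
    ResultSet rs = stmt.executeQuery();
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    conn.close();
  }
}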


On Mon, Apr 21, 2014 at 9:20 AM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 Hi Tao,

 also, if you are thinking about time series, you can take a look at TSBD
 http://opentsdb.net/

 JM


 2014-04-21 11:56 GMT-04:00 Ted Yu yuzhih...@gmail.com:

  There're several alternatives.
  One of which is HBaseWD :
 
 
 http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
 
  You can also take a look at Phoenix.
 
  Cheers
 
 
  On Mon, Apr 21, 2014 at 8:04 AM, Tao Xiao xiaotao.cs@gmail.com
  wrote:
 
   I have a big table and rows will be added to this table each day. I want
   to run a MapReduce job over this table and select rows of several days as
   the job's input data. How can I achieve this?
  
   If I prefix the rowkey with the date, I can easily select one day's data as
   the job's input, but this will involve a hot-spotting problem because
   hundreds of millions of rows will be added to this table each day and the
   data will probably go to a single region server. A secondary index would be
   good for queries but not good for a batch processing job.
  
   Are there any other ways?
  
   Are there any other frameworks which can achieve this goal more easily?
   Shark? Stinger? HSearch?
  
 



[ANNOUNCE] Apache Phoenix releases next major version

2014-04-12 Thread James Taylor
The Apache Phoenix team is pleased to announce the release of its next
major version (3.0 and 4.0) from the Apache Incubator. Phoenix is a SQL
query engine for Apache HBase, a NoSQL data store. It is accessed as a JDBC
driver and enables querying and managing HBase tables using SQL.
Major new features include support for: HBase 0.98, joins, shared tables,
views, multi-tenancy, sequences, and arrays.
For more information, see our release announcement here:
https://blogs.apache.org/phoenix/entry/apache_phoenix_released_next_major
Regards,
The Apache Phoenix team


Re: [VOTE] The 4th HBase 0.98.1 release candidate (RC3) is available for download

2014-04-03 Thread James Taylor
I implore you to stick with releasing RC3. Phoenix 4.0 has no release it
can currently run on. Phoenix doesn't use SingleColumnValueFilter, so it
seems that HBASE-10850 has no impact wrt Phoenix. Can't we get these
additional bugs in 0.98.2 - it's one month away [1]?

James

[1] http://en.wikipedia.org/wiki/The_Mythical_Man-Month


On Thu, Apr 3, 2014 at 3:34 AM, ramkrishna vasudevan 
ramkrishna.s.vasude...@gmail.com wrote:

 Will target HBASE-10899 also then by that time.

 Regards
 Ram


 On Thu, Apr 3, 2014 at 3:47 PM, Ted Yu yuzhih...@gmail.com wrote:

  Understood, Andy.
 
  I have integrated fix for HBASE-10850 to 0.98
 
  Cheers
 
 
  On Thu, Apr 3, 2014 at 3:00 AM, Andrew Purtell andrew.purt...@gmail.com
  wrote:
 
   I will sink this RC and roll a new one tomorrow.
  
   However, I may very well release the next RC even if I am the only +1
  vote
   and testing it causes your workstation to catch fire. So please take
 the
   time to commit whatever you feel is needed to the 0.98 branch or file
   blockers against 0.98.1 in the next 24 hours. This is it for 0.98.1.
0.98.2 will happen a mere 30 days from the 0.98.1 release.
  
On Apr 3, 2014, at 11:21 AM, Ted Yu yuzhih...@gmail.com wrote:
   
I agree with Anoop's assessment.
   
Cheers
   
On Apr 3, 2014, at 2:19 AM, Anoop John anoop.hb...@gmail.com
 wrote:
   
After analysing HBASE-10850  I think better we can fix this in 98.1
   release
itself.  Also Phoenix plan to use this 98.1 and Phoenix uses
 essential
   CF
optimization.
   
Also HBASE-10854 can be included in 98.1 in such a case,
   
Considering those we need a new RC.
   
-Anoop-
   
On Tue, Apr 1, 2014 at 10:19 AM, ramkrishna vasudevan 
ramkrishna.s.vasude...@gmail.com wrote:
   
+1 on the RC.
Checked the signature.
Downloaded the source, built and ran the testcases.
Ran Integration Tests with ACL and Visibility labels.  Everything
  looks
fine.
Compaction, flushes etc too.
   
Regards
Ram
   
   
   
On Tue, Apr 1, 2014 at 2:14 AM, Elliott Clark ecl...@apache.org
   wrote:
   
+1
   
Checked the hash
Checked the tar layout.
Played with a single node.  Everything seemed good after ITBLL
   
   
On Mon, Mar 31, 2014 at 9:23 AM, Stack st...@duboce.net wrote:
   
+1
   
The hash is good.  Doc. and layout looks good.  UI seems fine.
   
Ran on small cluster w/ default hadoop 2.2 in hbase against a tip
  of
the
branch hadoop 2.4 cluster.  Seems to basically work (small big
  linked
list
test worked).
   
TSDB seems to work fine against this RC.
   
I don't mean to be stealing our Jon's thunder but in case he is
 too
occupied to vote here, I'll note that he has gotten our internal
  rig
running against the tip of the 0.98 branch and it has been
 passing
green
running IT tests on a small cluster over hours.
   
St.Ack
   
   
   
   
On Sun, Mar 30, 2014 at 12:49 AM, Andrew Purtell 
   apurt...@apache.org
wrote:
   
The 4th HBase 0.98.1 release candidate (RC3) is available for
download
at
http://people.apache.org/~apurtell/0.98.1RC3/ and Maven
 artifacts
are
also
available in the temporary repository
   
  https://repository.apache.org/content/repositories/orgapachehbase-1016
   
Signed with my code signing key D5365CCD.
   
The issues resolved in this release can be found here:
   
  
 
  https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310753&version=12325664
   
   
Please try out the candidate and vote +1/-1 by midnight Pacific
  Time
(00:00
PDT) on April 6 on whether or not we should release this as
  0.98.1.
   
--
Best regards,
   
 - Andy
   
Problems worthy of attack prove their worth by hitting back. -
  Piet
Hein
(via Tom White)
   
  
 



Re: [VOTE] The 4th HBase 0.98.1 release candidate (RC3) is available for download

2014-04-03 Thread James Taylor
It's just the optimization that's (sometimes) broken, right? The scan
still returns the correct results, no?

 On Apr 3, 2014, at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote:

 James:
 HBASE-10850 is not just about SingleColumnValueFilter. See Anoop's comment:

  https://issues.apache.org/jira/browse/HBASE-10850?focusedCommentId=13958668&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13958668

 The test case Fabien provided uses SingleColumnValueFilter but the defect
 has deeper implications beyond making SingleColumnValueFilter unusable in
 certain scenarios.

 I am fine with giving the next RC a bit shorter voting period.

 Cheers


 On Thu, Apr 3, 2014 at 8:57 AM, James Taylor jtay...@salesforce.com wrote:

 I implore you to stick with releasing RC3. Phoenix 4.0 has no release it
 can currently run on. Phoenix doesn't use SingleColumnValueFilter, so it
 seems that HBASE-10850 has no impact wrt Phoenix. Can't we get these
 additional bugs in 0.98.2 - it's one month away [1]?

James

 [1] http://en.wikipedia.org/wiki/The_Mythical_Man-Month


 On Thu, Apr 3, 2014 at 3:34 AM, ramkrishna vasudevan 
 ramkrishna.s.vasude...@gmail.com wrote:

 Will target HBASE-10899 also then by that time.

 Regards
 Ram


 On Thu, Apr 3, 2014 at 3:47 PM, Ted Yu yuzhih...@gmail.com wrote:

 Understood, Andy.

 I have integrated fix for HBASE-10850 to 0.98

 Cheers


 On Thu, Apr 3, 2014 at 3:00 AM, Andrew Purtell 
 andrew.purt...@gmail.com
 wrote:

 I will sink this RC and roll a new one tomorrow.

 However, I may very well release the next RC even if I am the only +1
 vote
 and testing it causes your workstation to catch fire. So please take
 the
 time to commit whatever you feel is needed to the 0.98 branch or file
 blockers against 0.98.1 in the next 24 hours. This is it for 0.98.1.
 0.98.2 will happen a mere 30 days from the 0.98.1 release.

 On Apr 3, 2014, at 11:21 AM, Ted Yu yuzhih...@gmail.com wrote:

 I agree with Anoop's assessment.

 Cheers

 On Apr 3, 2014, at 2:19 AM, Anoop John anoop.hb...@gmail.com
 wrote:

 After analysing HBASE-10850  I think better we can fix this in
 98.1
 release
 itself.  Also Phoenix plan to use this 98.1 and Phoenix uses
 essential
 CF
 optimization.

 Also HBASE-10854 can be included in 98.1 in such a case,

 Considering those we need a new RC.

 -Anoop-

 On Tue, Apr 1, 2014 at 10:19 AM, ramkrishna vasudevan 
 ramkrishna.s.vasude...@gmail.com wrote:

 +1 on the RC.
 Checked the signature.
 Downloaded the source, built and ran the testcases.
 Ran Integration Tests with ACL and Visibility labels.  Everything
 looks
 fine.
 Compaction, flushes etc too.

 Regards
 Ram



 On Tue, Apr 1, 2014 at 2:14 AM, Elliott Clark 
 ecl...@apache.org
 wrote:

 +1

 Checked the hash
 Checked the tar layout.
 Played with a single node.  Everything seemed good after ITBLL


 On Mon, Mar 31, 2014 at 9:23 AM, Stack st...@duboce.net
 wrote:

 +1

 The hash is good.  Doc. and layout looks good.  UI seems fine.

 Ran on small cluster w/ default hadoop 2.2 in hbase against a
 tip
 of
 the
 branch hadoop 2.4 cluster.  Seems to basically work (small big
 linked
 list
 test worked).

 TSDB seems to work fine against this RC.

 I don't mean to be stealing our Jon's thunder but in case he is
 too
 occupied to vote here, I'll note that he has gotten our
 internal
 rig
 running against the tip of the 0.98 branch and it has been
 passing
 green
 running IT tests on a small cluster over hours.

 St.Ack




 On Sun, Mar 30, 2014 at 12:49 AM, Andrew Purtell 
 apurt...@apache.org
 wrote:

 The 4th HBase 0.98.1 release candidate (RC3) is available for
 download
 at
 http://people.apache.org/~apurtell/0.98.1RC3/ and Maven
 artifacts
 are
 also
 available in the temporary repository
 https://repository.apache.org/content/repositories/orgapachehbase-1016

 Signed with my code signing key D5365CCD.

 The issues resolved in this release can be found here:
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310753&version=12325664


 Please try out the candidate and vote +1/-1 by midnight
 Pacific
 Time
 (00:00
 PDT) on April 6 on whether or not we should release this as
 0.98.1.

 --
 Best regards,

 - Andy

 Problems worthy of attack prove their worth by hitting back. -
 Piet
 Hein
 (via Tom White)



Re: [VOTE] The 4th HBase 0.98.1 release candidate (RC3) is available for download

2014-04-03 Thread James Taylor
+1 to Andrew's suggestion. @Anoop - would you mind verifying whether or not
the TestSCVFWithMiniCluster test, written as a Phoenix query, returns the
correct results?


On Thu, Apr 3, 2014 at 9:34 AM, Andrew Purtell andrew.purt...@gmail.comwrote:

 This would be my preference also.

 Can someone provide a definitive statement on if a critical/blocker bug
 exists for Phoenix or not? If not, we have sufficient votes at this point
 to carry the RC and can go forward with the release at the end of the vote
 period.


  On Apr 3, 2014, at 5:57 PM, James Taylor jtay...@salesforce.com wrote:
 
  I implore you to stick with releasing RC3. Phoenix 4.0 has no release it
  can currently run on. Phoenix doesn't use SingleColumnValueFilter, so it
  seems that HBASE-10850 has no impact wrt Phoenix. Can't we get these
  additional bugs in 0.98.2 - it's one month away [1]?
 
 James
 
  [1] http://en.wikipedia.org/wiki/The_Mythical_Man-Month
 
 
  On Thu, Apr 3, 2014 at 3:34 AM, ramkrishna vasudevan 
  ramkrishna.s.vasude...@gmail.com wrote:
 
  Will target HBASE-10899 also then by that time.
 
  Regards
  Ram
 
 
  On Thu, Apr 3, 2014 at 3:47 PM, Ted Yu yuzhih...@gmail.com wrote:
 
  Understood, Andy.
 
  I have integrated fix for HBASE-10850 to 0.98
 
  Cheers
 
 
  On Thu, Apr 3, 2014 at 3:00 AM, Andrew Purtell 
 andrew.purt...@gmail.com
  wrote:
 
  I will sink this RC and roll a new one tomorrow.
 
  However, I may very well release the next RC even if I am the only +1
  vote
  and testing it causes your workstation to catch fire. So please take
  the
  time to commit whatever you feel is needed to the 0.98 branch or file
  blockers against 0.98.1 in the next 24 hours. This is it for 0.98.1.
  0.98.2 will happen a mere 30 days from the 0.98.1 release.
 
  On Apr 3, 2014, at 11:21 AM, Ted Yu yuzhih...@gmail.com wrote:
 
  I agree with Anoop's assessment.
 
  Cheers
 
  On Apr 3, 2014, at 2:19 AM, Anoop John anoop.hb...@gmail.com
  wrote:
 
  After analysing HBASE-10850  I think better we can fix this in 98.1
  release
  itself.  Also Phoenix plan to use this 98.1 and Phoenix uses
  essential
  CF
  optimization.
 
  Also HBASE-10854 can be included in 98.1 in such a case,
 
  Considering those we need a new RC.
 
  -Anoop-
 
  On Tue, Apr 1, 2014 at 10:19 AM, ramkrishna vasudevan 
  ramkrishna.s.vasude...@gmail.com wrote:
 
  +1 on the RC.
  Checked the signature.
  Downloaded the source, built and ran the testcases.
  Ran Integration Tests with ACL and Visibility labels.  Everything
  looks
  fine.
  Compaction, flushes etc too.
 
  Regards
  Ram
 
 
 
  On Tue, Apr 1, 2014 at 2:14 AM, Elliott Clark ecl...@apache.org
  wrote:
 
  +1
 
  Checked the hash
  Checked the tar layout.
  Played with a single node.  Everything seemed good after ITBLL
 
 
  On Mon, Mar 31, 2014 at 9:23 AM, Stack st...@duboce.net wrote:
 
  +1
 
  The hash is good.  Doc. and layout looks good.  UI seems fine.
 
  Ran on small cluster w/ default hadoop 2.2 in hbase against a tip
  of
  the
  branch hadoop 2.4 cluster.  Seems to basically work (small big
  linked
  list
  test worked).
 
  TSDB seems to work fine against this RC.
 
  I don't mean to be stealing our Jon's thunder but in case he is
  too
  occupied to vote here, I'll note that he has gotten our internal
  rig
  running against the tip of the 0.98 branch and it has been
  passing
  green
  running IT tests on a small cluster over hours.
 
  St.Ack
 
 
 
 
  On Sun, Mar 30, 2014 at 12:49 AM, Andrew Purtell 
  apurt...@apache.org
  wrote:
 
  The 4th HBase 0.98.1 release candidate (RC3) is available for
  download
  at
  http://people.apache.org/~apurtell/0.98.1RC3/ and Maven
  artifacts
  are
  also
  available in the temporary repository
  https://repository.apache.org/content/repositories/orgapachehbase-1016
 
  Signed with my code signing key D5365CCD.
 
  The issues resolved in this release can be found here:
 
  https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310753&version=12325664
 
 
  Please try out the candidate and vote +1/-1 by midnight Pacific
  Time
  (00:00
  PDT) on April 6 on whether or not we should release this as
  0.98.1.
 
  --
  Best regards,
 
  - Andy
 
  Problems worthy of attack prove their worth by hitting back. -
  Piet
  Hein
  (via Tom White)
 



Re: how to reverse an integer for rowkey?

2014-03-27 Thread James Taylor
Another option is to use Apache Phoenix and let it do these things for you:
CREATE TABLE my_table(
intField INTEGER,
strField VARCHAR,
CONSTRAINT pk PRIMARY KEY (intField DESC, strField));

Thanks,
James
@JamesPlusPlus
http://phoenix.incubator.apache.org/


On Thu, Mar 27, 2014 at 1:24 AM, Li Li fancye...@gmail.com wrote:

 My rowkey is strField,intField.
 I want to scan it in decreasing order of the int field; how do I make it
 reversed?
 If the row key is Bytes.toBytes(intField) + Bytes.toBytes(strField),
 then the order is increasing.
 One solution is to replace intField with -intField, but if
 intField==Integer.MIN_VALUE, what will happen?
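
If you stay on the raw HBase API, one common trick (a sketch, not from this thread) is to flip the sign bit and then complement, which reverses the order for the whole int range and sidesteps the Integer.MIN_VALUE problem of plain negation:

import org.apache.hadoop.hbase.util.Bytes;

public class DescIntKey {

  // v ^ Integer.MAX_VALUE == ~(v ^ Integer.MIN_VALUE): flip the sign bit so
  // unsigned byte order matches signed order, then invert for DESC ordering.
  public static byte[] toDescBytes(int v) {
    return Bytes.toBytes(v ^ Integer.MAX_VALUE);
  }

  public static void main(String[] args) {
    // 10 sorts before 9, 0 before -1, and Integer.MIN_VALUE sorts last.
    System.out.println(Bytes.compareTo(toDescBytes(10), toDescBytes(9)) < 0);
    System.out.println(Bytes.compareTo(toDescBytes(0), toDescBytes(-1)) < 0);
    System.out.println(
        Bytes.compareTo(toDescBytes(1), toDescBytes(Integer.MIN_VALUE)) < 0);
  }
}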



Re: Filters failing to compare negative numbers (int,float,double or long)

2014-03-19 Thread James Taylor
Another option is to use Apache Phoenix (
http://phoenix.incubator.apache.org/) as it takes care of all these details
for you automatically.

Cheers,
James


On Wed, Mar 19, 2014 at 7:49 AM, Ted Yu yuzhih...@gmail.com wrote:

 In 0.96+, an extensible data type API is provided.
 Please take a look
 at
 hbase-common/src/main/java/org/apache/hadoop/hbase/types/package-info.java

 Specifically the OrderedInt16, OrderedInt32 and OrderedInt64 would be of
 interest to you.

 Cheers



 On Wed, Mar 19, 2014 at 5:33 AM, praveenesh kumar praveen...@gmail.com
 wrote:

  I am not sure if this helps - https://github.com/ndimiduk/orderly
 
  This API supports negative keys, as far as I know. You can give it a try;
  it's simple to implement your keys using this API.
 
  Regards
  Prav
 
 
  On Wed, Mar 19, 2014 at 12:26 PM, Chaitanya chaitanya.ck...@gmail.com
  wrote:
 
   Hi Ramkrishna,
  
   There is one more problem, i.e. I don't have the rights to add anything
  to
   the hbase jar.
   The problem with writing a custom comparator is that we have to add it to
   the jar so that it's available to all the HBase region servers.
   Please correct me if I am wrong.
   Thanks
  
  
  
   -
   Regards,
   Chaitanya
   --
   View this message in context:
  
 
 http://apache-hbase.679495.n3.nabble.com/Filters-failing-to-compare-negative-numbers-int-float-double-or-long-tp4057268p4057279.html
   Sent from the HBase User mailing list archive at Nabble.com.
  
 



Re: org.apache.hadoop.hbase.ipc.SecureRpcEngine class not found in HBase jar

2014-03-04 Thread James Taylor
Let's just target your patch for the Phoenix 4.0 release so we can rely on
Maven having what we need.

Thanks,
James


On Tue, Mar 4, 2014 at 11:29 AM, anil gupta anilgupt...@gmail.com wrote:

 Phoenix refers to the Maven artifact of HBase. If it's not in the Maven repo
 for HBase then either we add the security jar to the Maven repo or we will
 have to find another solution to resolve this.




 On Tue, Mar 4, 2014 at 11:10 AM, Ted Yu yuzhih...@gmail.com wrote:

 The security tarball is published with each 0.94 release:
 http://supergsego.com/apache/hbase/hbase-0.94.17/

 Are you able to utilize that?



 On Tue, Mar 4, 2014 at 10:48 AM, anil gupta anilgupt...@gmail.com
 wrote:

  Thanks for the reply.
 
  Since the HBase security jar is not published in the Maven repo, I am
  running into a problem with enhancing the JDBC connection of Phoenix
  (https://issues.apache.org/jira/browse/PHOENIX-19) to support connecting to
  a secure HBase cluster.
  Is there any particular reason why we don't publish the security jar
  of HBase?
 
  I have been using CDH 4.5 and that has HBase security. For Phoenix, I don't
  think I can reference Cloudera stuff. If we cannot publish the security jar
  in the Maven repo then Phoenix might have to build HBase with the flag that
  Gary mentioned.
 
  Thanks,
  Anil Gupta
 
 
  On Tue, Mar 4, 2014 at 10:40 AM, Gary Helmling ghelml...@gmail.com
  wrote:
 
   For HBase 0.94, you need a version of HBase built with the security
   profile to get SecureRpcEngine and other security classes.  I'm not
 sure
   that the published releases on maven central actually include this.
  
   However, it's easy to build yourself: just add -Psecurity to the mvn
   command line to get the security profile.
  
   For HBase 0.96+ this is no longer necessary, as the security classes
 are
   now part of the main build.
  
  
   On Tue, Mar 4, 2014 at 10:02 AM, anil gupta anilgupt...@gmail.com
  wrote:
  
Hi All,
   
 If I create a Maven project with the following Maven dependency, then the
 HBase jar doesn't have the org.apache.hadoop.hbase.ipc.SecureRpcEngine class:

 <dependency>
   <groupId>org.apache.hbase</groupId>
   <artifactId>hbase</artifactId>
   <version>0.94.12</version>
 </dependency>

 The SecureRpcEngine class is used when the cluster is secured. Is there any
 other Maven dependency I need to use to get that class?
   
--
Thanks & Regards,
Anil Gupta
   
  
 
 
 
  --
  Thanks & Regards,
  Anil Gupta
 




 --
 Thanks & Regards,
 Anil Gupta



Re: HBase Schema for IPTC News ML G2

2014-03-03 Thread James Taylor
Hi Jigar,
Take a look at Apache Phoenix: http://phoenix.incubator.apache.org/
It allows you to use SQL to query over your HBase data and supports
composite primary keys, so you could create a schema like this:

create table news_message(guid varchar not null, version bigint not null,
constraint pk primary key (guid, version desc));

The rows will then sort by guid plus version descending. Then you can issue
sql queries directly against your hbase data without writing map/reduce.
Note that we don't yet support all the sql constructs that postgres does.

HTH,
James
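
A small usage sketch against that schema (the JDBC URL and guid are placeholders): because version is declared DESC, the first row returned for a guid is its latest version:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LatestNewsVersion {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
    String guid = "urn:newsml:example.com:20140303:1234";  // made-up guid

    // Write one version of a message (a newer version simply adds a row).
    PreparedStatement upsert = conn.prepareStatement(
        "UPSERT INTO news_message(guid, version) VALUES(?, ?)");
    upsert.setString(1, guid);
    upsert.setLong(2, 3L);
    upsert.executeUpdate();
    conn.commit();

    // Rows for a guid sort newest-version-first, so LIMIT 1 returns the latest.
    PreparedStatement query = conn.prepareStatement(
        "SELECT version FROM news_message WHERE guid = ? LIMIT 1");
    query.setString(1, guid);
    ResultSet rs = query.executeQuery();
    if (rs.next()) {
      System.out.println("latest version = " + rs.getLong(1));
    }
    conn.close();
  }
}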


On Sun, Mar 2, 2014 at 11:23 PM, Jigar Shah jigar.s...@infodesk.com wrote:

 I am working in the news processing industry; the current system processes
 more than a million articles per week and provides this data in real time to
 users; additionally it provides search capabilities via Lucene.

 We convert all news to the standard IPTC NewsML G2 format
 (http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/)
 before providing it to users (in real time or via search).

 We have a requirement for a component which provides analytical queries on
 news data. I plan to load all this data into HBase and then have MapReduce
 jobs compute the analytical queries. Moreover, the current system is developed
 on PostgreSQL and stores only 3 months of data; anything more than this is big
 data as it doesn't fit on one server.

 But I am a bit confused about developing a schema for it.

 Every news article has:

 *messageID as guid*, the unique id for the news message.
 *version as int,* incremented if a newer version of the same news message is
 published.
 There are other fields like location, channels, title, content, source, etc.

 The current database primary key is a composite of (messageID, version).

 I thought that I should use messageID as the rowKey in HBase, and
 version as the columnFamily, and all columns would be fields of the news item
 (like location, channels, title, body, sentTimestamp, ...).

 Is keeping version as the columnFamily a good idea?

 In reality a single message may have thousands of versions.




Re: creating tables from mysql to hbase

2014-02-18 Thread James Taylor
Hi Jignesh,
Phoenix has support for multi-tenant tables:
http://phoenix.incubator.apache.org/multi-tenancy.html. Also, your primary
key constraint would transfer over as-is, since Phoenix supports composite
row keys. Essentially your pk constraint values get concatenated together
to form your row key in HBase. We do not support unique constraints yet,
but we do support secondary indexing:
http://phoenix.incubator.apache.org/secondary_indexing.html

HTH. Thanks,

James
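
A minimal sketch of the multi-tenant piece (table, columns, and tenant id are made up; the multi-tenancy page linked above is the authoritative reference):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class MultiTenantSetup {
  public static void main(String[] args) throws Exception {
    // Global connection: the first PK column of a MULTI_TENANT table
    // identifies the tenant.
    Connection global = DriverManager.getConnection("jdbc:phoenix:localhost");
    global.createStatement().execute(
        "CREATE TABLE IF NOT EXISTS orders (" +
        " tenant_id VARCHAR NOT NULL," +
        " order_id VARCHAR NOT NULL," +
        " amount DECIMAL" +
        " CONSTRAINT pk PRIMARY KEY (tenant_id, order_id))" +
        " MULTI_TENANT=true");
    global.close();

    // Tenant-specific connection: create a tenant view and query through it;
    // only rows belonging to this tenant are visible.
    Properties props = new Properties();
    props.setProperty("TenantId", "acme");
    Connection tenant = DriverManager.getConnection("jdbc:phoenix:localhost", props);
    tenant.createStatement().execute(
        "CREATE VIEW IF NOT EXISTS my_orders AS SELECT * FROM orders");
    // SELECT * FROM my_orders now returns only acme's rows.
    tenant.close();
  }
}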


On Tue, Feb 18, 2014 at 6:06 PM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 You might want to take a look at Phoenix
 http://phoenix.incubator.apache.org/

 JM


 2014-02-18 19:29 GMT-05:00 Jignesh Patel jigneshmpa...@gmail.com:

  Jean,
 
  We have a product which is working on MySQL and we are trying to move it to
  HBase to create a multi-tenant database.
  I agree with you that because of the NoSQL nature of the database, we should
  denormalize the MySQL schema. However, I was thinking of making quick
  progress by first creating a rough structure of the existing database in
  HBase (from MySQL) and then optimizing/modifying it in a NoSQL way.
 
 
 
  On Tue, Feb 18, 2014 at 4:11 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
   Moving the discussion to the user mailing list.
  
   Hi Jignesh,
  
   You can not really map MySQL tables to HBase. You need to rethink your
   schema when moving to HBase. For example, in MySQL a key can span multiple
   columns; in HBase, it's the row key itself, etc.
  
   What are you trying to achieve?
  
   JM
  
  
   2014-02-18 15:19 GMT-05:00 Jignesh Patel jigneshmpa...@gmail.com:
  
    Has anybody worked on creating tables from MySQL in HBase through
    CloudGraph?
   
  
 



Re: HBase load distribution vs. scan efficiency

2014-01-20 Thread James Taylor
Hi William,
Phoenix uses this bucket mod solution as well (
http://phoenix.incubator.apache.org/salted.html). For the scan, you have to
run it in every possible bucket. You can still do a range scan, you just
have to prepend the bucket number to the start/stop key of each scan you
do, and then you do a merge sort with the results. Phoenix does all this
transparently for you.
Thanks,
James


On Mon, Jan 20, 2014 at 4:51 PM, William Kang weliam.cl...@gmail.comwrote:

 Hi,
 Thank you guys. This is an informative email chain.

 I have one follow up question about using the bucket mod solution. Once
 you add the bucket number as the prefix to the key, how do you retrieve the
 rows? Do you have to use a rowfilter? Will there be any performance issue
 of using the row filter since it seems that would be a full table scan?

 Many thanks.


 William


 On Mon, Jan 20, 2014 at 5:06 AM, Amit Sela am...@infolinks.com wrote:

  The number of scans depends on the number of regions a day's data uses. You
  need to manage compaction and splitting manually.
  If a day's data is 100MB and you want regions to be no more than 200MB then
  it's two regions to scan per day; if it's 1GB then 10, etc.
  Compression will help you maximize data per region and, as I've recently
  learned, if your key occupies most of the bytes in the KeyValue (the key is
  longer than family, qualifier and value) then compression can be very
  efficient; I have a case where 100GB is compressed to 7.
 
 
 
  On Mon, Jan 20, 2014 at 6:56 AM, Vladimir Rodionov
  vrodio...@carrieriq.comwrote:
 
   Ted, how does it differ from row key salting?
  
   Best regards,
   Vladimir Rodionov
   Principal Platform Engineer
   Carrier IQ, www.carrieriq.com
   e-mail: vrodio...@carrieriq.com
  
   
   From: Ted Yu [yuzhih...@gmail.com]
   Sent: Sunday, January 19, 2014 6:53 PM
   To: user@hbase.apache.org
   Subject: Re: HBase load distribution vs. scan efficiency
  
   Bill:
    See http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
  
   FYI
  
  
   On Sun, Jan 19, 2014 at 4:02 PM, Bill Q bill.q@gmail.com wrote:
  
Hi Amit,
Thanks for the reply.
   
If I understand your suggestion correctly, and assuming we have 100
   region
servers, I would have to do 100 scans to merge reads if I want to
 pull
   any
data for a specific date. Is that correct? Is the 100 scans the most
efficient way to deal with this issue?
   
Any thoughts?
   
Many thanks.
   
   
Bill
   
   
On Sun, Jan 19, 2014 at 4:02 PM, Amit Sela am...@infolinks.com
  wrote:
   
  If you'll use bulk load to insert your data you could use the date as key
  prefix and choose the rest of the key in a way that will split each day
  evenly. You'll have X regions for every day and 14X regions for the
  two-weeks window.
 On Jan 19, 2014 8:39 PM, Bill Q bill.q@gmail.com wrote:

  Hi,
  I am designing a schema to host some large volume of data over
  HBase.
We
  collect daily trading data for some markets. And we run a moving
   window
  analysis to make predictions based on a two weeks window.
 
  Since everybody is going to pull the latest two weeks of data every day, if
  we put the date in the lead positions of the key, we will have some hot
  regions. So, we can use a bucketing (date mod bucket number) approach to
  deal with this situation. However, if we have 200 buckets, we need to run
  200 scans to extract all the data in the last two weeks.
 
  My questions are:
  1. What happens when each scan returns its result? Will the scan results be
  sent to a sink-like place that collects and concatenates all the scan
  results?
  2. Why might having 200 scans be a bad thing compared to having only 10
  scans?
  3. Any suggestions on the design?
 
  Many thanks.
 
 
  Bill
 

   
  
   Confidentiality Notice:  The information contained in this message,
   including any attachments hereto, may be confidential and is intended
 to
  be
   read only by the individual or entity to whom this message is
 addressed.
  If
   the reader of this message is not the intended recipient or an agent or
   designee of the intended recipient, please note that any review, use,
   disclosure or distribution of this message or its attachments, in any
  form,
   is strictly prohibited.  If you have received this message in error,
  please
   immediately notify the sender and/or notificati...@carrieriq.com and
   delete or destroy any copy of this message and its attachments.
  
 



Re: HBase load distribution vs. scan efficiency

2014-01-20 Thread James Taylor
The salt byte is a stable hash of the rest of the row key. The system has
to remember the total number of buckets, as that's what's used to mod the
hash value with. Adding new regions/regions servers is fine, as it's
orthogonal to the bucket number, though typically the cluster size
determines the total number of salt buckets. Phoenix does not allow you to
change the number of salt buckets for a table after it's created (you'd
need to re-write the table in order to do that).

The salt byte is completely transparent in Phoenix, as your API is SQL
through JDBC. Phoenix manages setting the salt byte, skipping it when
interpreting the row key columns, knowing that a range scan needs to run on
all possible bucket numbers, and that point gets don't, etc.

Thanks,
James
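
To make the "stable hash mod bucket count" idea concrete, a generic sketch (this illustrates the concept; it is not Phoenix's exact hash function):

import java.util.Arrays;

public class SaltByte {

  // Derive a stable salt byte from the rest of the row key: hash the key bytes
  // and mod by the bucket count fixed at table-creation time.
  public static byte saltFor(byte[] rowKey, int saltBuckets) {
    return (byte) ((Arrays.hashCode(rowKey) & Integer.MAX_VALUE) % saltBuckets);
  }

  // The salted key is the salt byte followed by the original key, which is why
  // a range scan has to be issued once per possible salt byte value.
  public static byte[] salt(byte[] rowKey, int saltBuckets) {
    byte[] salted = new byte[rowKey.length + 1];
    salted[0] = saltFor(rowKey, saltBuckets);
    System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
    return salted;
  }
}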


On Mon, Jan 20, 2014 at 6:59 PM, William Kang weliam.cl...@gmail.comwrote:

 Hi James,
 Thanks for the link.

 Does this mean that the system has to remember the prefix, and append the
 prefix to the original key before the scan starts?

 If this is the case, and I somehow decide to change the prefix (maybe I added
 many more RSs, or want to use a different salting mechanism), it might cause
 all sorts of issues?

 If this is not the case, how would a user know what prefix to append to
 start the scan? This is why I asked about a row filter, since you can use a
 regex to match the original key and skip the prefix. But I am wondering about
 the performance implications of using a row filter.

 Many thanks.


 William


 On Mon, Jan 20, 2014 at 8:15 PM, James Taylor jtay...@salesforce.com
 wrote:

  Hi William,
  Phoenix uses this bucket mod solution as well (
  http://phoenix.incubator.apache.org/salted.html). For the scan, you have
  to
  run it in every possible bucket. You can still do a range scan, you just
  have to prepend the bucket number to the start/stop key of each scan you
  do, and then you do a merge sort with the results. Phoenix does all this
  transparently for you.
  Thanks,
  James
 
 
  On Mon, Jan 20, 2014 at 4:51 PM, William Kang weliam.cl...@gmail.com
  wrote:
 
   Hi,
   Thank you guys. This is an informative email chain.
  
   I have one follow up question about using the bucket mod solution.
 Once
   you add the bucket number as the prefix to the key, how do you retrieve
  the
   rows? Do you have to use a rowfilter? Will there be any performance
 issue
   of using the row filter since it seems that would be a full table scan?
  
   Many thanks.
  
  
   William
  
  
   On Mon, Jan 20, 2014 at 5:06 AM, Amit Sela am...@infolinks.com
 wrote:
  
The number of scans depends on the number of regions a day's data
 uses.
   You
need to manage compaction and splitting manually.
If a days data is 100MB and you want regions to be no more than 200MB
   than
it's two regions to scan per day, if it's 1GB than 10 etc.
Compression will help you maximize data per region and as I've
 recently
learned, if your key occupies most of the byes in KeyValue (key is
  longer
than family, qualifier and value) than compression can be very
   efficient, I
have a case where 100GB is compressed to 7.
   
   
   
On Mon, Jan 20, 2014 at 6:56 AM, Vladimir Rodionov
vrodio...@carrieriq.comwrote:
   
 Ted, how does it differ from row key salting?

 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com

 
 From: Ted Yu [yuzhih...@gmail.com]
 Sent: Sunday, January 19, 2014 6:53 PM
 To: user@hbase.apache.org
 Subject: Re: HBase load distribution vs. scan efficiency

 Bill:
 See  http://blog.sematext.com/2012/04/09/hbasewd


   
  
 
 -avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

 FYI


 On Sun, Jan 19, 2014 at 4:02 PM, Bill Q bill.q@gmail.com
  wrote:

  Hi Amit,
  Thanks for the reply.
 
  If I understand your suggestion correctly, and assuming we have
 100
 region
  servers, I would have to do 100 scans to merge reads if I want to
   pull
 any
  data for a specific date. Is that correct? Is the 100 scans the
  most
  efficient way to deal with this issue?
 
  Any thoughts?
 
  Many thanks.
 
 
  Bill
 
 
  On Sun, Jan 19, 2014 at 4:02 PM, Amit Sela am...@infolinks.com
wrote:
 
   If you'll use bulk load to insert your data you could use the
  date
   as
 key
   prefix and choose the rest of the key in a way that will split
  each
day
   evenly. You'll have X regions for Evey day  14X regions for
 the
   two
  weeks
   window.
   On Jan 19, 2014 8:39 PM, Bill Q bill.q@gmail.com
 wrote:
  
Hi,
I am designing a schema to host some large volume of data
 over
HBase.
  We
collect daily trading data for some markets. And we run a
  moving

Re: Question on efficient, ordered composite keys

2014-01-14 Thread James Taylor
Hi Henning,
My favorite implementation of efficient composite row keys is Phoenix. We
support composite row keys whose byte representation sorts according to the
natural sort order of the values (inspired by Lily). You can use our type
system independently of querying/inserting data with Phoenix, the advantage
being that when you want to support ad hoc querying through SQL using
Phoenix, it'll just work.

Thanks,
James
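
For comparison, a sketch of one common hand-rolled encoding when not going through a library (Phoenix's own type system differs in its details): fixed-width numbers with the sign bit flipped so unsigned byte order matches signed order, and strings followed by a 0x00 terminator so field boundaries don't bleed into whatever comes next:

import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

import org.apache.hadoop.hbase.util.Bytes;

public class CompositeKey {

  private static final Charset UTF8 = Charset.forName("UTF-8");

  // Compose (int, string) into a row key whose unsigned byte order matches the
  // natural sort order of the values. Assumes the string contains no 0x00 byte.
  public static byte[] encode(int number, String text) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(Bytes.toBytes(number ^ Integer.MIN_VALUE), 0, 4);  // sign-bit flip
    byte[] textBytes = text.getBytes(UTF8);
    out.write(textBytes, 0, textBytes.length);
    out.write(0);  // terminator keeps "ab"+<next field> from sorting after "abc"
    return out.toByteArray();
  }
}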


On Tue, Jan 14, 2014 at 7:02 AM, Ted Yu yuzhih...@gmail.com wrote:

 Please take a look at HBASE-8089 which is an umbrella JIRA.
 Some of its subtasks are in 0.96

 bq. claiming that short keys (as well as short column names) are relevant
 bq. Is that also true in 0.94.x?

 That is true in 0.94.x

 Cheers


 On Tue, Jan 14, 2014 at 6:56 AM, Henning Blohm henning.bl...@zfabrik.de
 wrote:

  Hi,
 
  for an application still running on HBase 0.90.4 (but moving to 0.94.6) we
  are thinking about using more efficient composite row keys compared to what
  we use today (fixed-length strings with a '/' separator).
 
  I ran into http://hbase.apache.org/book/rowkey.design.html claiming that
  short keys (as well as short column names) are relevant also when using
  compression (as there is no compression in caches/indices). Is that also
  true in 0.94.x?
 
  If so, is there some support for efficient, correctly ordered, byte[]
  serialized composite row keys? I ran into HBASE-7221
  (https://issues.apache.org/jira/browse/HBASE-7221) and HBASE-7692.
 
  For some time it seemed Orderly (https://github.com/ndimiduk/orderly)
 was
  suggested but then abandoned again in favor of ... nothing really.
 
  So, in short, do you have any favorite / suggested implementation?
 
  Thanks,
  Henning
 



Re: use hbase as distributed crawl's scheduler

2014-01-04 Thread James Taylor
Please take a look at our Apache incubator proposal, as I think that may
answer your questions: https://wiki.apache.org/incubator/PhoenixProposal


On Fri, Jan 3, 2014 at 11:47 PM, Li Li fancye...@gmail.com wrote:

 So what's the relationship between Phoenix and HBase? Something like Hadoop
 and Hive?


 On Sat, Jan 4, 2014 at 3:43 PM, James Taylor jtay...@salesforce.com
 wrote:
  Hi LiLi,
  Phoenix isn't an experimental project. We're on our 2.2 release, and many
  companies (including the company for which I'm employed, Salesforce.com)
  use it in production today.
  Thanks,
  James
 
 
  On Fri, Jan 3, 2014 at 11:39 PM, Li Li fancye...@gmail.com wrote:
 
  hi James,
  phoenix seems great but it's now only a experimental project. I
  want to use only hbase. could you tell me the difference of Phoenix
  and hbase? If I use hbase only, how should I design the schema and
  some extra things for my goal? thank you
 
  On Sat, Jan 4, 2014 at 3:41 AM, James Taylor jtay...@salesforce.com
  wrote:
   On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika asaf.mes...@gmail.com
  wrote:
  
   Couple of notes:
   1. When updating to status you essentially add a new rowkey into
 HBase,
  I
   would give it up all together. The essential requirement seems to
 point
  at
   retrieving a list of urls in a certain order.
  
   Not sure on this, but seemed to me that setting the status field is
  forcing
   the urls that have been processed to be at the end of the sort order.
  
   2. Wouldn't salting ruin the sort order required? Priority, date
 added?
  
   No, as Phoenix maintains returning rows in row key order even when
  they're
   salted. We do parallel scans for each bucket and do a merge sort on
 the
   client, so the cost is pretty low for this (we also provide a way of
   turning this off if your use case doesn't need it).
  
   Two years, JM? Now you're really going to have to start using Phoenix
 :-)
  
  
   On Friday, January 3, 2014, James Taylor wrote:
  
Sure, no problem. One addition: depending on the cardinality of
 your
priority column, you may want to salt your table to prevent
  hotspotting,
since you'll have a monotonically increasing date in the key. To do
  that,
just add  SALT_BUCKETS=n on to your query, where n is the
  number of
machines in your cluster. You can read more about salting here:
http://phoenix.incubator.apache.org/salted.html
   
   
On Thu, Jan 2, 2014 at 11:36 PM, Li Li fancye...@gmail.com
 wrote:
   
 thank you. it's great.

 On Fri, Jan 3, 2014 at 3:15 PM, James Taylor 
  jtay...@salesforce.com
 wrote:
  Hi LiLi,
  Have a look at Phoenix (http://phoenix.incubator.apache.org/).
  It's
   a
 SQL
  skin on top of HBase. You can model your schema and issue your
   queries
 just
  like you would with MySQL. Something like this:
 
  // Create table that optimizes for your most common query
  // (i.e. the PRIMARY KEY constraint should be ordered as you'd
  want
your
  rows ordered)
  CREATE TABLE url_db (
  status TINYINT,
  priority INTEGER NOT NULL,
  added_time DATE,
  url VARCHAR NOT NULL
  CONSTRAINT pk PRIMARY KEY (status, priority, added_time,
  url));
 
  int lastStatus = 0;
  int lastPriority = 0;
  Date lastAddedTime = new Date(0);
  String lastUrl = ;
 
  while (true) {
  // Use row value constructor to page through results in
  batches
   of
 1000
  String query = 
  SELECT * FROM url_db
  WHERE status=0 AND (status, priority, added_time, url)
 
  (?,
   ?,
 ?,
  ?)
  ORDER BY status, priority, added_time, url
  LIMIT 1000
  PreparedStatement stmt =
 connection.prepareStatement(query);
 
  // Bind parameters
  stmt.setInt(1, lastStatus);
  stmt.setInt(2, lastPriority);
  stmt.setDate(3, lastAddedTime);
  stmt.setString(4, lastUrl);
  ResultSet resultSet = stmt.executeQuery();
 
  while (resultSet.next()) {
  // Remember last row processed so that you can start
 after
   that
 for
  next batch
  lastStatus = resultSet.getInt(1);
  lastPriority = resultSet.getInt(2);
  lastAddedTime = resultSet.getDate(3);
  lastUrl = resultSet.getString(4);
 
  doSomethingWithUrls();
 
  UPSERT INTO url_db(status, priority, added_time, url)
  VALUES (1, ?, CURRENT_DATE(), ?);
 
  }
  }
 
  If you need to efficiently query on url, add a secondary index
  like
this:
 
  CREATE INDEX url_index ON url_db (url);
 
  Please let me know if you have questions.
 
  Thanks,
  James
 
 
 
 
  On Thu, Jan 2, 2014 at 10:22 PM, Li Li fancye...@gmail.com
  wrote:
 
  thank you. But I can't use nutch. could you tell me

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread James Taylor
Sure, no problem. One addition: depending on the cardinality of your
priority column, you may want to salt your table to prevent hotspotting,
since you'll have a monotonically increasing date in the key. To do that,
just add  SALT_BUCKETS=n on to your query, where n is the number of
machines in your cluster. You can read more about salting here:
http://phoenix.incubator.apache.org/salted.html


On Thu, Jan 2, 2014 at 11:36 PM, Li Li fancye...@gmail.com wrote:

 thank you. it's great.

 On Fri, Jan 3, 2014 at 3:15 PM, James Taylor jtay...@salesforce.com
 wrote:
  Hi LiLi,
  Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a
 SQL
  skin on top of HBase. You can model your schema and issue your queries
 just
  like you would with MySQL. Something like this:
 
  // Create table that optimizes for your most common query
  // (i.e. the PRIMARY KEY constraint should be ordered as you'd want your
  rows ordered)
  CREATE TABLE url_db (
  status TINYINT,
  priority INTEGER NOT NULL,
  added_time DATE,
  url VARCHAR NOT NULL
  CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
 
  int lastStatus = 0;
  int lastPriority = 0;
  Date lastAddedTime = new Date(0);
  String lastUrl = ;
 
  while (true) {
  // Use row value constructor to page through results in batches of
 1000
  String query = 
  SELECT * FROM url_db
  WHERE status=0 AND (status, priority, added_time, url)  (?, ?,
 ?,
  ?)
  ORDER BY status, priority, added_time, url
  LIMIT 1000
  PreparedStatement stmt = connection.prepareStatement(query);
 
  // Bind parameters
  stmt.setInt(1, lastStatus);
  stmt.setInt(2, lastPriority);
  stmt.setDate(3, lastAddedTime);
  stmt.setString(4, lastUrl);
  ResultSet resultSet = stmt.executeQuery();
 
  while (resultSet.next()) {
  // Remember last row processed so that you can start after that
 for
  next batch
  lastStatus = resultSet.getInt(1);
  lastPriority = resultSet.getInt(2);
  lastAddedTime = resultSet.getDate(3);
  lastUrl = resultSet.getString(4);
 
  doSomethingWithUrls();
 
  UPSERT INTO url_db(status, priority, added_time, url)
  VALUES (1, ?, CURRENT_DATE(), ?);
 
  }
  }
 
  If you need to efficiently query on url, add a secondary index like this:
 
  CREATE INDEX url_index ON url_db (url);
 
  Please let me know if you have questions.
 
  Thanks,
  James
 
 
 
 
  On Thu, Jan 2, 2014 at 10:22 PM, Li Li fancye...@gmail.com wrote:
 
  thank you. But I can't use nutch. could you tell me how hbase is used
  in nutch? or hbase is only used to store webpage.
 
  On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
   Hi,
  
   Have a look at http://nutch.apache.org .  Version 2.x uses HBase
 under
  the
   hood.
  
   Otis
   --
   Performance Monitoring * Log Analytics * Search Analytics
    Solr & Elasticsearch Support * http://sematext.com/
  
  
   On Fri, Jan 3, 2014 at 1:12 AM, Li Li fancye...@gmail.com wrote:
  
   hi all,
I want to use hbase to store all urls(crawled or not crawled).
   And each url will has a column named priority which represent the
   priority of the url. I want to get the top N urls order by
 priority(if
   priority is the same then url whose timestamp is ealier is prefered).
in using something like mysql, my client application may like:
while true:
select  url from url_db order by priority,addedTime limit
   1000 where status='not_crawled';
do something with this urls;
extract more urls and insert them into url_db;
How should I design hbase schema for this application? Is hbase
   suitable for me?
I found in this article
  
 
 http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
   ,
   they use redis to store urls. I think hbase is originated from
   bigtable and google use bigtable to store webpage, so for huge number
   of urls, I prefer distributed system like hbase.
  
 



Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread James Taylor
On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika asaf.mes...@gmail.com wrote:

 Couple of notes:
 1. When updating to status you essentially add a new rowkey into HBase, I
 would give it up all together. The essential requirement seems to point at
 retrieving a list of urls in a certain order.

Not sure on this, but seemed to me that setting the status field is forcing
the urls that have been processed to be at the end of the sort order.

2. Wouldn't salting ruin the sort order required? Priority, date added?

No, as Phoenix maintains returning rows in row key order even when they're
salted. We do parallel scans for each bucket and do a merge sort on the
client, so the cost is pretty low for this (we also provide a way of
turning this off if your use case doesn't need it).

Two years, JM? Now you're really going to have to start using Phoenix :-)


 On Friday, January 3, 2014, James Taylor wrote:

  Sure, no problem. One addition: depending on the cardinality of your
  priority column, you may want to salt your table to prevent hotspotting,
  since you'll have a monotonically increasing date in the key. To do that,
  just add  SALT_BUCKETS=n on to your query, where n is the number of
  machines in your cluster. You can read more about salting here:
  http://phoenix.incubator.apache.org/salted.html
 
 
  On Thu, Jan 2, 2014 at 11:36 PM, Li Li fancye...@gmail.com wrote:
 
   thank you. it's great.
  
   On Fri, Jan 3, 2014 at 3:15 PM, James Taylor jtay...@salesforce.com
   wrote:
Hi LiLi,
Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's
 a
   SQL
skin on top of HBase. You can model your schema and issue your
 queries
   just
like you would with MySQL. Something like this:
   
// Create table that optimizes for your most common query
// (i.e. the PRIMARY KEY constraint should be ordered as you'd want
  your
rows ordered)
CREATE TABLE url_db (
status TINYINT,
priority INTEGER NOT NULL,
added_time DATE,
url VARCHAR NOT NULL
CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
   
int lastStatus = 0;
int lastPriority = 0;
Date lastAddedTime = new Date(0);
String lastUrl = ;
   
while (true) {
// Use row value constructor to page through results in batches
 of
   1000
String query = 
SELECT * FROM url_db
WHERE status=0 AND (status, priority, added_time, url)  (?,
 ?,
   ?,
?)
ORDER BY status, priority, added_time, url
LIMIT 1000
PreparedStatement stmt = connection.prepareStatement(query);
   
// Bind parameters
stmt.setInt(1, lastStatus);
stmt.setInt(2, lastPriority);
stmt.setDate(3, lastAddedTime);
stmt.setString(4, lastUrl);
ResultSet resultSet = stmt.executeQuery();
   
while (resultSet.next()) {
// Remember last row processed so that you can start after
 that
   for
next batch
lastStatus = resultSet.getInt(1);
lastPriority = resultSet.getInt(2);
lastAddedTime = resultSet.getDate(3);
lastUrl = resultSet.getString(4);
   
doSomethingWithUrls();
   
UPSERT INTO url_db(status, priority, added_time, url)
VALUES (1, ?, CURRENT_DATE(), ?);
   
}
}
   
If you need to efficiently query on url, add a secondary index like
  this:
   
CREATE INDEX url_index ON url_db (url);
   
Please let me know if you have questions.
   
Thanks,
James
   
   
   
   
On Thu, Jan 2, 2014 at 10:22 PM, Li Li fancye...@gmail.com wrote:
   
thank you. But I can't use nutch. could you tell me how hbase is
 used
in nutch? or hbase is only used to store webpage.
   
On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 Hi,

 Have a look at http://nutch.apache.org .  Version 2.x uses HBase
   under
the
 hood.

 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
  Solr & Elasticsearch Support * http://sematext.com/


 On Fri, Jan 3, 2014 at 1:12 AM, Li Li 



Re: secondary index feature

2014-01-03 Thread James Taylor
 have a large data set (hence HBASE) that will be queried (mostly
 point-gets via an index) in some linear correlation with its size.

 Is there any data on how RLI (or in particular Phoenix) query throughput
 correlates with the number of region servers assuming homogeneously
 distributed data?

 Thanks,
 Henning




 On 24.12.2013 12:18, Henning Blohm wrote:

All that sounds very promising. I will give it a try and let you
 know
 how things worked out.

 Thanks,
 Henning

 On 12/23/2013 08:10 PM, Jesse Yates wrote:

 The work that James is referencing grew out of the discussions Lars and I
 had (which led to those blog posts). The solution we implement is designed
 to be generic, as James mentioned above, but was written with all the hooks
 necessary for Phoenix to do some really fast updates (or skipping updates
 in the case where there is no change).

 You should be able to plug in your own simple index builder (there is
 an example in the phoenix codebase:
 https://github.com/forcedotcom/phoenix/tree/master/src/main/java/com/salesforce/hbase/index/covered/example)
 into the basic solution, which supports the same transactional guarantees as
 HBase (per row) + data guarantees across the index rows. There are more
 details in the presentations James linked.

 I'd love to see if your implementation can fit into the framework we wrote
 - we would be happy to work with you to see if it needs some more hooks or
 modifications - I have a feeling this is pretty much what you guys will
 need.

 -Jesse


 On Mon, Dec 23, 2013 at 10:01 AM, James Taylor

 jtay...@salesforce.com

 wrote:

   Henning,

 Jesse Yates wrote the back-end of our global secondary indexing

 system

 in
 Phoenix. He designed it as a separate, pluggable module with no

 Phoenix

 dependencies. Here's an overview of the feature:
 https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The
 section that discusses the data guarantees and failure management

 might

 be
 of interest to you:

  https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management

 This presentation also gives a good overview of the pluggability of

 his



 --
 Henning Blohm

 *ZFabrik Software KG*

 T:  +49 6227 3984255
 F:  +49 6227 3984254
 M:  +49 1781891820

 Lammstrasse 2 69190 Walldorf

 henning.bl...@zfabrik.de mailto:henning.bl...@zfabrik.de
 Linkedin http://www.linkedin.com/pub/henning-blohm/0/7b5/628
 ZFabrik http://www.zfabrik.de
 Blog http://www.z2-environment.net/blog
 Z2-Environment http://www.z2-environment.eu
 Z2 Wiki http://redmine.z2-environment.net




Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread James Taylor
Hi LiLi,
Phoenix isn't an experimental project. We're on our 2.2 release, and many
companies (including the company for which I'm employed, Salesforce.com)
use it in production today.
Thanks,
James


On Fri, Jan 3, 2014 at 11:39 PM, Li Li fancye...@gmail.com wrote:

 hi James,
 Phoenix seems great, but it's now only an experimental project. I
 want to use only HBase. Could you tell me the difference between Phoenix
 and HBase? If I use HBase only, how should I design the schema and
 the extra pieces needed for my goal? Thank you

 On Sat, Jan 4, 2014 at 3:41 AM, James Taylor jtay...@salesforce.com
 wrote:
  On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika asaf.mes...@gmail.com
 wrote:
 
  Couple of notes:
  1. When updating to status you essentially add a new rowkey into HBase,
 I
  would give it up all together. The essential requirement seems to point
 at
  retrieving a list of urls in a certain order.
 
  Not sure on this, but seemed to me that setting the status field is
 forcing
  the urls that have been processed to be at the end of the sort order.
 
  2. Wouldn't salting ruin the sort order required? Priority, date added?
 
  No, as Phoenix maintains returning rows in row key order even when
 they're
  salted. We do parallel scans for each bucket and do a merge sort on the
  client, so the cost is pretty low for this (we also provide a way of
  turning this off if your use case doesn't need it).
 
  Two years, JM? Now you're really going to have to start using Phoenix :-)
 
 
  On Friday, January 3, 2014, James Taylor wrote:
 
   Sure, no problem. One addition: depending on the cardinality of your
   priority column, you may want to salt your table to prevent
 hotspotting,
   since you'll have a monotonically increasing date in the key. To do
 that,
   just add  SALT_BUCKETS=n on to your query, where n is the
 number of
   machines in your cluster. You can read more about salting here:
   http://phoenix.incubator.apache.org/salted.html
  
  
   On Thu, Jan 2, 2014 at 11:36 PM, Li Li fancye...@gmail.com wrote:
  
thank you. it's great.
   
On Fri, Jan 3, 2014 at 3:15 PM, James Taylor 
 jtay...@salesforce.com
wrote:
 Hi LiLi,
 Have a look at Phoenix (http://phoenix.incubator.apache.org/).
 It's
  a
SQL
 skin on top of HBase. You can model your schema and issue your
  queries
just
 like you would with MySQL. Something like this:

 // Create table that optimizes for your most common query
 // (i.e. the PRIMARY KEY constraint should be ordered as you'd
 want
   your
 rows ordered)
 CREATE TABLE url_db (
 status TINYINT,
 priority INTEGER NOT NULL,
 added_time DATE,
 url VARCHAR NOT NULL
 CONSTRAINT pk PRIMARY KEY (status, priority, added_time,
 url));

 int lastStatus = 0;
 int lastPriority = 0;
 Date lastAddedTime = new Date(0);
 String lastUrl = ;

 while (true) {
 // Use row value constructor to page through results in
 batches
  of
1000
 String query = 
 SELECT * FROM url_db
 WHERE status=0 AND (status, priority, added_time, url) 
 (?,
  ?,
?,
 ?)
 ORDER BY status, priority, added_time, url
 LIMIT 1000
 PreparedStatement stmt = connection.prepareStatement(query);

 // Bind parameters
 stmt.setInt(1, lastStatus);
 stmt.setInt(2, lastPriority);
 stmt.setDate(3, lastAddedTime);
 stmt.setString(4, lastUrl);
 ResultSet resultSet = stmt.executeQuery();

 while (resultSet.next()) {
 // Remember last row processed so that you can start after
  that
for
 next batch
 lastStatus = resultSet.getInt(1);
 lastPriority = resultSet.getInt(2);
 lastAddedTime = resultSet.getDate(3);
 lastUrl = resultSet.getString(4);

 doSomethingWithUrls();

 UPSERT INTO url_db(status, priority, added_time, url)
 VALUES (1, ?, CURRENT_DATE(), ?);

 }
 }

 If you need to efficiently query on url, add a secondary index
 like
   this:

 CREATE INDEX url_index ON url_db (url);

 Please let me know if you have questions.

 Thanks,
 James




 On Thu, Jan 2, 2014 at 10:22 PM, Li Li fancye...@gmail.com
 wrote:

 thank you. But I can't use nutch. could you tell me how hbase is
  used
 in nutch? or hbase is only used to store webpage.

 On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
  Hi,
 
  Have a look at http://nutch.apache.org .  Version 2.x uses
 HBase
under
 the
  hood.
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
   Solr & Elasticsearch Support * http://sematext.com/
 
 
  On Fri, Jan 3, 2014 at 1:12 AM, Li Li 
 



Re: use hbase as distributed crawl's scheduler

2014-01-02 Thread James Taylor
Otis,
I didn't realize Nutch uses HBase underneath. Might be interesting if you
serialized data in a Phoenix-compliant manner, as you could run SQL queries
directly on top of it.

Thanks,
James


On Thu, Jan 2, 2014 at 10:17 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 Have a look at http://nutch.apache.org .  Version 2.x uses HBase under the
 hood.

 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
  Solr & Elasticsearch Support * http://sematext.com/


 On Fri, Jan 3, 2014 at 1:12 AM, Li Li fancye...@gmail.com wrote:

  hi all,
   I want to use hbase to store all urls(crawled or not crawled).
  And each url will have a column named priority which represents the
  priority of the url. I want to get the top N urls ordered by priority (if
  priority is the same then the url whose timestamp is earlier is preferred).
   in using something like mysql, my client application may like:
   while true:
   select  url from url_db order by priority,addedTime limit
  1000 where status='not_crawled';
   do something with this urls;
   extract more urls and insert them into url_db;
   How should I design hbase schema for this application? Is hbase
  suitable for me?
   I found in this article
 
 http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
  ,
  they use redis to store urls. I think hbase is originated from
  bigtable and google use bigtable to store webpage, so for huge number
  of urls, I prefer distributed system like hbase.
 



Re: use hbase as distributed crawl's scheduler

2014-01-02 Thread James Taylor
Hi LiLi,
Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
skin on top of HBase. You can model your schema and issue your queries just
like you would with MySQL. Something like this:

// Create table that optimizes for your most common query
// (i.e. the PRIMARY KEY constraint should be ordered as you'd want your
rows ordered)
CREATE TABLE url_db (
status TINYINT,
priority INTEGER NOT NULL,
added_time DATE,
url VARCHAR NOT NULL
CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));

int lastStatus = 0;
int lastPriority = 0;
Date lastAddedTime = new Date(0);
String lastUrl = "";

while (true) {
// Use row value constructor to page through results in batches of 1000
String query =
    "SELECT * FROM url_db " +
    "WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
    "ORDER BY status, priority, added_time, url " +
    "LIMIT 1000";
PreparedStatement stmt = connection.prepareStatement(query);

// Bind parameters
stmt.setInt(1, lastStatus);
stmt.setInt(2, lastPriority);
stmt.setDate(3, lastAddedTime);
stmt.setString(4, lastUrl);
ResultSet resultSet = stmt.executeQuery();

while (resultSet.next()) {
// Remember last row processed so that you can start after that for
next batch
lastStatus = resultSet.getInt(1);
lastPriority = resultSet.getInt(2);
lastAddedTime = resultSet.getDate(3);
lastUrl = resultSet.getString(4);

doSomethingWithUrls();

// Then mark the row as processed, e.g. with a second statement:
//   UPSERT INTO url_db(status, priority, added_time, url)
//   VALUES (1, ?, CURRENT_DATE(), ?);

}
}

If you need to efficiently query on url, add a secondary index like this:

CREATE INDEX url_index ON url_db (url);

Please let me know if you have questions.

Thanks,
James




On Thu, Jan 2, 2014 at 10:22 PM, Li Li fancye...@gmail.com wrote:

 thank you. But I can't use nutch. could you tell me how hbase is used
 in nutch? or is hbase only used to store webpages?

 On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
  Hi,
 
  Have a look at http://nutch.apache.org .  Version 2.x uses HBase under
 the
  hood.
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
   Solr & Elasticsearch Support * http://sematext.com/
 
 
  On Fri, Jan 3, 2014 at 1:12 AM, Li Li fancye...@gmail.com wrote:
 
   hi all,
    I want to use hbase to store all urls (crawled or not crawled).
   Each url will have a column named priority which represents the
   priority of the url. I want to get the top N urls ordered by priority (if
   priority is the same then the url whose timestamp is earlier is preferred).
    If using something like mysql, my client application might look like:
    while true:
    select url from url_db where status='not_crawled'
   order by priority, addedTime limit 1000;
    do something with these urls;
    extract more urls and insert them into url_db;
    How should I design an hbase schema for this application? Is hbase
   suitable for me?
    I found in this article
  
  http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
   ,
   that they use redis to store urls. I think hbase originated from
   bigtable and google uses bigtable to store webpages, so for a huge number
   of urls, I prefer a distributed system like hbase.
  



Re: secondary index feature

2013-12-23 Thread James Taylor
Henning,
Jesse Yates wrote the back-end of our global secondary indexing system in
Phoenix. He designed it as a separate, pluggable module with no Phoenix
dependencies. Here's an overview of the feature:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing. The section
that discusses the data guarantees and failure management might be of
interest to you:
https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#data-guarantees-and-failure-management

This presentation also gives a good overview of the pluggability of his
implementation:
http://files.meetup.com/1350427/PhoenixIndexing-SF-HUG_09-26-13.pptx

Thanks,
James


On Mon, Dec 23, 2013 at 3:47 AM, Henning Blohm henning.bl...@zfabrik.de wrote:

 Lars, that is exactly why I am hesitant to use one the core level generic
 approaches (apart from having difficulties to identify the still active
 projects): I have doubts I can sufficiently explain to myself when and
 where they fail.

 With toolbox approach I meant to say that turning entity data into index
 data is not done generically but rather involving domain specific
 application code that

 - indicates what makes an index key given an entity
 - indicates whether an index entry is still valid given an entity

 That code is also used during the index rebuild and trimming (an M/R Job)

 So validating whether an index entry is valid means to load the entity
 pointed to and - before considering it a valid result - validating whether
 values of the entity still match with the index.

 The entity is written last, hence when the client dies halfway through the
 update you may get stale index entries but nothing else should break.
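
 To make that ordering concrete, here is a minimal sketch of the write path
 against the plain 0.94-era HBase client API. The column family/qualifier names
 and the way the index key is derived are illustrative assumptions only, not
 code from any existing library:

 import java.io.IOException;
 import org.apache.hadoop.hbase.client.HTableInterface;
 import org.apache.hadoop.hbase.client.Put;
 import org.apache.hadoop.hbase.util.Bytes;

 public class DualWriteSketch {
     // indexKey comes from the application-specific "what makes an index key
     // given an entity" callback described above; entityKey/entityBytes are the
     // entity row key and its serialized form.
     public static void save(HTableInterface indexTable, HTableInterface entityTable,
                             byte[] indexKey, byte[] entityKey, byte[] entityBytes)
             throws IOException {
         long ts = System.currentTimeMillis();  // same timestamp for both writes

         // 1. Write the index entry first; it points back at the entity row.
         Put indexPut = new Put(indexKey, ts);
         indexPut.add(Bytes.toBytes("i"), Bytes.toBytes("ref"), entityKey);
         indexTable.put(indexPut);

         // 2. Write the entity last. A crash in between leaves at worst a stale
         //    index entry, which lookups ignore and the rebuild job trims.
         Put entityPut = new Put(entityKey, ts);
         entityPut.add(Bytes.toBytes("d"), Bytes.toBytes("e"), entityBytes);
         entityTable.put(entityPut);
     }
 }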

 For scanning along the index, we are using a chunk iterator that is, we
 read n index entries ahead and then do point look ups for the entities. How
 would you avoid point-gets when scanning via an index (as most likely,
 entities are ordered independently from the index - hence the index)?

 Something really important to note is that there is no intention to build
 a completely generic solution, in particular not (this time - unlike the
 other post of mine you responded to) taking row versioning into account.
 Instead, row time stamps are used to delete stale entries (old entries
 after an index rebuild).

 Thanks a lot for your blog pointers. Haven't had time to study in depth
 but at first glance there is lot of overlap of what you are proposing and
 what I ended up doing considering the first post.

 On the second post: Indeed I have not worried too much about transactional
 isolation of updates. If index update and entity update use the same HBase
 time stamp, the result should at least be consistent, right?

 Btw. in no way am I claiming originality of my thoughts - in particular I
 read http://jyates.github.io/2012/07/09/consistent-enough-secondary-indexes.html
 a while back.

 Thanks,
 Henning

 Ps.: I might write about this discussion later in my blog


 On 22.12.2013 23:37, lars hofhansl wrote:

 The devil is often in the details. On the surface it looks simple.

 How specifically are the stale indexes ignored? Are there guaranteed to be
 no races?
 Is deletion handled correctly? Does it work with multiple versions?
 What happens when the client dies 1/2 way through an update?
 It's easy to do eventually consistent indexes. Truly consistent indexes
 without transactions are tricky.


 Also, scanning an index and then doing point-gets against a main table is
 slow (unless the index is very selective; the Phoenix team measured that
 there is only an advantage if the index filters out 98-99% of the data).
 So then one would revert to covered indexes, and suddenly it is not so easy
 to detect stale index entries.

 I blogged about these issues here:
 http://hadoop-hbase.blogspot.com/2012/10/musings-on-secondary-indexes.html
 http://hadoop-hbase.blogspot.com/2012/10/secondary-indexes-part-ii.html

 Phoenix has a (pretty involved) solution now that works around the fact
 that HBase has no transactions.


 -- Lars



 
   From: Henning Blohm henning.bl...@zfabrik.de
 To: user user@hbase.apache.org
 Sent: Sunday, December 22, 2013 2:11 AM
 Subject: secondary index feature

 Lately we have added a secondary index feature to a persistence tier
 over HBASE. Essentially we implemented what is described as Dual-Write
 Secondary Index in http://hbase.apache.org/book/secondary.indexes.html.
 I.e. while updating an entity, actually before writing the actual
 update, indexes are updated. Lookup via the index ignores stale entries.
 A recurring rebuild and clean-out of stale entries takes care that the
 indexes are trimmed and accurate.

 None of this was terribly complex to implement. In fact, it seemed like
 something you could do generically, maybe not on the HBASE level itself,
 but as a toolbox / utility style library.

 Is anybody on the list aware of anything useful already existing in that
 space?

 Thanks,
 Henning Blohm

 *ZFabrik Software 

Re: Performance tuning

2013-12-21 Thread James Taylor
FYI, scanner caching defaults to 1000 in Phoenix, but as folks have pointed
out, that's not relevant in this case b/c only a single row is returned
from the server for a COUNT(*) query.


On Sat, Dec 21, 2013 at 2:51 PM, Kristoffer Sjögren sto...@gmail.com wrote:

 Yeah, im doing a count(*) query on the 96 region table. Do you mean to
 check network traffic between RS?

 From debugging phoenix code I can see that there are 96 scans sent and each
 response returned to the client contains only the sum of rows, which
 are then aggregated and returned. So the traffic between client and each RS
 is very small.




 On Sat, Dec 21, 2013 at 11:35 PM, lars hofhansl la...@apache.org wrote:

  Thanks Kristoffer,
 
  yeah, that's the right metric. I would put my bet on the slower network.
  But you're also doing a select count(*) query in Phoenix, right? So
  nothing should really be sent across the network.
 
  When you do the queries, can you check whether there is any network
  traffic?
 
  -- Lars
 
 
 
  
   From: Kristoffer Sjögren sto...@gmail.com
  To: user@hbase.apache.org; lars hofhansl la...@apache.org
  Sent: Saturday, December 21, 2013 1:28 PM
  Subject: Re: Performance tuning
 
 
  @pradeep scanner caching should not be an issue since data transferred to
  the client is tiny.
 
  @lars Yes, the data might be small for this particular case :-)
 
  I have checked everything I can think of on RS (CPU, network, Hbase
  console, uptime etc) and nothing stands out, except for the pings
 (network
  pings).
   There are 5 regions on 7, 18, 19, and 23; the others have 4.
  hdfsBlocksLocalityIndex=100 on all RS (was that the correct metric?)
 
  -Kristoffer
 
 
 
 
  On Sat, Dec 21, 2013 at 9:44 PM, lars hofhansl la...@apache.org wrote:
 
   Hi Kristoffer,
   For this particular problem. Are many regions on the same
 RegionServers?
   Did you profile those RegionServers? Anything weird on that box?
   Pings slower might well be an issue. How's the data locality? (You can
   check on a RegionServer's overview page).
   If needed, you can issue a major compaction to reestablish local data
 on
   all RegionServers.
  
  
   32 cores matched with only 4G of RAM is a bit weird, but with your tiny
   dataset it doesn't matter anyway.
  
   10m rows across 96 regions is just about 100k rows per region. You
 won't
   see many of the nice properties for HBase.
   Try with 100m (or better 1bn rows). Then we're talking. For anything
  below
   this you wouldn't want to use HBase anyway.
   (100k rows I could scan on my phone with a Perl script in less than 1s)
  
  
   With ping you mean an actual network ping, or some operation on top
 of
   HBase?
  
  
   -- Lars
  
  
  
   
From: Kristoffer Sjögren sto...@gmail.com
   To: user@hbase.apache.org
   Sent: Saturday, December 21, 2013 11:17 AM
   Subject: Performance tuning
  
  
   Hi
  
   I have been performance tuning HBase 0.94.6 running Phoenix 2.2.0 the
  last
   couple of days and need some help.
  
   Background.
  
   - 23 machine cluster, 32 cores, 4GB heap per RS.
   - Table t_24 have 24 online regions (24 salt buckets).
   - Table t_96 have 96 online regions (96 salt buckets).
   - 10.5 million rows per table.
   - Count query - select (*) from ...
    - Group by query - select A, B, C, sum(D) from ... where (A = 1 and T
  >= 0
    and T <= 2147482800) group by A, B, C;
  
   What I found ultimately is that region servers 19, 20, 21, 22 and 23
   are consistently
   2-3x slower than the others. This hurts overall latency pretty bad
 since
   queries are executed in parallel on the RS and then aggregated at the
    client (through Phoenix). In Hannibal, regions are spread out evenly over
   region
    servers according to salt buckets (a phoenix feature that pre-creates regions
   and
    adds a rowkey prefix).
  
   As far as I can tell, there is no network or hardware configuration
   divergence between the machines. No CPU, network or other notable
   divergence
   in Ganglia. No RS metric differences in HBase master console.
  
    The only thing that may be of interest is that pings (within the
  cluster)
    to the
    bad RS are about 2-3x slower, around 0.050ms vs 0.130ms. Not sure if
    this is significant,
    but I get a bad feeling about it since it matches exactly with the RS
  that
    stood out in my performance tests.
  
   Any ideas of how I might find the source of this problem?
  
   Cheers,
   -Kristoffer
  
 



Re: Errors :Undefined table and DoNotRetryIOException while querying from phoenix to hbase

2013-12-14 Thread James Taylor
Mathan,
We already answered your question on the Phoenix mailing list. If you
have a follow up question, please post it there. This is not an HBase
issue.
Thanks,
James

On Dec 14, 2013, at 2:10 PM, mathan kumar immathanku...@gmail.com wrote:

 -- Forwarded message --
 From: x ...@gmail.com
 Date: Sat, Dec 14, 2013 at 10:28 AM
 Subject: Re: Errors :Undefined table and DoNotRetryIOException while
 querying from phoenix to hbase
 To: yyy...@gmail.com


 But I could drop the table in HBase using HBase shell and verified that the
 table is not listed in HBase after that.
 Even now Phoenix could not drop the table and still listing the entry while
 using !tables command.

 Where does the mapping get stored? Is there any way to delete the table
 entry from phoenix which may / may not be in HBase?

 or else can I map a HBase table with the name Table1 in HBase to table
 name Table2 in Phoenix.

 because I found that the Table1 name is frozen to the above-said errors in
 Phoenix whether or not the table is available in HBase.


 On Sat, Dec 14, 2013 at 5:58 AM, yy...@gmail.comgiacomotay...@gmail.com
 wrote:

 Sounds like an HBase bug. Have you asked on their mailing list?

 For info and restrictions on mapping an existing HBase table to Phoenix,
 see here:
 https://github.com/forcedotcom/phoenix/wiki/Phoenix-Introduction#mapping-to-an-existing-hbase-table

 Thanks,
 James




 On Fri, Dec 13, 2013 at 1:56 PM,  
 xxx...@gmail.comimmathanku...@gmail.com
 wrote:

 0: jdbc:phoenix:localhost> !tables

 +-----------+-------------+------------+--------------+---------+-----------+---------------------------+----------------+-------------+----------------+
 | TABLE_CAT | TABLE_SCHEM | TABLE_NAME | TABLE_TYPE   | REMARKS | TYPE_NAME | SELF_REFERENCING_COL_NAME | REF_GENERATION | INDEX_STATE | IMMUTABLE_ROWS |
 +-----------+-------------+------------+--------------+---------+-----------+---------------------------+----------------+-------------+----------------+
 | null      | SYSTEM      | TABLE      | SYSTEM TABLE | null    | null      | null                      | null           | null        | false          |
 | null      | null        | MyTable    | TABLE        | null    | null      | null                      | null           | null        | false          |
 +-----------+-------------+------------+--------------+---------+-----------+---------------------------+----------------+-------------+----------------+


 0: jdbc:phoenix:localhost> !dropall
 Really drop every table in the database? (y/n) y
 abort-drop-all: Aborting drop all tables.
 Error: ERROR 1012 (42M03): Table undefined. tableName=MyTable (state=
 42M03,code=1012)
 Aborting command set because "force" is false and command failed: "DROP
 TABLE MyTable;"





[ANNOUNCE] Phoenix accepted as Apache incubator

2013-12-13 Thread James Taylor
The Phoenix team is pleased to announce that Phoenix[1] has been accepted
as an Apache incubator project[2]. Over the next several weeks, we'll move
everything over to Apache and work toward our first release.

Happy to be part of the extended family.

Regards,
James

[1] https://github.com/forcedotcom/phoenix
[2] http://incubator.apache.org/projects/phoenix.html


Re: Online/Realtime query with filter and join?

2013-12-02 Thread James Taylor
I agree with Doug Meil's advice. Start with your row key design. In
Phoenix, your PRIMARY KEY CONSTRAINT defines your row key. You should lead
with the columns that you'll filter against most frequently. Then, take a
look at adding secondary indexes to speedup queries against other columns.
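
For example (a hedged sketch only; the table, columns, and connection URL below
are made up for illustration), leading the PRIMARY KEY CONSTRAINT with the most
frequently filtered column and adding a secondary index for another access path
might look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RowKeyDesignSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        Statement stmt = conn.createStatement();
        // The most frequently filtered column (customer_id) leads the row key
        stmt.execute(
            "CREATE TABLE IF NOT EXISTS orders (" +
            "  customer_id BIGINT NOT NULL," +
            "  order_date DATE NOT NULL," +
            "  order_id BIGINT NOT NULL," +
            "  total DECIMAL," +
            "  CONSTRAINT pk PRIMARY KEY (customer_id, order_date, order_id))");
        // Secondary index to speed up queries that filter on order_date alone
        stmt.execute("CREATE INDEX orders_by_date ON orders (order_date) INCLUDE (total)");
        conn.close();
    }
}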

Thanks,
James


On Mon, Dec 2, 2013 at 11:01 AM, Pradeep Gollakota pradeep...@gmail.com wrote:

 In addition to Impala and Pheonix, I'm going to throw PrestoDB into the
 mix. :)

 http://prestodb.io/


 On Mon, Dec 2, 2013 at 10:58 AM, Doug Meil doug.m...@explorysmedical.com
 wrote:

 
  You are going to want to figure out a rowkey (or a set of tables with
  rowkeys) to restrict the number of I/O's. If you just slap Impala in
 front
  of HBase (or even Phoenix, for that matter) you could write SQL against
 it
  but if it winds up doing a full scan of an HBase table underneath you
  won't get your < 100ms response time.
 
  Note:  I'm not saying you can't do this with Impala or Phoenix, I'm just
  saying start with the rowkeys first so that you limit the I/O.  Then
 start
  adding frameworks as needed (and/or build a schema with Phoenix in the
  same rowkey exercise).
 
  Such response-time requirements make me think that this is for
 application
  support, so why the requirement for SQL? Might want to start writing it
 as
  a Java program first.
 
 
 
 
 
 
 
 
 
  On 11/29/13 4:32 PM, Mourad K mourad...@gmail.com wrote:
 
  You might want to consider something like Impala or Phoenix, I presume
  you are trying to do some report query for dashboard or UI?
  MapReduce is certainly not adequate as there is too much latency on
  startup. If you want to give this a try, cdh4 and Impala are a good
 start.
  
  Mouradk
  
  On 29 Nov 2013, at 10:33, Ramon Wang ra...@appannie.com wrote:
  
   The general performance requirement for each query is less than 100
 ms,
   that's the average level. Sounds crazy, but yes we need to find a way
  for
   it.
  
   Thanks
   Ramon
  
  
   On Fri, Nov 29, 2013 at 5:01 PM, yonghu yongyong...@gmail.com
 wrote:
  
    The question is what you mean by real-time. What is your
  performance
    requirement? In my opinion, I don't think MapReduce is suitable for
    real-time data processing.
  
  
   On Fri, Nov 29, 2013 at 9:55 AM, Azuryy Yu azury...@gmail.com
 wrote:
  
    you can try phoenix.
   On 2013-11-29 3:44 PM, Ramon Wang ra...@appannie.com wrote:
  
   Hi Folks
  
    It seems to be impossible, but I still want to check if there is a
   way
    we
    can do complex queries on HBase with Order By, JOIN, etc. like
  we
    have
    with a normal RDBMS. We are asked to provide such a solution for it,
   any
    ideas? Thanks for your help.
  
   BTW, i think maybe impala from CDH would be a way to go, but
 haven't
   got
   time to check it yet.
  
   Thanks
   Ramon
  
 
 



Re: HBase Phoenix questions

2013-11-27 Thread James Taylor
Amit,
So sorry we didn't answer your question before - I'll post an answer now
over on our mailing list.
Thanks,
James


On Wed, Nov 27, 2013 at 8:46 AM, Amit Sela am...@infolinks.com wrote:

 I actually asked some of these questions in the phoenix-hbase-user
 googlegroup but never got an answer...


 On Wed, Nov 27, 2013 at 6:39 PM, Ted Yu yuzhih...@gmail.com wrote:

  Amit:
  Have you subscribed to phoenix-hbase-...@googlegroups.com ?
 
  Cheers
 
 
  On Wed, Nov 27, 2013 at 8:23 AM, Amit Sela am...@infolinks.com wrote:
 
   Hi all,
   I've read a lot of good things about Phoenix here and I have a few
   questions that maybe some of you, who already use Phoenix, can help me
   with:
  
   How does Phoenix handle pre-existing data (before it was deployed) ?
  
   Does the deployment require HBase restart or just RegionServers
 restart ?
  
   How does Phoenix handle values that are data blobs - say my value is
 not
  an
   Integer but a Writable with two members like impressions (Integer) and
   revenue (Float) ? Is it possible to add a sort of an adapter to Phoenix
  for
   such use cases ?
  
   Thanks,
  
   Amit.
  
 



Re: How to get Metadata information in Hbase

2013-11-25 Thread James Taylor
One other tool option for you is to use Phoenix. You use SQL to create a
table and define the columns through standard DDL. Your columns make up the
allowed KeyValues for your table and the metadata is surfaced through the
standard JDBC metadata APIs (with column family mapping to table catalog).
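
For example, here is a small hedged sketch of reading that metadata back through
plain JDBC (the connection URL is just an illustration):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ListColumnsSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        DatabaseMetaData md = conn.getMetaData();
        // Standard JDBC metadata call: every column of every table
        ResultSet rs = md.getColumns(null, null, "%", "%");
        while (rs.next()) {
            System.out.println(rs.getString("TABLE_NAME") + "."
                + rs.getString("COLUMN_NAME") + " : " + rs.getString("TYPE_NAME"));
        }
        conn.close();
    }
}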

Thanks,
James


On Mon, Nov 25, 2013 at 10:51 AM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 1.get the information about tables like how many tables and name of
 tables.
 You can use HBaseAdmin from the API. admin.getTableNames()

  2.get the columnfamilies name in each table.
 From the table you can get the descriptor to get the families:
 table.getTableDescriptor();

 3.get the columns names and their data types in each columnfamily.
 To get ALL the column names, there are 2 options: 1) you keep track of them in
 your application using constants... 2) you run a full-scan map reduce job.
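
 For illustration, a rough sketch of 1) and 2) with the 0.94-era client API
 (method names may differ in other versions; 3) has no metadata source, as
 noted above):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.HColumnDescriptor;
 import org.apache.hadoop.hbase.HTableDescriptor;
 import org.apache.hadoop.hbase.client.HBaseAdmin;

 public class ListTablesAndFamilies {
     public static void main(String[] args) throws Exception {
         Configuration conf = HBaseConfiguration.create();
         HBaseAdmin admin = new HBaseAdmin(conf);
         for (HTableDescriptor table : admin.listTables()) {            // 1. table names
             System.out.println("Table: " + table.getNameAsString());
             for (HColumnDescriptor family : table.getColumnFamilies()) {  // 2. families
                 System.out.println("  Family: " + family.getNameAsString());
             }
         }
         admin.close();
     }
 }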

 JM


 2013/11/25 Ted Yu yuzhih...@gmail.com

  bq. 3.get the columns names and their data types in each columnfamily.
 
  Each columnfamily can have arbitrarily many columns. There is no tool
 which
  returns the above information.
 
 
  On Tue, Nov 26, 2013 at 12:26 AM, Asaf Mesika asaf.mes...@gmail.com
  wrote:
 
   bin/hbase shell
   In there:
   Type help and you'll get along
  
   On Monday, November 25, 2013, ashishkshukladb wrote:
  
I want to get the metadata information in Hbase. My basic purpose is
  to -
   
1.get the information about tables like how many tables and name of
   tables.
   
2.get the columnfamilies name in each table.
   
3.get the columns names and their data types in each columnfamily.
   
Is there any tool or command by using which we can get the above
information
??
   
   
   
   
   
  
 



Re: HFile block size

2013-11-25 Thread James Taylor
FYI, you can define BLOCKSIZE in your hbase-site.xml, just like with HBase
to make it global.

Thanks,
James


On Mon, Nov 25, 2013 at 9:08 PM, Azuryy Yu azury...@gmail.com wrote:

 There is no way to declare a global property in Phoenix; you have to
 declare BLOCKSIZE
 in each 'create' SQL.

 such as:
 CREATE TABLE IF NOT EXISTS STOCK_SYMBOL (id INTEGER PRIMARY KEY, name VARCHAR)
  BLOOMFILTER='ROW', VERSIONS='1', BLOCKSIZE='8192'


 On Tue, Nov 26, 2013 at 12:36 PM, Job Thomas j...@suntecgroup.com wrote:

 
  Hi Jean ,
 
   Thanks You for the support.
 
  we can create a table like this: create 'xyz', {NAME => 'cf', BLOCKSIZE
 =>
  '8192'} in order to set the block size.
 
  But I am using phoenix to create and query the table. I want to globally
  declare this property so that all tables created take this property.
 
  My primary aim is to decrease the random read latency by reducing the block
  size.
 
  How can I do that?
 
  Best Regards,
  Job M Thomas
 
  
 
  From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
  Sent: Tue 11/26/2013 9:28 AM
  To: user
  Subject: Re: HFile block size
 
 
 
  From the code and the JIRAs:
 
  hbase.hregion.max.filesize is used to configure the size of a region
 (which
  can contain more than one HFile)
 
  hbase.mapreduce.hfileoutputformat.blocksize comes from HBASE-8949 (While
  writing hfiles from HFileOutputFormat forcing blocksize from table
  schema(HColumnDescriptor).
   Even if we configure hbase.mapreduce.hfileoutputformat.blocksize during
  bulkload/import it
  will be overridden with actual block size from table schema. )
 
  hfile.min.blocksize.size is the old
  hbase.mapreduce.hfileoutputformat.blocksize (See HBase-3864)
 
 
  2013/11/25 Job Thomas j...@suntecgroup.com
 
   Hi all,
  
   Out of these property , which one is used to set  HFile block size in
  hbae
   0.94.12
  
hbase.hregion.max.filesize=16384
  
hfile.min.blocksize.size=16384
  
hbase.mapreduce.hfileoutputformat.blocksize=16384
  
   Best Regards,
   Job M Thomas
  
 
 
 



Re: hbase suitable for churn analysis ?

2013-11-14 Thread James Taylor
We ingest logs using Pig to write Phoenix-compliant HFiles, load those into
HBase and then use Phoenix (https://github.com/forcedotcom/phoenix) to
query directly over the HBase data through SQL.

Regards,
James


On Thu, Nov 14, 2013 at 9:35 AM, sam wu swu5...@gmail.com wrote:

 we ingest data from log (one file/table, per event, per date) into HBase
 offline on daily basis. So we can get no_day info.
 My thoughts for churn analysis based on two types of user.
 green (young, maybe < 7 days in the system), predict churn based on first 7?
 days' activity, ideally predict while the user is still logging into the
 system, and if the churn probablity is high, reward sweets to keep them
 stay longer.
 Senior user, predict churn based on weekly? summary.

 One thought to accomplish this is to have one detailed daily table, and
 some summary (weekly?) table. new daily data get ingested into daily table.
 Once every week, summary/move some old daily data into weekly table



 On Thu, Nov 14, 2013 at 9:15 AM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

  I'm a little curious as to how you would be able to use no_of_days as a
  column qualifier at all... it changes everyday for all users right? So
 how
  will you keep your table updated?
 
 
  On Thu, Nov 14, 2013 at 9:07 AM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
   You can use your no_day as a column qualifier probably.
  
   The column families are best suitable to regroup column qualifiers with
  the
   same access (read/write) pattern. So if all your columns qualifiers
 have
   the same pattern, simply put them on the same familly.
  
   JM
  
  
   2013/11/14 sam wu swu5...@gmail.com
  
Thanks for the advice.
What about the key being userId + no_day (since the user registered), the column
   family
being each typeEvent, and the qualifier being the detailed trxs?
   
   
On Thu, Nov 14, 2013 at 8:51 AM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:
   
 Hi Sam,

 So are you saying that you will have about 30 column families? If
 so
  I
 don't think it's a good idea.

 JM


 2013/11/13 Sam Wu swu5...@gmail.com

  Hi all,
 
  I am thinking about using Random Forest to do churn analysis with
   Hbase
 as
  NoSQL data store.
  Currently, we have all the user history (basically many types of
   event
  data) residing in S3 & Redshift (we have one table per date/per
   event).
  Events include startTime, endTime, and other pertinent
   information...
 
  We are thinking about converting all the event tables into one
 fat
  table(with other helper parameter tables) with one row per user
  using
 Hbase.
 
  Each row will have user id as key, with some
  column-family/qualifier,
  e.g.: col-family, d1,d2,……d30 (days in the system), and qualifier
  as
  different types of event.  Since initially we are more interested
  in
new
  user retention, so 30 days might be good to start with.
 
  We can label a record as churning away by no activity for 10
 continuous
  days.
 
  If the data schema looks good, ingest data from S3 into HBase. Then
 use
 Random
  Forest to classify new profile data.
 
  Is this type of data a good candidate for HBase?
  Opinion is highly appreciated.
 
 
  BR
 
  Sam

   
  
 



Re: HBASE help

2013-10-28 Thread James Taylor
Take a look at Phoenix (https://github.com/forcedotcom/phoenix) which
will allow you to issue SQL to directly create tables, insert data,
and run queries over HBase using the data model described below.
Thanks,
James
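
For illustration only, here is one hedged way the Patient and Patient Name
entities below could be expressed as Phoenix tables (the names, types, and
connection URL are assumptions, not a recommended design). The address, phone,
and medication tables would follow the same (patient_id, child_id) key pattern:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PatientSchemaSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        Statement stmt = conn.createStatement();
        stmt.execute(
            "CREATE TABLE IF NOT EXISTS patient (" +
            "  patient_id BIGINT NOT NULL PRIMARY KEY," +
            "  added_by VARCHAR, gender VARCHAR, usual_gp VARCHAR)");
        // One-to-many child rows share the parent key as the leading PK column
        stmt.execute(
            "CREATE TABLE IF NOT EXISTS patient_name (" +
            "  patient_id BIGINT NOT NULL," +
            "  name_id INTEGER NOT NULL," +
            "  name_type VARCHAR, first_name VARCHAR, last_name VARCHAR, middle_name VARCHAR," +
            "  CONSTRAINT pk PRIMARY KEY (patient_id, name_id))");
        conn.close();
    }
}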

On Oct 28, 2013, at 8:47 AM, saiprabhur saiprab...@gmail.com wrote:

 Hi Folks,

 New to NoSQL, designing a data model for a primary care system. I have a normalized
 sample DB relationship model, e.g. for HBase 0.94.0:

 Patient table:
 
 1) Patient_id - PK
 2) Added_BY
 3) Gender
 4) Usual_GP

 Patient Name table: [One to many relationship with patient [One] Name[Many]]
 1) Name_id
 2) Patient_id - FK
 3) Name_type
 4) First name
 5) Last Name
 6) Middle name

 Patient address table: [One to many relationship with patient [One]
 address[Many]]
 1) Address_id
 2) Patient_id - FK
 3) Address_type
 4) Line1
 5) Line2
 6) Line 3
 7) Line 4
 8) Line 5

 Patient Phone table: [One to many relationship with patient [One]
 Phone[Many]]
 1) Phone_id
 2) Patient_id - FK
 3) PhoneType
 4) Phoneno
 5) ext

 Medication and other details
 1)Entry_id
 2)Patient_id - FK [One to many relationship with patient [One]
 Medication[Many]]
 3)Start_date
 4)End_date
 5)Code
 6)Medicine description
 7)Dosage details
 8) Number of authorised
 9) Number issued

 For the above normalised data model I have created a sample NoSQL data model
 below; I hope the data model works for document-based NoSQL. I need to convert
 the below data model into an HBase column-based data model, please help me.

 Patient :{Patient_id:22,
 Added_by:Doctor1,
 Gender:Male,
 UsualGP: Doctor2,
 PatName:[
 {NameType:Usual, FirstName:Hari, LastName:prasad,Middlename:' '},
 {NameType:Other, FirstName:John,LastName:prasad,Middlename:kenndy}
 ]
 PatAddr:[
 {AddType:Usual, Line1:2, Line2:Harrington road, Line3: near central,
 Line4:Newyork,Line5:NY008},
 {AddType:Tmp, Line1:2, Line2:Mylapore road, Line3: near Zoo,
 Line4:WashingtonDC,Line5:WA00098}
 ]
 PatPhone:[
 {PhoneType:Usual, Phoneno:4453443344, ext:099},
 {PhoneType:Tmp, Phoneno:9198332342343, ext:}
 ]
 PatMedication:[
 {MedStardate:'01/01/2013', MedEndDate:'', Code:'Snomode',
 MedDesc:'Paracetmol', DosDet:'Take 2 daily', Noauth: 5, Issue: 3},
 {MedStardate:'01/05/2013', MedEndDate:'01/05/2013', Code:'readcode',
 MedDesc:'Avil', DosDet:'Take 1 daily', Noauth: 3, Issue: 1},
 {MedStardate:'01/10/2013', MedEndDate:'24/10/13', Code:'readcode',
 MedDesc:'Metacin', DosDet:'Take 2 daily', Noauth: 5, Issue: 3},
 ]
}





Re: [ANNOUNCE] Phoenix v 2.1 released

2013-10-28 Thread James Taylor
Can't say I blame you, as it's a bit abstract. At Salesforce, we use it to
support query-more, where you want to be able to page through your data.
Without this feature, you have no way of establishing your prior position
to be able to get the next batch. This allows the client to jump right back
where they left off, with the query compiling down to setting a start row
on the scan, assuming you're navigating along your primary key or secondary
index axis (i.e. your row value constructor and order by match your primary
key constraint or your secondary indexed columns).

A second use case is if you have a composite primary key and you have a set
of say 1000 rows out of a billion for which you'd like to query. Using the
the IN syntax I outlined here:
https://github.com/forcedotcom/phoenix/wiki/Row-Value-Constructors, you can
now do this is a single query and it'll be super fast (i.e. as fast or
faster than a batched get).

Thanks,
James


On Mon, Oct 28, 2013 at 11:14 AM, Asaf Mesika asaf.mes...@gmail.com wrote:

 I couldn't get the Row Value Constructor feature.
 Do you perhaps have a real world use case to demonstrate this?

 On Friday, October 25, 2013, James Taylor wrote:

  The Phoenix team is pleased to announce the immediate availability of
  Phoenix 2.1 [1].
  More than 20 individuals contributed to the release. Here are some of the
  new features
  now available:
  * Secondary Indexing [2] to create and automatically maintain global
  indexes over your
 primary table.
 - Queries automatically use an index when more efficient, turning your
  full table scans
   into point and range scans.
 - Multiple columns may be indexed in ascending or descending sort
 order.
 - Additional primary table columns may be included in the index to
 form
  a covered
   index.
 - Available in two flavors:
  o Server-side index maintenance for mutable data.
  o Client-side index maintenance optimized for write-once,
  append-only use cases.
  * Row Value Constructors [3], a standard SQL construct to efficiently
  locate the row at
or after a composite key value.
 - Enables a query-more capability to efficiently step through your
 data.
 - Optimizes IN list of composite key values to be point gets.
  * Map-reduce based CSV Bulk Loader [4] to build Phoenix-compliant HFiles
  and load
them into HBase.
  * MD5 hash and INVERT built-in functions
 
  Phoenix 2.1 requires HBase 0.94.4 or above, with 0.94.10 or above
 required
  for mutable secondary indexing. For the best performance, we recommend
  HBase 0.94.12 or above.
 
  Regards,
 
  James
  @JamesPlusPlus
  http://phoenix-hbase.blogspot.com/
 
  [1] https://github.com/forcedotcom/phoenix/wiki/Download
  [2] https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing
  [3] https://github.com/forcedotcom/phoenix/wiki/Row-Value-Constructors
  [4]
 
 
 https://github.com/forcedotcom/phoenix/wiki/Bulk-CSV-loading-through-map-reduce
 



[ANNOUNCE] Phoenix v 2.1 released

2013-10-24 Thread James Taylor
The Phoenix team is pleased to announce the immediate availability of
Phoenix 2.1 [1].
More than 20 individuals contributed to the release. Here are some of the
new features
now available:
* Secondary Indexing [2] to create and automatically maintain global
indexes over your
   primary table.
   - Queries automatically use an index when more efficient, turning your
full table scans
 into point and range scans.
   - Multiple columns may be indexed in ascending or descending sort order.
   - Additional primary table columns may be included in the index to form
a covered
 index.
   - Available in two flavors:
o Server-side index maintenance for mutable data.
o Client-side index maintenance optimized for write-once,
append-only use cases.
* Row Value Constructors [3], a standard SQL construct to efficiently
locate the row at
  or after a composite key value.
   - Enables a query-more capability to efficiently step through your data.
   - Optimizes IN list of composite key values to be point gets.
* Map-reduce based CSV Bulk Loader [4] to build Phoenix-compliant HFiles
and load
  them into HBase.
* MD5 hash and INVERT built-in functions

Phoenix 2.1 requires HBase 0.94.4 or above, with 0.94.10 or above required
for mutable secondary indexing. For the best performance, we recommend
HBase 0.94.12 or above.

Regards,

James
@JamesPlusPlus
http://phoenix-hbase.blogspot.com/

[1] https://github.com/forcedotcom/phoenix/wiki/Download
[2] https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing
[3] https://github.com/forcedotcom/phoenix/wiki/Row-Value-Constructors
[4]
https://github.com/forcedotcom/phoenix/wiki/Bulk-CSV-loading-through-map-reduce


Re: [ANNOUNCE] Phoenix v 2.1 released

2013-10-24 Thread James Taylor
Thanks, Ted. That was a typo which I've corrected. Yes, these are
references to columns from your primary table. It should have read like
this:

CREATE INDEX my_index ON my_table (v2 DESC, v1) INCLUDE (v3)
SALT_BUCKETS=10, DATA_BLOCK_ENCODING='NONE';


On Thu, Oct 24, 2013 at 5:40 PM, Ted Yu yuzhih...@gmail.com wrote:

 From https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing :
 Is date_col a column from data table ?

 CREATE INDEX my_index ON my_table (date_col DESC, v1) INCLUDE (v3)
 SALT_BUCKETS=10, DATA_BLOCK_ENCODING='NONE';



 On Thu, Oct 24, 2013 at 5:24 PM, James Taylor jtay...@salesforce.com
 wrote:

  The Phoenix team is pleased to announce the immediate availability of
  Phoenix 2.1 [1].
  More than 20 individuals contributed to the release. Here are some of the
  new features
  now available:
  * Secondary Indexing [2] to create and automatically maintain global
  indexes over your
 primary table.
 - Queries automatically use an index when more efficient, turning your
  full table scans
   into point and range scans.
 - Multiple columns may be indexed in ascending or descending sort
 order.
 - Additional primary table columns may be included in the index to
 form
  a covered
   index.
 - Available in two flavors:
  o Server-side index maintenance for mutable data.
  o Client-side index maintenance optimized for write-once,
  append-only use cases.
  * Row Value Constructors [3], a standard SQL construct to efficiently
  locate the row at
or after a composite key value.
 - Enables a query-more capability to efficiently step through your
 data.
 - Optimizes IN list of composite key values to be point gets.
  * Map-reduce based CSV Bulk Loader [4] to build Phoenix-compliant HFiles
  and load
them into HBase.
  * MD5 hash and INVERT built-in functions
 
  Phoenix 2.1 requires HBase 0.94.4 or above, with 0.94.10 or above
 required
  for mutable secondary indexing. For the best performance, we recommend
  HBase 0.94.12 or above.
 
  Regards,
 
  James
  @JamesPlusPlus
  http://phoenix-hbase.blogspot.com/
 
  [1] https://github.com/forcedotcom/phoenix/wiki/Download
  [2] https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing
  [3] https://github.com/forcedotcom/phoenix/wiki/Row-Value-Constructors
  [4]
 
 
 https://github.com/forcedotcom/phoenix/wiki/Bulk-CSV-loading-through-map-reduce
 



Re: row filter - binary comparator at certain range

2013-10-21 Thread James Taylor
Take a look at Phoenix (https://github.com/forcedotcom/phoenix). It supports
both salting and fuzzy row filtering through its skip scan.
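
For example (a sketch with made-up names and bucket count), Phoenix adds and
strips the salt byte for you, and a range predicate on the leading timestamp
column is fanned out across every bucket by the skip scan:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SaltedTimeRangeSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        conn.createStatement().execute(
            "CREATE TABLE IF NOT EXISTS events (" +
            "  ts BIGINT NOT NULL, id BIGINT NOT NULL," +
            "  CONSTRAINT pk PRIMARY KEY (ts, id)) SALT_BUCKETS=8");
        PreparedStatement stmt = conn.prepareStatement(
            "SELECT ts, id FROM events WHERE ts >= ? AND ts <= ?");
        stmt.setLong(1, 15L);
        stmt.setLong(2, 25L);
        ResultSet rs = stmt.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getLong(1) + " " + rs.getLong(2));
        }
        conn.close();
    }
}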


On Sun, Oct 20, 2013 at 10:42 PM, Premal Shah premal.j.s...@gmail.com wrote:

 Have you looked at FuzzyRowFilter? Seems to me that it might satisfy your
 use-case.

 http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/


 On Sun, Oct 20, 2013 at 9:31 PM, Tony Duan duanjian...@126.com wrote:

  Alex Vasilenko aa.vasilenko@... writes:
 
  
   Lars,
  
   But how it will behave, when I have salt at the beginning of the key to
   properly shard table across regions? Imagine row key of format
   salt:timestamp and rows goes like this:
   ...
   1:15
   1:16
   1:17
   1:23
   2:3
   2:5
   2:12
   2:15
   2:19
   2:25
   ...
  
   And I want to find all rows, that has second part (timestamp) in range
   15-25. What startKey and endKey should be used?
  
   Alexandr Vasilenko
   Web Developer
   Skype:menterr
   mob: +38097-611-45-99
  
   2012/2/9 lars hofhansl lhofhansl@...
  Hi,
  Alexandr Vasilenko
  Have you ever resolved this issue? I am also facing this issue.
  I also want to implement this functionality.
  Imagine row key of format
  salt:timestamp and rows goes like this:
  ...
  1:15
  1:16
  1:17
  1:23
  2:3
  2:5
  2:12
  2:15
  2:19
  2:25
  ...
 
  And I want to find all rows, that has second part (timestamp) in range
  15-25.
 
  Could you please tell me how you resolve this ?
  thanks  in advance.
 
 
  Tony duan
 
 


 --
 Regards,
 Premal Shah.



Re: row filter - binary comparator at certain range

2013-10-21 Thread James Taylor
Phoenix restricts salting to a single byte.
Salting perhaps is misnamed, as the salt byte is a stable hash based on the
row key.
Phoenix's skip scan supports sub-key ranges.
We've found salting in general to be faster (though there are cases where
it's not), as it ensures better parallelization.

Regards,
James



On Mon, Oct 21, 2013 at 9:14 AM, Vladimir Rodionov
vrodio...@carrieriq.comwrote:

 FuzzyRowFilter does not work on sub-key ranges.
 Salting is bad for any scan operation, unfortunately. When salt prefix
 cardinality is small (1-2 bytes),
 one can try something similar to FuzzyRowFilter but with additional
 sub-key range support.
  If salt prefix cardinality is high (> 2 bytes) - do a full scan with your
 own Filter (for timestamp ranges).

 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com

 
 From: Premal Shah [premal.j.s...@gmail.com]
 Sent: Sunday, October 20, 2013 10:42 PM
 To: user
 Subject: Re: row filter - binary comparator at certain range

 Have you looked at FuzzyRowFilter? Seems to me that it might satisfy your
 use-case.

 http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/


 On Sun, Oct 20, 2013 at 9:31 PM, Tony Duan duanjian...@126.com wrote:

  Alex Vasilenko aa.vasilenko@... writes:
 
  
   Lars,
  
   But how it will behave, when I have salt at the beginning of the key to
   properly shard table across regions? Imagine row key of format
   salt:timestamp and rows goes like this:
   ...
   1:15
   1:16
   1:17
   1:23
   2:3
   2:5
   2:12
   2:15
   2:19
   2:25
   ...
  
   And I want to find all rows, that has second part (timestamp) in range
   15-25. What startKey and endKey should be used?
  
   Alexandr Vasilenko
   Web Developer
   Skype:menterr
   mob: +38097-611-45-99
  
   2012/2/9 lars hofhansl lhofhansl@...
  Hi,
  Alexandr Vasilenko
  Have you ever resolved this issue?i am also facing this iusse.
  i also want implement this functionality.
  Imagine row key of format
  salt:timestamp and rows goes like this:
  ...
  1:15
  1:16
  1:17
  1:23
  2:3
  2:5
  2:12
  2:15
  2:19
  2:25
  ...
 
  And I want to find all rows, that has second part (timestamp) in range
  15-25.
 
  Could you please tell me how you resolve this ?
  thanks  in advance.
 
 
  Tony duan
 
 


 --
 Regards,
 Premal Shah.

 Confidentiality Notice:  The information contained in this message,
 including any attachments hereto, may be confidential and is intended to be
 read only by the individual or entity to whom this message is addressed. If
 the reader of this message is not the intended recipient or an agent or
 designee of the intended recipient, please note that any review, use,
 disclosure or distribution of this message or its attachments, in any form,
 is strictly prohibited.  If you have received this message in error, please
 immediately notify the sender and/or notificati...@carrieriq.com and
 delete or destroy any copy of this message and its attachments.



Re: row filter - binary comparator at certain range

2013-10-21 Thread James Taylor
What do you think it should be called, because
prepending-row-key-with-single-hashed-byte doesn't have a very good ring
to it. :-)

Agree that getting the row key design right is crucial.

The range of prepending-row-key-with-single-hashed-byte is declarative
when you create your table in Phoenix, so you typically declare an upper
bound based on your cluster size (not 255, but maybe 8 or 16). We've run
the numbers and it's typically faster, but as with most things, not always.

HTH,
James


On Mon, Oct 21, 2013 at 1:05 PM, Michael Segel msegel_had...@hotmail.com wrote:

 Then it's not a SALT. And please don't use the term 'salt' because it has a
 specific meaning outside of what you want it to mean.  Just like saying
 HBase has ACID because you write the entire row as an atomic element.  But
 I digress….

 Ok so to your point…

 1 byte == 255 possible values.

 So which will be faster.

 creating a list of the 1 byte truncated hash of each possible timestamp in
 your range, or doing 255 separate range scans with the start and stop range
 key set?

 That will give you the results you want, however… I'd go back and have
 them possibly rethink the row key if they can … assuming this is the base
 access pattern.

 HTH

 -Mike





 On Oct 21, 2013, at 11:37 AM, James Taylor jtay...@salesforce.com wrote:

  Phoenix restricts salting to a single byte.
  Salting perhaps is misnamed, as the salt byte is a stable hash based on
 the
  row key.
  Phoenix's skip scan supports sub-key ranges.
  We've found salting in general to be faster (though there are cases where
  it's not), as it ensures better parallelization.
 
  Regards,
  James
 
 
 
  On Mon, Oct 21, 2013 at 9:14 AM, Vladimir Rodionov
  vrodio...@carrieriq.comwrote:
 
  FuzzyRowFilter does not work on sub-key ranges.
  Salting is bad for any scan operation, unfortunately. When salt prefix
  cardinality is small (1-2 bytes),
  one can try something similar to FuzzyRowFilter but with additional
  sub-key range support.
  If salt prefix cardinality is high ( 2 bytes) - do a full scan with
 your
  own Filter (for timestamp ranges).
 
  Best regards,
  Vladimir Rodionov
  Principal Platform Engineer
  Carrier IQ, www.carrieriq.com
  e-mail: vrodio...@carrieriq.com
 
  
  From: Premal Shah [premal.j.s...@gmail.com]
  Sent: Sunday, October 20, 2013 10:42 PM
  To: user
  Subject: Re: row filter - binary comparator at certain range
 
  Have you looked at FuzzyRowFilter? Seems to me that it might satisfy
 your
  use-case.
 
 
 http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
 
 
  On Sun, Oct 20, 2013 at 9:31 PM, Tony Duan duanjian...@126.com wrote:
 
  Alex Vasilenko aa.vasilenko@... writes:
 
 
  Lars,
 
  But how it will behave, when I have salt at the beginning of the key
 to
  properly shard table across regions? Imagine row key of format
  salt:timestamp and rows goes like this:
  ...
  1:15
  1:16
  1:17
  1:23
  2:3
  2:5
  2:12
  2:15
  2:19
  2:25
  ...
 
  And I want to find all rows, that has second part (timestamp) in range
  15-25. What startKey and endKey should be used?
 
  Alexandr Vasilenko
  Web Developer
  Skype:menterr
  mob: +38097-611-45-99
 
  2012/2/9 lars hofhansl lhofhansl@...
  Hi,
  Alexandr Vasilenko
  Have you ever resolved this issue?i am also facing this iusse.
  i also want implement this functionality.
  Imagine row key of format
  salt:timestamp and rows goes like this:
  ...
  1:15
  1:16
  1:17
  1:23
  2:3
  2:5
  2:12
  2:15
  2:19
  2:25
  ...
 
  And I want to find all rows, that has second part (timestamp) in range
  15-25.
 
  Could you please tell me how you resolve this ?
  thanks  in advance.
 
 
  Tony duan
 
 
 
 
  --
  Regards,
  Premal Shah.
 
  Confidentiality Notice:  The information contained in this message,
  including any attachments hereto, may be confidential and is intended
 to be
  read only by the individual or entity to whom this message is
 addressed. If
  the reader of this message is not the intended recipient or an agent or
  designee of the intended recipient, please note that any review, use,
  disclosure or distribution of this message or its attachments, in any
 form,
  is strictly prohibited.  If you have received this message in error,
 please
  immediately notify the sender and/or notificati...@carrieriq.com and
  delete or destroy any copy of this message and its attachments.
 




Re: row filter - binary comparator at certain range

2013-10-21 Thread James Taylor
We don't truncate the hash, we mod it. Why would you expect that data
wouldn't be evenly distributed? We've not seen this to be the case.



On Mon, Oct 21, 2013 at 1:48 PM, Michael Segel msegel_had...@hotmail.com wrote:

 What do you call hashing the row key?
 Or hashing the row key and then appending the row key to the hash?
 Or hashing the row key, truncating the hash value to some subset and then
 appending the row key to the value?

 The problem is that there is specific meaning to the term salt. Re-using
 it here will cause confusion because you're implying something you don't
 mean to imply.

 you could say prepend a truncated hash of the key, however… is prepend a
 real word? ;-) (I am sorry, I am not a grammar nazi, nor an English major. )

 So even outside of Phoenix, the concept is the same.
 Even with a truncated hash, you will find that over time, all but the tail
 N regions will only be half full.
 This could be both good and bad.

 (Where N is your number 8 or 16 allowable hash values.)

 You've solved potentially one problem… but still have other issues that
 you need to address.
 I guess the simple answer is to double the region sizes and not care that
 most of your regions will be 1/2 the max size…  but the size you really
 want and 8-16 regions will be up to twice as big.



 On Oct 21, 2013, at 3:26 PM, James Taylor jtay...@salesforce.com wrote:

  What do you think it should be called, because
  prepending-row-key-with-single-hashed-byte doesn't have a very good
 ring
  to it. :-)
 
  Agree that getting the row key design right is crucial.
 
  The range of prepending-row-key-with-single-hashed-byte is declarative
  when you create your table in Phoenix, so you typically declare an upper
  bound based on your cluster size (not 255, but maybe 8 or 16). We've run
  the numbers and it's typically faster, but as with most things, not
 always.
 
  HTH,
  James
 
 
  On Mon, Oct 21, 2013 at 1:05 PM, Michael Segel 
 msegel_had...@hotmail.comwrote:
 
  Then its not a SALT. And please don't use the term 'salt' because it has
  specific meaning outside to what you want it to mean.  Just like saying
  HBase has ACID because you write the entire row as an atomic element.
  But
  I digress….
 
  Ok so to your point…
 
  1 byte == 255 possible values.
 
  So which will be faster.
 
  creating a list of the 1 byte truncated hash of each possible timestamp
 in
  your range, or doing 255 separate range scans with the start and stop
 range
  key set?
 
  That will give you the results you want, however… I'd go back and have
  them possibly rethink the row key if they can … assuming this is the
 base
  access pattern.
 
  HTH
 
  -Mike
 
 
 
 
 
  On Oct 21, 2013, at 11:37 AM, James Taylor jtay...@salesforce.com
 wrote:
 
  Phoenix restricts salting to a single byte.
  Salting perhaps is misnamed, as the salt byte is a stable hash based on
  the
  row key.
  Phoenix's skip scan supports sub-key ranges.
  We've found salting in general to be faster (though there are cases
 where
  it's not), as it ensures better parallelization.
 
  Regards,
  James
 
 
 
  On Mon, Oct 21, 2013 at 9:14 AM, Vladimir Rodionov
  vrodio...@carrieriq.comwrote:
 
  FuzzyRowFilter does not work on sub-key ranges.
  Salting is bad for any scan operation, unfortunately. When salt prefix
  cardinality is small (1-2 bytes),
  one can try something similar to FuzzyRowFilter but with additional
  sub-key range support.
  If salt prefix cardinality is high ( 2 bytes) - do a full scan with
  your
  own Filter (for timestamp ranges).
 
  Best regards,
  Vladimir Rodionov
  Principal Platform Engineer
  Carrier IQ, www.carrieriq.com
  e-mail: vrodio...@carrieriq.com
 
  
  From: Premal Shah [premal.j.s...@gmail.com]
  Sent: Sunday, October 20, 2013 10:42 PM
  To: user
  Subject: Re: row filter - binary comparator at certain range
 
  Have you looked at FuzzyRowFilter? Seems to me that it might satisfy
  your
  use-case.
 
 
 
 http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
 
 
  On Sun, Oct 20, 2013 at 9:31 PM, Tony Duan duanjian...@126.com
 wrote:
 
  Alex Vasilenko aa.vasilenko@... writes:
 
 
  Lars,
 
  But how it will behave, when I have salt at the beginning of the key
  to
  properly shard table across regions? Imagine row key of format
  salt:timestamp and rows goes like this:
  ...
  1:15
  1:16
  1:17
  1:23
  2:3
  2:5
  2:12
  2:15
  2:19
  2:25
  ...
 
  And I want to find all rows, that has second part (timestamp) in
 range
  15-25. What startKey and endKey should be used?
 
  Alexandr Vasilenko
  Web Developer
  Skype:menterr
  mob: +38097-611-45-99
 
  2012/2/9 lars hofhansl lhofhansl@...
  Hi,
  Alexandr Vasilenko
  Have you ever resolved this issue?i am also facing this iusse.
  i also want implement this functionality.
  Imagine row key of format
  salt:timestamp and rows goes like this:
  ...
  1:15
  1:16
  1:17
  1

Re: row filter - binary comparator at certain range

2013-10-21 Thread James Taylor
One thing I neglected to mention is that the table is pre-split at the
prepending-row-key-with-single-hashed-byte boundaries, so the expectation
is that you'd allocate enough buckets that you don't end up needing to
splitting the regions. But if you under allocate (i.e. allocate too small a
SALT_BUCKETS value), then I see your point.

Thanks,
James


On Mon, Oct 21, 2013 at 5:58 PM, Michael Segel michael_se...@hotmail.com wrote:

 James,

 It's evenly distributed, however... because it's a time stamp, it's a 'tail
 end charlie' addition.
 So when you split a region, the top half is never added to, so you end up
 with all regions half filled except for the last region in each 'modded'
 value.

 I wouldn't say its a bad thing if you plan for it.

 On Oct 21, 2013, at 5:07 PM, James Taylor jtay...@salesforce.com wrote:

  We don't truncate the hash, we mod it. Why would you expect that data
  wouldn't be evenly distributed? We've not seen this to be the case.
 
 
 
  On Mon, Oct 21, 2013 at 1:48 PM, Michael Segel 
 msegel_had...@hotmail.comwrote:
 
  What do you call hashing the row key?
  Or hashing the row key and then appending the row key to the hash?
  Or hashing the row key, truncating the hash value to some subset and
 then
  appending the row key to the value?
 
  The problem is that there is specific meaning to the term salt. Re-using
  it here will cause confusion because you're implying something you don't
  mean to imply.
 
  you could say prepend a truncated hash of the key, however… is prepend a
  real word? ;-) (I am sorry, I am not a grammar nazi, nor an English
 major. )
 
  So even outside of Phoenix, the concept is the same.
  Even with a truncated hash, you will find that over time, all but the
 tail
  N regions will only be half full.
  This could be both good and bad.
 
  (Where N is your number 8 or 16 allowable hash values.)
 
  You've solved potentially one problem… but still have other issues that
  you need to address.
  I guess the simple answer is to double the region sizes and not care
 that
  most of your regions will be 1/2 the max size…  but the size you really
  want and 8-16 regions will be up to twice as big.
 
 
 
  On Oct 21, 2013, at 3:26 PM, James Taylor jtay...@salesforce.com
 wrote:
 
  What do you think it should be called, because
  prepending-row-key-with-single-hashed-byte doesn't have a very good
  ring
  to it. :-)
 
  Agree that getting the row key design right is crucial.
 
  The range of prepending-row-key-with-single-hashed-byte is
 declarative
  when you create your table in Phoenix, so you typically declare an
 upper
  bound based on your cluster size (not 255, but maybe 8 or 16). We've
 run
  the numbers and it's typically faster, but as with most things, not
  always.
 
  HTH,
  James
 
 
  On Mon, Oct 21, 2013 at 1:05 PM, Michael Segel 
  msegel_had...@hotmail.comwrote:
 
  Then its not a SALT. And please don't use the term 'salt' because it
 has
  specific meaning outside to what you want it to mean.  Just like
 saying
  HBase has ACID because you write the entire row as an atomic element.
  But
  I digress….
 
  Ok so to your point…
 
  1 byte == 255 possible values.
 
  So which will be faster.
 
  creating a list of the 1 byte truncated hash of each possible
 timestamp
  in
  your range, or doing 255 separate range scans with the start and stop
  range
  key set?
 
  That will give you the results you want, however… I'd go back and have
  them possibly rethink the row key if they can … assuming this is the
  base
  access pattern.
 
  HTH
 
  -Mike
 
 
 
 
 
  On Oct 21, 2013, at 11:37 AM, James Taylor jtay...@salesforce.com
  wrote:
 
  Phoenix restricts salting to a single byte.
  Salting perhaps is misnamed, as the salt byte is a stable hash based
 on
  the
  row key.
  Phoenix's skip scan supports sub-key ranges.
  We've found salting in general to be faster (though there are cases
  where
  it's not), as it ensures better parallelization.
 
  Regards,
  James
 
 
 
  On Mon, Oct 21, 2013 at 9:14 AM, Vladimir Rodionov
  vrodio...@carrieriq.comwrote:
 
  FuzzyRowFilter does not work on sub-key ranges.
  Salting is bad for any scan operation, unfortunately. When salt
 prefix
  cardinality is small (1-2 bytes),
  one can try something similar to FuzzyRowFilter but with additional
  sub-key range support.
  If salt prefix cardinality is high ( 2 bytes) - do a full scan with
  your
  own Filter (for timestamp ranges).
 
  Best regards,
  Vladimir Rodionov
  Principal Platform Engineer
  Carrier IQ, www.carrieriq.com
  e-mail: vrodio...@carrieriq.com
 
  
  From: Premal Shah [premal.j.s...@gmail.com]
  Sent: Sunday, October 20, 2013 10:42 PM
  To: user
  Subject: Re: row filter - binary comparator at certain range
 
  Have you looked at FuzzyRowFilter? Seems to me that it might satisfy
  your
  use-case.
 
 
 
 
 http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes

Re: Write TimeSeries Data and Do Time Based Range Scans

2013-09-24 Thread James Taylor
Hey Anil,
The solution you've described is the best we've found for Phoenix (inspired
by the work of Alex at Sematext).
You can do all of this in a few lines of SQL:

CREATE TABLE event_data(
who VARCHAR, type SMALLINT, id BIGINT, when DATE, payload VARBINARY
CONSTRAINT pk PRIMARY KEY (who, type, id))
IMMUTABLE_ROWS=true;  // Declare event table as having immutable rows
CREATE INDEX event_data_index ON event_data(when, type, who)
INCLUDE(payload)
SALT_BUCKETS=10;  // Salt the index since it'll create write hotspots
otherwise

The following query would display event count per type across all users
over the last week.
It would automatically use the index:

SELECT type, count(*) FROM event_data WHERE when  CURRENT_DATE() - 7 GROUP
BY type

The following query would display the event count by type for a particular
user. It would
automatically use the data table:

SELECT who, type, count(*) FROM event_data WHERE who = ? GROUP BY who, type;

As far as the read cost associated with reading from a salted table, we've
found in most cases
it actually performs better, because you get better parallelization. The
case where it performs
worse is on a selective query that returns a smallish set of rows that
normally would be in the same
block. In this case, you're reading an entire block for each row, where in
the worst case these
would be neighbors in the same block on an unsalted table.

HTH,

James

On Tue, Sep 24, 2013 at 8:12 AM, anil gupta anilgupt...@gmail.com wrote:

 Inline

 On Mon, Sep 23, 2013 at 6:15 PM, Shahab Yunus shahab.yu...@gmail.com
 wrote:

  Yeah, I saw that. In fact that is why I recommended that to you as I
  couldn't infer from your email that whether you have already gone through
  that source or not.

 Yes, i was aware of that article. But my read pattern is slighty different
 from that article.We are using HBase as DataSource for a RestFul service.
 Even though if my range scan finds 400 rows with a specified timerange. I
 only return top 20 for one rest request. So, if in case i do bucketing(lets
 say bucket=10) then i will need to fetch 20 results from each bucket and
 then i will have to do a merge sort on the client size and return final 20.
 You can assume that i need to return the 20rows sorted by timestamp.




  A source, who did the exact same thing and discuss it
  in much more detail and concerns aligning with yours (in fact I think
 some
  of the authors/creators of that link/group are members here of this
  community as well.)

 Do you know what the outcome of their experiment? Do you have any link for
 that? Thanks for your time and help.


 
  Regards,
  Shahab
 
 
  On Mon, Sep 23, 2013 at 8:41 PM, anil gupta anilgupt...@gmail.com
 wrote:
 
   Hi Shahab,
  
   If you read my solution carefully. I am already doing that.
  
   Thanks,
   Anil Gupta
  
  
   On Mon, Sep 23, 2013 at 3:51 PM, Shahab Yunus shahab.yu...@gmail.com
   wrote:
  
   
   
  
 
 http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
   
Here you can find the discussion, trade-offs and working code/API
 (even
   for
M/R) about this and the approach you are trying out.
   
Regards,
Shahab
   
   
On Mon, Sep 23, 2013 at 5:41 PM, anil gupta anilgupt...@gmail.com
   wrote:
   
 Hi All,

 I have a secondary index(inverted index) table with a rowkey on the
   basis
 of Timestamp of an event. Assume the rowkey as TimeStamp in
 Epoch.
 I also store some extra(apart from main_table rowkey) columns in
 that
table
 for doing filtering.

 The requirement is to do range-based scan on the basis of time of
 event.  Hence, the index with this rowkey.
 I cannot use Hashing or MD5 digest solution because then i cannot
 do
range
 based scans.  And, i already have a index like OpenTSDB in another
   table
 for the same dataset.(I have many secondary Index for same data
 set.)

 Problem: When we increase the write workload during stress test.
 Time
 secondary index becomes a bottleneck due to the famous Region
   HotSpotting
 problem.
 Solution: I am thinking of adding a prefix of { (TimeStamp in
Epoch%10) =
 bucket}  in the rowkey. Then my row key will become:
  BucketTimeStamp in Epoch
 By using above rowkey i can at least alleviate *WRITE* problem.(i
  don't
 think problem can be fixed permanently because of the use case
requirement.
 I would love to be proven wrong.)
 However, with the above row key, now when i want to *READ* data,
 for
every
 single range scans i have to read data from 10 different regions.
  This
 extra load for read is scaring me a bit.

 I am wondering if anyone has better suggestion/approach to solve
 this
 problem given the constraints i have.  Looking for feedback from
community.

 --
 Thanks  Regards,
 Anil Gupta

   
  
  
  
   --
   Thanks  

Re: deploy saleforce phoenix coprocessor to hbase/lib??

2013-09-11 Thread James Taylor
Tian-Ying,
A Phoenix table is an HBase table. At create time, if the HBase table
doesn't exist, we create it initially with the right metadata (so no alter
table is necessary). If the HBase table already exists, then we compare the
existing table meta with the expected table metadata. If it's different,
then we issue an alter table.

You need to restart RS after the deploy of the jar under hbase/lib. Since
the RS is already running, it won't have the Phoenix jar on the classpath
yet (as it wasn't there when you started it). If/when we move to the model
of storing the phoenix jar in HDFS, then you won't have to restart the
first time you deploy. However, for any upgrade to the Phoenix jar, you
will need to restart since that's currently the only way to unload the old
jar and load the new jar.

Thanks,
James


On Wed, Sep 11, 2013 at 11:37 AM, Tianying Chang tich...@ebaysf.com wrote:

 James, thanks for the explain.

 So my understanding is the Phoenix wraps around HBase client API to create
 a  Pheonix table. Within this wrapper, it will call a alter table with
 the coprocessor when it create a phoenix table, right?

 Also, do we need to restart RS after deploy the jar under hbase/lib? Our
 customers said it has to. But I feel it is unnecessary and weird. Can you
 confirm?

 Thanks
 Tian-Ying

 -Original Message-
 From: James Taylor [mailto:jtay...@salesforce.com]
 Sent: Tuesday, September 10, 2013 4:40 PM
 To: user@hbase.apache.org
 Subject: Re: deploy saleforce phoenix coprocessor to hbase/lib??

 When a table is created with Phoenix, its HBase table is configured with
 the Phoenix coprocessors. We do not specify a jar path, so the Phoenix jar
 that contains the coprocessor implementation classes must be on the
 classpath of the region server.

 In addition to coprocessors, Phoenix relies on custom filters which are
 also in the Phoenix jar. In theory you could put the jar in HDFS, use the
 relatively new HBase feature to load custom filters from HDFS, and issue
 alter table calls for existing Phoenix HBase tables to reconfigure the
 coprocessors. When new Phoenix tables are created, though, they wouldn't
 have this jar path.

 FYI, we're looking into modifying our install procedure to do the above
 (see https://github.com/forcedotcom/phoenix/issues/216), if folks are
 interested in contributing.

 Thanks,
 James

 On Sep 10, 2013, at 2:41 PM, Tianying Chang tich...@ebaysf.com wrote:

  Hi,
 
  Since this is not a hbase system level jar, instead, it is more like
 user code, should we deploy it under hbase/lib?  It seems we can use
 alter to add the coprocessor for a particular user table.  So I can put
 the jar file any place that is accessible, e.g. hdfs:/myPath?
 
  My customer said, there is no need to run 'aler' command. Instead, as
 long as I put the jar into hbase/lib, then when phoenix client make read
 call, it will add the the coprocessor attr into that table being read. It
 is kind of suspicious. Does the phoenix client call a alter under cover
 for the client  already?
 
  Anyone knows about this?
 
  Thanks
  Tian-Ying



Re: deploy saleforce phoenix coprocessor to hbase/lib??

2013-09-10 Thread James Taylor
When a table is created with Phoenix, its HBase table is configured
with the Phoenix coprocessors. We do not specify a jar path, so the
Phoenix jar that contains the coprocessor implementation classes must
be on the classpath of the region server.

In addition to coprocessors, Phoenix relies on custom filters which
are also in the Phoenix jar. In theory you could put the jar in HDFS,
use the relatively new HBase feature to load custom filters from HDFS,
and issue alter table calls for existing Phoenix HBase tables to
reconfigure the coprocessors. When new Phoenix tables are created,
though, they wouldn't have this jar path.

FYI, we're looking into modifying our install procedure to do the
above (see https://github.com/forcedotcom/phoenix/issues/216), if
folks are interested in contributing.

Thanks,
James

On Sep 10, 2013, at 2:41 PM, Tianying Chang tich...@ebaysf.com wrote:

 Hi,

 Since this is not a hbase system level jar, instead, it is more like user 
 code, should we deploy it under hbase/lib?  It seems we can use alter to 
 add the coprocessor for a particular user table.  So I can put the jar file 
 any place that is accessible, e.g. hdfs:/myPath?

 My customer said, there is no need to run 'aler' command. Instead, as long as 
 I put the jar into hbase/lib, then when phoenix client make read call, it 
 will add the the coprocessor attr into that table being read. It is kind of 
 suspicious. Does the phoenix client call a alter under cover for the client 
  already?

 Anyone knows about this?

 Thanks
 Tian-Ying


Re: 答复: Fastest way to get count of records in huge hbase table?

2013-09-10 Thread James Taylor
Use Phoenix (https://github.com/forcedotcom/phoenix) by doing the following:
CREATE VIEW myHTableName (key VARBINARY NOT NULL PRIMARY KEY);
SELECT COUNT(*) FROM myHTableName;

As fenghong...@xiaomi.com said, you still need to scan the table, but
Phoenix will do it in parallel and use a coprocessor and an internal
scanner API to speed things up.

Thanks,
James
@JamesPlusPlus


On Tue, Sep 10, 2013 at 7:01 PM, 冯宏华 fenghong...@xiaomi.com wrote:

 No fast way to get the count of records of a table without scanning and
 counting, especially when you want to get the accurate count. By design the
 data/cells of a same record/row can scatter in many different HFiles and
 memstore, so even we can record the count of records of each HFile as meta
 in FileInfo, we still need to de-dup to get the accurate total count, which
 only can be achieved by scanning.
 
 发件人: Ramasubramanian Narayanan [ramasubramanian.naraya...@gmail.com]
 发送时间: 2013年9月10日 16:07
 收件人: user@hbase.apache.org
 主题: Fastest way to get count of records in huge hbase table?

 Dear All,

 Is there any fastest way to get the count of records in a huge HBASE table
 with billions of records?

 The normal count command is running for a hour with this huge volume of
 data..

 regards,
 Rams



Re: Concurrent connections to Hbase

2013-09-05 Thread James Taylor
Hey Kiru,
The Phoenix team would be happy to work with you to benchmark your
performance if you can give us specifics about your schema design, queries,
and data sizes. We did something similar for Sudarshan for a Bloomberg use
case here[1].

Thanks,
James

[1]. http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/34697


On Thu, Sep 5, 2013 at 10:28 AM, Kiru Pakkirisamy kirupakkiris...@yahoo.com
 wrote:

 Hi All,
 I'd like to hear from users who are running a  big HBase setup with
 multiple concurrent connections.
 Woud like to know the -# of cores/machines, # of queries. Get/RPCs , Hbase
 version etc.
 We are trying to build an application with sub-second query performance
 (using coprocessors)  and want to scale it out to 10s of thousands of
 concurrent queries. We are now at 500-600 do see a bug like HBASE-9410
 Any positive/negative experiences in a similar situation ?


 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com


Re: HBase - stable versions

2013-09-04 Thread James Taylor
+1 to what Nicolas said.

That goes for Phoenix as well. It's open source too. We do plan to port to
0.96 when our user community (Salesforce.com, of course, being one of them)
demands it.

Thanks,
James


On Wed, Sep 4, 2013 at 10:11 AM, Nicolas Liochon nkey...@gmail.com wrote:

 It's open source. My personal point of view is that if someone is willing
 to spend time on the backport, there should be no issue, if the regression
 risk is clearly acceptable  the rolling restart possible. If it's
 necessary (i.e. there is no agreement of the risk level), then we could as
 well go for a 94.12.1 solution. I don't think we need to create this branch
 now: this branch should be created on when and if we cannot find an
 agreement on a specific jira.

 Nicolas



 On Wed, Sep 4, 2013 at 6:53 PM, lars hofhansl la...@apache.org wrote:

  I should also explicitly state that we (Salesforce) will stay with 0.94
  for the foreseeable future.
 
  We will continue backport fixes that we need. If those are not acceptable
  or accepted into the open source 0.94 branch, they will have to go into
 an
  Salesforce internal repository.
  I would really like to avoid that (essentially a fork), so I would offer
  to start having stable tags, i.e. we keep making changes in 0.94.x, and
  declare (say) 0.94.12 stable and have 0.94.12.1, etc, releases (much like
  what is done in Linux)
 
  We also currently have no resources to port Phoenix over to 0.96 (but if
  somebody wanted to step up, that would be greatly appreciated, of
 course).
 
  Thoughts? Comments? Concerns?
 
  -- Lars
 
 
  - Original Message -
  From: lars hofhansl la...@apache.org
  To: hbase-dev d...@hbase.apache.org; hbase-user user@hbase.apache.org
  Cc:
  Sent: Tuesday, September 3, 2013 5:30 PM
  Subject: HBase - stable versions
 
  With 0.96 being imminent we should start a discussion about continuing
  support for 0.94.
 
  0.92 became stale pretty soon after 0.94 was released.
  The relationship between 0.94 and 0.96 is slightly different, though:
 
  1. 0.92.x could be upgraded to 0.94.x without downtime
  2. 0.92 clients and servers are mutually compatible with 0.94 clients and
  servers
  3. the user facing API stayed backward compatible
 
  None of the above is true when moving from 0.94 to 0.96+.
  Upgrade from 0.94 to 0.96 will require a one-way upgrade process
 including
  downtime, and client and server need to be upgraded in lockstep.
 
  I would like to have an informal poll about who's using 0.94 and is
  planning to continue to use it; and who is planning to upgrade from 0.94
 to
  0.96.
  Should we officially continue support for 0.94? How long?
 
  Thanks.
 
  -- Lars
 
 



Re: how to export data from hbase to mysql?

2013-08-27 Thread James Taylor
Or if you'd like to be able to use SQL directly on it, take a look at
Phoenix (https://github.com/forcedotcom/phoenix).

James

On Aug 27, 2013, at 8:14 PM, Jean-Marc Spaggiari
jean-m...@spaggiari.org wrote:

 Take a look at sqoop?
 Le 2013-08-27 23:08, ch huang justlo...@gmail.com a écrit :

 hi,all:
 any good idea? thanks



Re: Client Get vs Coprocessor scan performance

2013-08-19 Thread James Taylor
Kiru,
Is the column qualifier for the key value storing the double different
for different rows? Not sure I understand what you're grouping over.
Maybe  5 rows worth of sample input and expected output would help.
Thanks,
James


On Aug 19, 2013, at 1:37 AM, Kiru Pakkirisamy kirupakkiris...@yahoo.com wrote:

 James,
 I have only one family -cp. Yes, that is how I store the Double. No, the 
 doubles are always positive.
 The keys are A14568  Less than a million and I added the alphabets to 
 randomize them.
 I group them based on the C_ suffix and say order them by the Double (to 
 simplify it).
 Is there a way  to do a sort of user defined function on a column  ? that 
 would take care of my calculation on that double.
 Thanks again.

 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com


 
 From: James Taylor jtay...@salesforce.com
 To: Kiru Pakkirisamy kirupakkiris...@yahoo.com
 Cc: user@hbase.apache.org user@hbase.apache.org
 Sent: Sunday, August 18, 2013 5:34 PM
 Subject: Re: Client Get vs Coprocessor scan performance


 Kiru,
 What's your column family name? Just to confirm, the column qualifier of
 your key value is C_10345 and this stores a value as a Double using
 Bytes.toBytes(double)? Are any of the Double values negative? Any other key
 values?

 Can you give me an idea of the kind of fuzzy filtering you're doing on the
 7 char row key? We may want to model that as a set of row key columns in
 Phoenix to leverage the skip scan more.

 How about I model your aggregation as an AVG over a group of rows? What
 would your GROUP BY expression look like? Are you grouping based on a part
 of the 7 char row key? Or on some other key value?

 Thanks,
 James


 On Sun, Aug 18, 2013 at 2:16 PM, Kiru Pakkirisamy kirupakkiris...@yahoo.com
 wrote:

 James,
 Rowkey - String - len - 7
 Col = String - variable length - but looks C_10345
 Col value = Double

 If I can create a Phoenix schema mapping to this existing table that would
 be great. I actually do a group by the column values and return another
 value which is a function of the value and an input double value. Input is
 a MapString, Double and return is also a MapString, Double.


 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com

--
   *From:* James Taylor jtay...@salesforce.com
 *To:* user@hbase.apache.org; Kiru Pakkirisamy kirupakkiris...@yahoo.com
 *Sent:* Sunday, August 18, 2013 2:07 PM

 *Subject:* Re: Client Get vs Coprocessor scan performance

 Kiru,
 If you're able to post the key values, row key structure, and data types
 you're using, I can post the Phoenix code to query against it. You're doing
 some kind of aggregation too, right? If you could explain that part too,
 that would be helpful. It's likely that you can just query the existing
 HBase data you've already created on the same cluster you're already using
 (provided you put the phoenix jar on all the region servers - use our 2.0.0
 version that just came out). Might be interesting to compare the amount of
 code necessary in each approach as well.
 Thanks,
 James


 On Sun, Aug 18, 2013 at 12:16 PM, Kiru Pakkirisamy 
 kirupakkiris...@yahoo.com wrote:

 James,
 I am using the FuzzyRowFilter or the Gets within  a Coprocessor. Looks
 like I cannot use your SkipScanFilter by itself as it has lots of phoenix
 imports. I thought of writing my own Custom filter and saw that the
 FuzzyRowFilter in the 0.94 branch also had an implementation for
 getNextKeyHint(),  only that it works well only with fixed length keys if I
 wanted a complete match of the keys. After my padding my keys to fixed
 length it seems to be fine.
 Once I confirm some key locality and other issues (like heap), I will try
 to bench mark this table alone against Phoenix on another cluster. Thanks.

 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com


 
   From: James Taylor jtay...@salesforce.com
 To: user@hbase.apache.org user@hbase.apache.org
 Cc: Kiru Pakkirisamy kirupakkiris...@yahoo.com
 Sent: Sunday, August 18, 2013 11:44 AM
 Subject: Re: Client Get vs Coprocessor scan performance


 Would be interesting to compare against Phoenix's Skip Scan
 (
 http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html
 )
 which does a scan through a coprocessor and is more than 2x faster
 than multi Get (plus handles multi-range scans in addition to point
 gets).

 James

 On Aug 18, 2013, at 6:39 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. Get'ting 100 rows seems to be faster than the FuzzyRowFilter (mask on
 the whole length of the key)

 In this case the Get's are very selective. The number of rows
 FuzzyRowFilter
 was evaluated against would be much higher.
 It would be nice if you remember the time each took.

 bq. Also, I am seeing very bad concurrent query performance

 Were the multi Get's performed by your coprocessor within region boundary
 of the respective

Re: Client Get vs Coprocessor scan performance

2013-08-18 Thread James Taylor
Would be interesting to compare against Phoenix's Skip Scan
(http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html)
which does a scan through a coprocessor and is more than 2x faster
than multi Get (plus handles multi-range scans in addition to point
gets).

James

On Aug 18, 2013, at 6:39 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. Get'ting 100 rows seems to be faster than the FuzzyRowFilter (mask on
 the whole length of the key)

 In this case the Get's are very selective. The number of rows FuzzyRowFilter
 was evaluated against would be much higher.
 It would be nice if you remember the time each took.

 bq. Also, I am seeing very bad concurrent query performance

 Were the multi Get's performed by your coprocessor within region boundary
 of the respective coprocessor ? Just to confirm.

 bq. that would make Coprocessors almost single threaded across multiple
 invocations ?

 Let me dig into code some more.

 Cheers


 On Sat, Aug 17, 2013 at 10:34 PM, Kiru Pakkirisamy 
 kirupakkiris...@yahoo.com wrote:

 Ted,
 On a table with 600K rows, Get'ting 100 rows seems to be faster than the
 FuzzyRowFilter (mask on the whole length of the key). I thought the
 FuzzyRowFilter's  SEEK_NEXT_USING_HINT would help.  All this on the client
 side, I have not changed my CoProcessor to use the FuzzyRowFilter based on
 the client side performance (still doing multiple get inside the
 coprocessor). Also, I am seeing very bad concurrent query performance. Are
 there any thing that would make Coprocessors almost single threaded across
 multiple invocations ?
 Again, all this after putting in 0.94.10 (for hbase-6870 sake) which seems
 to be very good in bringing up the regions online fast and balanced. Thanks
 and much appreciated.

 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com


 
 From: Ted Yu yuzhih...@gmail.com
 To: user@hbase.apache.org user@hbase.apache.org
 Sent: Saturday, August 17, 2013 4:19 PM
 Subject: Re: Client Get vs Coprocessor scan performance


 HBASE-6870 targeted whole table scanning for each coprocessorService call
 which exhibited itself through:

 HTable#coprocessorService - getStartKeysInRange - getStartEndKeys -
 getRegionLocations - MetaScanner.allTableRegions(getConfiguration(),
 getTableName(), false)

 The cached region locations in HConnectionImplementation would be used.

 Cheers


 On Sat, Aug 17, 2013 at 2:21 PM, Asaf Mesika asaf.mes...@gmail.com
 wrote:

 Ted, can you elaborate a little bit why this issue boosts performance?
 I couldn't figure out from the issue comments if they execCoprocessor
 scans
 the entire .META. table or and entire table, to understand the actual
 improvement.

 Thanks!




 On Fri, Aug 9, 2013 at 8:44 AM, Ted Yu yuzhih...@gmail.com wrote:

 I think you need HBASE-6870 which went into 0.94.8

 Upgrading should boost coprocessor performance.

 Cheers

 On Aug 8, 2013, at 10:21 PM, Kiru Pakkirisamy 
 kirupakkiris...@yahoo.com

 wrote:

 Ted,
 Here is the method signature/protocol
 public MapString, Double getFooMapString, Double input,
 int topN) throws IOException;

 There are 31 regions on 4 nodes X 8 CPU.
 I am on 0.94.6 (from Hortonworks).
 I think it seems to behave like what linwukang says, - it is almost a
 full table scan in the coprocessor.
 Actually, when I set more specific ColumnPrefixFilters performance
 went
 down.
 I want to do things on the server side because, I dont want to be
 sending 500K column/values to the client.
 I cannot believe a single-threaded client which does some
 calculations
 and group-by  beats the coprocessor running in 31 regions.

 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com


 
 From: Ted Yu yuzhih...@gmail.com
 To: user@hbase.apache.org; Kiru Pakkirisamy 
 kirupakkiris...@yahoo.com

 Sent: Thursday, August 8, 2013 8:40 PM
 Subject: Re: Client Get vs Coprocessor scan performance


 Can you give us a bit more information ?

 How do you deliver the 55 rowkeys to your endpoint ?
 How many regions do you have for this table ?

 What HBase version are you using ?

 Thanks

 On Thu, Aug 8, 2013 at 6:43 PM, Kiru Pakkirisamy
 kirupakkiris...@yahoo.comwrote:

 Hi,
 I am finding an odd behavior with the Coprocessor performance
 lagging
 a
 client side Get.
 I have a table with 50 rows. Each have variable # of columns in
 one
 column family (in this case about 60 columns in total are
 processed)
 When I try to get specific 55 rows, the client side completes in
 half-the
 time as the coprocessor endpoint.
 I am using  55 RowFilters on the Coprocessor scan side. The rows are
 processed are exactly the same way in both the cases.
 Any pointers on how to debug this scenario ?

 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com





Re: Client Get vs Coprocessor scan performance

2013-08-18 Thread James Taylor
Kiru,
If you're able to post the key values, row key structure, and data types
you're using, I can post the Phoenix code to query against it. You're doing
some kind of aggregation too, right? If you could explain that part too,
that would be helpful. It's likely that you can just query the existing
HBase data you've already created on the same cluster you're already using
(provided you put the phoenix jar on all the region servers - use our 2.0.0
version that just came out). Might be interesting to compare the amount of
code necessary in each approach as well.
Thanks,
James


On Sun, Aug 18, 2013 at 12:16 PM, Kiru Pakkirisamy 
kirupakkiris...@yahoo.com wrote:

 James,
 I am using the FuzzyRowFilter or the Gets within  a Coprocessor. Looks
 like I cannot use your SkipScanFilter by itself as it has lots of phoenix
 imports. I thought of writing my own Custom filter and saw that the
 FuzzyRowFilter in the 0.94 branch also had an implementation for
 getNextKeyHint(),  only that it works well only with fixed length keys if I
 wanted a complete match of the keys. After my padding my keys to fixed
 length it seems to be fine.
 Once I confirm some key locality and other issues (like heap), I will try
 to bench mark this table alone against Phoenix on another cluster. Thanks.

 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com


 
  From: James Taylor jtay...@salesforce.com
 To: user@hbase.apache.org user@hbase.apache.org
 Cc: Kiru Pakkirisamy kirupakkiris...@yahoo.com
 Sent: Sunday, August 18, 2013 11:44 AM
 Subject: Re: Client Get vs Coprocessor scan performance


 Would be interesting to compare against Phoenix's Skip Scan
 (
 http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html
 )
 which does a scan through a coprocessor and is more than 2x faster
 than multi Get (plus handles multi-range scans in addition to point
 gets).

 James

 On Aug 18, 2013, at 6:39 AM, Ted Yu yuzhih...@gmail.com wrote:

  bq. Get'ting 100 rows seems to be faster than the FuzzyRowFilter (mask on
  the whole length of the key)
 
  In this case the Get's are very selective. The number of rows
 FuzzyRowFilter
  was evaluated against would be much higher.
  It would be nice if you remember the time each took.
 
  bq. Also, I am seeing very bad concurrent query performance
 
  Were the multi Get's performed by your coprocessor within region boundary
  of the respective coprocessor ? Just to confirm.
 
  bq. that would make Coprocessors almost single threaded across multiple
  invocations ?
 
  Let me dig into code some more.
 
  Cheers
 
 
  On Sat, Aug 17, 2013 at 10:34 PM, Kiru Pakkirisamy 
  kirupakkiris...@yahoo.com wrote:
 
  Ted,
  On a table with 600K rows, Get'ting 100 rows seems to be faster than the
  FuzzyRowFilter (mask on the whole length of the key). I thought the
  FuzzyRowFilter's  SEEK_NEXT_USING_HINT would help.  All this on the
 client
  side, I have not changed my CoProcessor to use the FuzzyRowFilter based
 on
  the client side performance (still doing multiple get inside the
  coprocessor). Also, I am seeing very bad concurrent query performance.
 Are
  there any thing that would make Coprocessors almost single threaded
 across
  multiple invocations ?
  Again, all this after putting in 0.94.10 (for hbase-6870 sake) which
 seems
  to be very good in bringing up the regions online fast and balanced.
 Thanks
  and much appreciated.
 
  Regards,
  - kiru
 
 
  Kiru Pakkirisamy | webcloudtech.wordpress.com
 
 
  
  From: Ted Yu yuzhih...@gmail.com
  To: user@hbase.apache.org user@hbase.apache.org
  Sent: Saturday, August 17, 2013 4:19 PM
  Subject: Re: Client Get vs Coprocessor scan performance
 
 
  HBASE-6870 targeted whole table scanning for each coprocessorService
 call
  which exhibited itself through:
 
  HTable#coprocessorService - getStartKeysInRange - getStartEndKeys -
  getRegionLocations - MetaScanner.allTableRegions(getConfiguration(),
  getTableName(), false)
 
  The cached region locations in HConnectionImplementation would be used.
 
  Cheers
 
 
  On Sat, Aug 17, 2013 at 2:21 PM, Asaf Mesika asaf.mes...@gmail.com
  wrote:
 
  Ted, can you elaborate a little bit why this issue boosts performance?
  I couldn't figure out from the issue comments if they execCoprocessor
  scans
  the entire .META. table or and entire table, to understand the actual
  improvement.
 
  Thanks!
 
 
 
 
  On Fri, Aug 9, 2013 at 8:44 AM, Ted Yu yuzhih...@gmail.com wrote:
 
  I think you need HBASE-6870 which went into 0.94.8
 
  Upgrading should boost coprocessor performance.
 
  Cheers
 
  On Aug 8, 2013, at 10:21 PM, Kiru Pakkirisamy 
  kirupakkiris...@yahoo.com
 
  wrote:
 
  Ted,
  Here is the method signature/protocol
  public MapString, Double getFooMapString, Double input,
  int topN) throws IOException;
 
  There are 31 regions on 4 nodes X 8 CPU.
  I am on 0.94.6 (from Hortonworks).
  I think it seems to behave like

Re: Client Get vs Coprocessor scan performance

2013-08-18 Thread James Taylor
Kiru,
What's your column family name? Just to confirm, the column qualifier of
your key value is C_10345 and this stores a value as a Double using
Bytes.toBytes(double)? Are any of the Double values negative? Any other key
values?

Can you give me an idea of the kind of fuzzy filtering you're doing on the
7 char row key? We may want to model that as a set of row key columns in
Phoenix to leverage the skip scan more.

How about I model your aggregation as an AVG over a group of rows? What
would your GROUP BY expression look like? Are you grouping based on a part
of the 7 char row key? Or on some other key value?

Thanks,
James


On Sun, Aug 18, 2013 at 2:16 PM, Kiru Pakkirisamy kirupakkiris...@yahoo.com
 wrote:

 James,
 Rowkey - String - len - 7
 Col = String - variable length - but looks C_10345
 Col value = Double

 If I can create a Phoenix schema mapping to this existing table that would
 be great. I actually do a group by the column values and return another
 value which is a function of the value and an input double value. Input is
 a MapString, Double and return is also a MapString, Double.


 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com

   --
  *From:* James Taylor jtay...@salesforce.com
 *To:* user@hbase.apache.org; Kiru Pakkirisamy kirupakkiris...@yahoo.com
 *Sent:* Sunday, August 18, 2013 2:07 PM

 *Subject:* Re: Client Get vs Coprocessor scan performance

 Kiru,
 If you're able to post the key values, row key structure, and data types
 you're using, I can post the Phoenix code to query against it. You're doing
 some kind of aggregation too, right? If you could explain that part too,
 that would be helpful. It's likely that you can just query the existing
 HBase data you've already created on the same cluster you're already using
 (provided you put the phoenix jar on all the region servers - use our 2.0.0
 version that just came out). Might be interesting to compare the amount of
 code necessary in each approach as well.
 Thanks,
 James


 On Sun, Aug 18, 2013 at 12:16 PM, Kiru Pakkirisamy 
 kirupakkiris...@yahoo.com wrote:

 James,
 I am using the FuzzyRowFilter or the Gets within  a Coprocessor. Looks
 like I cannot use your SkipScanFilter by itself as it has lots of phoenix
 imports. I thought of writing my own Custom filter and saw that the
 FuzzyRowFilter in the 0.94 branch also had an implementation for
 getNextKeyHint(),  only that it works well only with fixed length keys if I
 wanted a complete match of the keys. After my padding my keys to fixed
 length it seems to be fine.
 Once I confirm some key locality and other issues (like heap), I will try
 to bench mark this table alone against Phoenix on another cluster. Thanks.

 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com


 
  From: James Taylor jtay...@salesforce.com
 To: user@hbase.apache.org user@hbase.apache.org
 Cc: Kiru Pakkirisamy kirupakkiris...@yahoo.com
 Sent: Sunday, August 18, 2013 11:44 AM
 Subject: Re: Client Get vs Coprocessor scan performance


 Would be interesting to compare against Phoenix's Skip Scan
 (
 http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html
 )
 which does a scan through a coprocessor and is more than 2x faster
 than multi Get (plus handles multi-range scans in addition to point
 gets).

 James

 On Aug 18, 2013, at 6:39 AM, Ted Yu yuzhih...@gmail.com wrote:

  bq. Get'ting 100 rows seems to be faster than the FuzzyRowFilter (mask on
  the whole length of the key)
 
  In this case the Get's are very selective. The number of rows
 FuzzyRowFilter
  was evaluated against would be much higher.
  It would be nice if you remember the time each took.
 
  bq. Also, I am seeing very bad concurrent query performance
 
  Were the multi Get's performed by your coprocessor within region boundary
  of the respective coprocessor ? Just to confirm.
 
  bq. that would make Coprocessors almost single threaded across multiple
  invocations ?
 
  Let me dig into code some more.
 
  Cheers
 
 
  On Sat, Aug 17, 2013 at 10:34 PM, Kiru Pakkirisamy 
  kirupakkiris...@yahoo.com wrote:
 
  Ted,
  On a table with 600K rows, Get'ting 100 rows seems to be faster than the
  FuzzyRowFilter (mask on the whole length of the key). I thought the
  FuzzyRowFilter's  SEEK_NEXT_USING_HINT would help.  All this on the
 client
  side, I have not changed my CoProcessor to use the FuzzyRowFilter based
 on
  the client side performance (still doing multiple get inside the
  coprocessor). Also, I am seeing very bad concurrent query performance.
 Are
  there any thing that would make Coprocessors almost single threaded
 across
  multiple invocations ?
  Again, all this after putting in 0.94.10 (for hbase-6870 sake) which
 seems
  to be very good in bringing up the regions online fast and balanced.
 Thanks
  and much appreciated.
 
  Regards,
  - kiru
 
 
  Kiru Pakkirisamy | webcloudtech.wordpress.com

Re: [ANNOUNCE] Secondary Index in HBase - from Huawei

2013-08-13 Thread James Taylor
Fantastic! Let me know if you're up for surfacing this through Phoenix.
Regards,
James


On Tue, Aug 13, 2013 at 7:48 AM, Anil Gupta anilgupt...@gmail.com wrote:

 Excited to see this!

 Best Regards,
 Anil

 On Aug 13, 2013, at 6:17 AM, zhzf jeff jeff.z...@gmail.com wrote:

  very google local index solution.
 
 
  2013/8/13 Ted Yu yuzhih...@gmail.com
 
  Nice.
 
  Will pay attention to upcoming patch on HBASE-9203.
 
  On Aug 12, 2013, at 11:19 PM, rajeshbabu chintaguntla 
  rajeshbabu.chintagun...@huawei.com wrote:
 
  Hi,
 
  We have been working on implementing secondary index in HBase, and had
  shared an overview of our design in the 2012  Hadoop Technical
 Conference
  at Beijing(http://bit.ly/hbtc12-hindex). We are pleased to open source
 it
  today.
 
  The project is available on github.
  https://github.com/Huawei-Hadoop/hindex
 
  It is 100% Java, compatible with Apache HBase 0.94.8, and is open
  sourced under Apache Software License v2.
 
  Following features are supported currently.
  -  multiple indexes on table,
  -  multi column index,
  -  index based on part of a column value,
  -  equals and range condition scans using index, and
  -  bulk loading data to indexed table (Indexing done with bulk
  load)
 
  We now plan to raise HBase JIRA(s) to make it available in Apache
  release, and can hopefully continue our work on this in the community.
 
  Regards
  Rajeshbabu
 



Re: Client Get vs Coprocessor scan performance

2013-08-12 Thread James Taylor
Hey Kiru,
Another option for you may be to use Phoenix (
https://github.com/forcedotcom/phoenix). In particular, our skip scan may
be what you're looking for:
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html.
Under-the-covers, the skip scan is doing a series of parallel scans taking
advantage of both coprocessors and the SEEK_NEXT_USING_HINT. As you can
see, it's more than 2x faster than the batched get approach. On top of
that, your queries do not only have to be doing point gets, but range scans
leverage it as well.
Thanks,
James
@JamesPlusPlus


On Sat, Aug 10, 2013 at 11:15 PM, Kiru Pakkirisamy 
kirupakkiris...@yahoo.com wrote:

 Maybe I spoke too soon. HBASE-6870 fixes the table scan (as verified by
 metrics of read requests on the region).
 But the performance with RowFilter is very bad (actually worse than a full
 table scan, dont know how this can happen).API
 I hope my API usage is right. All I am doing is add RowFilters to
 FilterList and setFilter on the scan.
 I tried looking into the AggregateImplementation  (which is mentioned as
 unit test for this bug)  but did not follow through because I am in a rush
 for a good workaround.
 I have now replaced RowFilters with a Get on the Region (in a loop) after
 making sure my key is within startKey and endKey of the region.
 I think this is getting my data right. Performance is very good, almost
 half that of the full scan code we had in the coprocessor earlier.
 Are there any gotchas/bad side-effects to using a Get on the Region ?

 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com


 
  From: Kiru Pakkirisamy kirupakkiris...@yahoo.com
 To: user@hbase.apache.org user@hbase.apache.org
 Sent: Friday, August 9, 2013 1:04 PM
 Subject: Re: Client Get vs Coprocessor scan performance


 I think this fixes my issues. On our dev cluster what used to take 1200
 msec is now in the 700-800 msec region. Thanks again.
 I will be soon deploying this to our Performance cluster where our query
 is at 15 secs range.

 Regards,
 - kiru


 Kiru Pakkirisamy | webcloudtech.wordpress.com


 
 From: Ted Yu yuzhih...@gmail.com
 To: user@hbase.apache.org user@hbase.apache.org
 Cc: user@hbase.apache.org user@hbase.apache.org
 Sent: Thursday, August 8, 2013 10:44 PM
 Subject: Re: Client Get vs Coprocessor scan performance


 I think you need HBASE-6870 which went into 0.94.8

 Upgrading should boost coprocessor performance.

 Cheers

 On Aug 8, 2013, at 10:21 PM, Kiru Pakkirisamy kirupakkiris...@yahoo.com
 wrote:

  Ted,
  Here is the method signature/protocol
  public MapString, Double getFooMapString, Double input,
  int topN) throws IOException;
 
  There are 31 regions on 4 nodes X 8 CPU.
  I am on 0.94.6 (from Hortonworks).
  I think it seems to behave like what linwukang says, - it is almost a
 full table scan in the coprocessor.
  Actually, when I set more specific ColumnPrefixFilters performance went
 down.
  I want to do things on the server side because, I dont want to be
 sending 500K column/values to the client.
  I cannot believe a single-threaded client which does some calculations
 and group-by  beats the coprocessor running in 31 regions.
 
  Regards,
  - kiru
 
 
  Kiru Pakkirisamy | webcloudtech.wordpress.com
 
 
  
  From: Ted Yu yuzhih...@gmail.com
  To: user@hbase.apache.org; Kiru Pakkirisamy kirupakkiris...@yahoo.com
  Sent: Thursday, August 8, 2013 8:40 PM
  Subject: Re: Client Get vs Coprocessor scan performance
 
 
  Can you give us a bit more information ?
 
  How do you deliver the 55 rowkeys to your endpoint ?
  How many regions do you have for this table ?
 
  What HBase version are you using ?
 
  Thanks
 
  On Thu, Aug 8, 2013 at 6:43 PM, Kiru Pakkirisamy
  kirupakkiris...@yahoo.comwrote:
 
  Hi,
  I am finding an odd behavior with the Coprocessor performance lagging a
  client side Get.
  I have a table with 50 rows. Each have variable # of columns in one
  column family (in this case about 60 columns in total are processed)
  When I try to get specific 55 rows, the client side completes in
 half-the
  time as the coprocessor endpoint.
  I am using  55 RowFilters on the Coprocessor scan side. The rows are
  processed are exactly the same way in both the cases.
  Any pointers on how to debug this scenario ?
 
  Regards,
  - kiru
 
 
  Kiru Pakkirisamy | webcloudtech.wordpress.com



Re: Help in designing row key

2013-07-03 Thread James Taylor

Hi Flavio,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? 
It will allow you to model your multi-part row key like this:


CREATE TABLE flavio.analytics (
source INTEGER,
type INTEGER,
qual VARCHAR,
hash VARCHAR,
ts DATE
CONSTRAINT pk PRIMARY KEY (source, type, qual, hash, ts) // Defines 
columns that make up the row key

)

Then you can issue SQL queries like this (to query for the last 7 days 
worth of data):
SELECT * FROM flavio.analytics WHERE source IN (1,2,5) AND type IN 
(55,66) AND ts  CURRENT_DATE() - 7


This will internally take advantage of our SkipScan 
(http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html) 
to jump through your key space similar to FuzzyRowFilter, but in 
parallel from the client taking into account your region boundaries.


Or do more complex GROUP BY queries like this (to aggregate over the 
last 30 days worth of data, bucketized by day):
SELECT type,COUNT(*) FROM flavio.analytics WHERE ts  CURRENT_DATE() - 
30 GROUP BY type,TRUNCATE(ts,'DAY')


No need to worry about lexicographical sort order, flipping sign bits, 
normalizing/padding integer values, and all the other nuances of working 
with an API that works at the level of bytes. No need to write and 
manage installation of your own coprocessors to make aggregation 
efficient, perform topN queries, etc.


HTH.

Regards,
James
@JamesPlusPlus

On 07/03/2013 02:58 AM, Anoop John wrote:

When you make the RK and convert the int parts into byte[] ( Use
org.apache.hadoop.hbase.util.Bytes#toBytes(*int) *)  it will give 4 bytes
for every byte..  Be careful about the ordering...   When u convert a +ve
and -ve integer into byte[] and u do Lexiographical compare (as done in
HBase) u will see -ve number being greater than +ve..  If you dont have to
do deal with -ve numbers no issues  :)

Well when all the parts of the RK is of fixed width u will need any
seperator??

-Anoop-

On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier pomperma...@okkam.itwrote:


Yeah, I was thinking to use a normalization step in order to allow the use
of FuzzyRowFilter but what is not clear to me is if integers must also be
normalized or not.
I will explain myself better. Suppose that i follow your advice and I
produce keys like:
  - 1|1|somehash|sometimestamp
  - 55|555|somehash|sometimestamp

Whould they match the same pattern or do I have to normalize them to the
following?
  - 001|001|somehash|sometimestamp
  - 055|555|somehash|sometimestamp

Moreover, I noticed that you used dots ('.') to separate things instead of
pipe ('|')..is there a reason for that (maybe performance or whatever) or
is just your favourite separator?

Best,
Flavio


On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak m...@axiak.net wrote:


I'm not sure if you're eliding this fact or not, but you'd be much
better off if you used a fixed-width format for your keys. So in your
example, you'd have:

PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
hash.8-byte timestamp

Example: \x00\x00\x00\x01\x00\x00\x02\x03

The advantage of this is not only that it's significantly less data
(remember your key is stored on each KeyValue), but also you can now
use FuzzyRowFilter and other techniques to quickly perform scans. The
disadvantage is that you have to normalize the source- integer but I
find I can either store that in an enum or cache it for a long time so
it's not a big issue.

-Mike

On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier pomperma...@okkam.it

wrote:

Thank you very much for the great support!
This is how I thought to design my key:

PATTERN: source|type|qualifier|hash(name)|timestamp
EXAMPLE:
google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753

Do you think my key could be good for my scope (my search will be
essentially by source or source|type)?
Another point is that initially I will not have so many sources, so I

will

probably have only google|* but in the next phases there could be more
sources..

Best,
Flavio

On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu yuzhih...@gmail.com wrote:


For #1, yes - the client receives less data after filtering.

For #2, please take a look at TestMultiVersions
(./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in

0.94)

for time range:

 scan = new Scan();

 scan.setTimeRange(1000L, Long.MAX_VALUE);
For row key selection, you need a filter. Take a look at
FuzzyRowFilter.java

Cheers

On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier 

pomperma...@okkam.it

wrote:
  Thanks for the reply! I thus have two questions more:

1) is it true that filtering on timestamps doesn't affect

performance..?

2) could you send me a little snippet of how you would do such a

filter

(by

row key + timestamps)? For example get all rows whose key starts

with

'someid-' and whose timestamps is greater than some timestamp?

Best,
Flavio


On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu yuzhih...@gmail.com wrote:


bq. Using timestamp 

Re: Help in designing row key

2013-07-03 Thread James Taylor
Sure, but FYI Phoenix is not just faster, but much easier as well (as 
this email chain shows).


On 07/03/2013 04:25 AM, Flavio Pompermaier wrote:

No, I've never seen Phoenix, but it looks like a very useful project!
However I don't have such strict performance issues in my use case, I just
want to have balanced regions as much as possible.
So I think that in this case I will still use Bytes concatenation if
someone confirm I'm doing it in the right way.


On Wed, Jul 3, 2013 at 12:33 PM, James Taylor jtay...@salesforce.comwrote:


Hi Flavio,
Have you had a look at Phoenix 
(https://github.com/**forcedotcom/phoenixhttps://github.com/forcedotcom/phoenix)?
It will allow you to model your multi-part row key like this:

CREATE TABLE flavio.analytics (
 source INTEGER,
 type INTEGER,
 qual VARCHAR,
 hash VARCHAR,
 ts DATE
 CONSTRAINT pk PRIMARY KEY (source, type, qual, hash, ts) // Defines
columns that make up the row key
)

Then you can issue SQL queries like this (to query for the last 7 days
worth of data):
SELECT * FROM flavio.analytics WHERE source IN (1,2,5) AND type IN (55,66)
AND ts  CURRENT_DATE() - 7

This will internally take advantage of our SkipScan (http://phoenix-hbase.
**blogspot.com/2013/05/**demystifying-skip-scan-in-**phoenix.htmlhttp://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html)
to jump through your key space similar to FuzzyRowFilter, but in parallel
from the client taking into account your region boundaries.

Or do more complex GROUP BY queries like this (to aggregate over the last
30 days worth of data, bucketized by day):
SELECT type,COUNT(*) FROM flavio.analytics WHERE ts  CURRENT_DATE() - 30
GROUP BY type,TRUNCATE(ts,'DAY')

No need to worry about lexicographical sort order, flipping sign bits,
normalizing/padding integer values, and all the other nuances of working
with an API that works at the level of bytes. No need to write and manage
installation of your own coprocessors to make aggregation efficient,
perform topN queries, etc.

HTH.

Regards,
James
@JamesPlusPlus


On 07/03/2013 02:58 AM, Anoop John wrote:


When you make the RK and convert the int parts into byte[] ( Use
org.apache.hadoop.hbase.util.**Bytes#toBytes(*int) *)  it will give 4
bytes
for every byte..  Be careful about the ordering...   When u convert a +ve
and -ve integer into byte[] and u do Lexiographical compare (as done in
HBase) u will see -ve number being greater than +ve..  If you dont have to
do deal with -ve numbers no issues  :)

Well when all the parts of the RK is of fixed width u will need any
seperator??

-Anoop-

On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier pomperma...@okkam.it

wrote:

  Yeah, I was thinking to use a normalization step in order to allow the

use
of FuzzyRowFilter but what is not clear to me is if integers must also be
normalized or not.
I will explain myself better. Suppose that i follow your advice and I
produce keys like:
   - 1|1|somehash|sometimestamp
   - 55|555|somehash|sometimestamp

Whould they match the same pattern or do I have to normalize them to the
following?
   - 001|001|somehash|sometimestamp
   - 055|555|somehash|sometimestamp

Moreover, I noticed that you used dots ('.') to separate things instead
of
pipe ('|')..is there a reason for that (maybe performance or whatever) or
is just your favourite separator?

Best,
Flavio


On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak m...@axiak.net wrote:

  I'm not sure if you're eliding this fact or not, but you'd be much

better off if you used a fixed-width format for your keys. So in your
example, you'd have:

PATTERN: source(4-byte-int).type(4-**byte-int or smaller).fixed 128-bit
hash.8-byte timestamp

Example: \x00\x00\x00\x01\x00\x00\x02\**x03

The advantage of this is not only that it's significantly less data
(remember your key is stored on each KeyValue), but also you can now
use FuzzyRowFilter and other techniques to quickly perform scans. The
disadvantage is that you have to normalize the source- integer but I
find I can either store that in an enum or cache it for a long time so
it's not a big issue.

-Mike

On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier 
pomperma...@okkam.it

wrote:


Thank you very much for the great support!
This is how I thought to design my key:

PATTERN: source|type|qualifier|hash(**name)|timestamp
EXAMPLE:
google|appliance|oven|**be9173589a7471a7179e928adc1a86**
f7|1372837702753

Do you think my key could be good for my scope (my search will be
essentially by source or source|type)?
Another point is that initially I will not have so many sources, so I


will


probably have only google|* but in the next phases there could be more
sources..

Best,
Flavio

On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu yuzhih...@gmail.com wrote:

  For #1, yes - the client receives less data after filtering.

For #2, please take a look at TestMultiVersions
(./src/test/java/org/apache/**hadoop/hbase/**TestMultiVersions.java
in


0.94)

for time range

Re: Schema design for filters

2013-06-27 Thread James Taylor
Hi Kristoffer,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? You 
could model your schema much like an O/R mapper and issue SQL queries through 
Phoenix for your filtering.

James
@JamesPlusPlus
http://phoenix-hbase.blogspot.com

On Jun 27, 2013, at 4:39 PM, Kristoffer Sjögren sto...@gmail.com wrote:

 Thanks for your help Mike. Much appreciated.
 
 I dont store rows/columns in JSON format. The schema is exactly that of a
 specific java class, where the rowkey is a unique object identifier with
 the class type encoded into it. Columns are the field names of the class
 and the values are that of the object instance.
 
 Did think about coprocessors but the schema is discovered a runtime and I
 cant hard code it.
 
 However, I still believe that filters might work. Had a look
 at SingleColumnValueFilter and this filter is be able to target specific
 column qualifiers with specific WritableByteArrayComparables.
 
 But list comparators are still missing... So I guess the only way is to
 write these comparators?
 
 Do you follow my reasoning? Will it work?
 
 
 
 
 On Fri, Jun 28, 2013 at 12:58 AM, Michael Segel
 michael_se...@hotmail.comwrote:
 
 Ok...
 
 If you want to do type checking and schema enforcement...
 
 You will need to do this as a coprocessor.
 
 The quick and dirty way... (Not recommended) would be to hard code the
 schema in to the co-processor code.)
 
 A better way... at start up, load up ZK to manage the set of known table
 schemas which would be a map of column qualifier to data type.
 (If JSON then you need to do a separate lookup to get the records schema)
 
 Then a single java class that does the look up and then handles the known
 data type comparators.
 
 Does this make sense?
 (Sorry, kinda was thinking this out as I typed the response. But it should
 work )
 
 At least it would be a design approach I would talk. YMMV
 
 Having said that, I expect someone to say its a bad idea and that they
 have a better solution.
 
 HTH
 
 -Mike
 
 On Jun 27, 2013, at 5:13 PM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 I see your point. Everything is just bytes.
 
 However, the schema is known and every row is formatted according to this
 schema, although some columns may not exist, that is, no value exist for
 this property on this row.
 
 So if im able to apply these typed comparators to the right cell values
 it may be possible? But I cant find a filter that target specific
 columns?
 
 Seems like all filters scan every column/qualifier and there is no way of
 knowing what column is currently being evaluated?
 
 
 On Thu, Jun 27, 2013 at 11:51 PM, Michael Segel
 michael_se...@hotmail.comwrote:
 
 You have to remember that HBase doesn't enforce any sort of typing.
 That's why this can be difficult.
 
 You'd have to write a coprocessor to enforce a schema on a table.
 Even then YMMV if you're writing JSON structures to a column because
 while
 the contents of the structures could be the same, the actual strings
 could
 differ.
 
 HTH
 
 -Mike
 
 On Jun 27, 2013, at 4:41 PM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 I realize standard comparators cannot solve this.
 
 However I do know the type of each column so writing custom list
 comparators for boolean, char, byte, short, int, long, float, double
 seems
 quite straightforward.
 
 Long arrays, for example, are stored as a byte array with 8 bytes per
 item
 so a comparator might look like this.
 
 public class LongsComparator extends WritableByteArrayComparable {
  public int compareTo(byte[] value, int offset, int length) {
  long[] values = BytesUtils.toLongs(value, offset, length);
  for (long longValue : values) {
  if (longValue == val) {
  return 0;
  }
  }
  return 1;
  }
 }
 
 public static long[] toLongs(byte[] value, int offset, int length) {
  int num = (length - offset) / 8;
  long[] values = new long[num];
  for (int i = offset; i  num; i++) {
  values[i] = getLong(value, i * 8);
  }
  return values;
 }
 
 
 Strings are similar but would require charset and length for each
 string.
 
 public class StringsComparator extends WritableByteArrayComparable  {
  public int compareTo(byte[] value, int offset, int length) {
  String[] values = BytesUtils.toStrings(value, offset, length);
  for (String stringValue : values) {
  if (val.equals(stringValue)) {
  return 0;
  }
  }
  return 1;
  }
 }
 
 public static String[] toStrings(byte[] value, int offset, int length)
 {
  ArrayListString values = new ArrayListString();
  int idx = 0;
  ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
  while (idx  length) {
  int size = buffer.getInt();
  byte[] bytes = new byte[size];
  buffer.get(bytes);
  values.add(new String(bytes));
  idx += 4 + size;
  }
  return values.toArray(new String[values.size()]);
 }
 
 
 Am I on the right track or maybe overlooking some implementation
 details?
 Not 

Re: HBase: Filters not working for negative integers

2013-06-26 Thread James Taylor
You'll need to flip the sign bit for ints and longs like Phoenix does. 
Feel free to borrow our serializers (in PDataType) or just use Phoenix.


Thanks,

James

On 06/26/2013 12:16 AM, Madhukar Pandey wrote:

Please ignore my previous mail..there was some copy paste issue in it..
this is the correct mail..


We have implemented QualifierFilter as well as ValueFilter (using
BinaryComparator) of Hbase successfully and they are working fine for most
of our cases. However they are failing in cases like number  -10 or number
 -10

Please note that number = -10 is working perfectly fine. Also, number  10
and number  10 are also working fine.

If you want to see the code, please check following links:
1. QualifierFilter - Relevant lines are 126-142
https://github.com/deanhiller/playorm/blob/master/src/main/java/com/alvazan/orm/layer9z/spi/db/hadoop/CursorColumnSliceHbase.java


2. Value Filter - Relevant lines are 107-128
https://github.com/deanhiller/playorm/blob/master/src/main/java/com/alvazan/orm/layer9z/spi/db/hadoop/CursorOfHbaseIndexes.java


As per this blog(http://flurrytechblog.wordpress.com/2012/06/12/137492485/),
this can be an issue with serialization if we want to store negative values
for rowkeys and we should write our own serializers for comparison.
So we wanted to know:
1. Is it really necessary to write our own serializer in this case?
2. If yes, how? Any example would be great help.





On Wed, Jun 26, 2013 at 12:33 PM, Madhukar Pandey madhu...@easility.com wrote:


We have implemented QualifierFilter as well as ValueFilter (using
BinaryComparator) of HBase successfully and they are working fine for
most of our cases. However they are failing in cases like number < -10 or
number > -10.

Please note that number = -10 is working perfectly fine. Also, number < 10
and number > 10 are also working fine.

If you want to see the code, please check following links:
1. QualifierFilter - Relevant lines are 126-142
https://github.com/deanhiller/playorm/blob/master/src/main/java/com/alvazan/orm/layer9z/spi/db/hadoop/CursorColumnSliceHbase.java
2. Value Filter - Relevant lines are 107-128
https://github.com/deanhiller/playorm/blob/master/src/main/java/com/alvazan/orm/layer9z/spi/db/hadoop/CursorOfHbaseIndexes.java

As per this blog (http://flurrytechblog.wordpress.com/2012/06/12/137492485/),
this can be an issue with serialization if we want to store negative values
for rowkeys and we should write our own serializers for comparison.
So we wanted to know:
1. Is it really necessary to write our own serializer in this case?
2. If yes, how? Any example would be great help.





Re: Scan performance

2013-06-22 Thread James Taylor
Hi Tony,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix), a SQL 
skin over HBase? It has a skip scan that will let you model a multi part row 
key and skip through it efficiently as you've described. Take a look at this 
blog for more info: 
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html?m=1

Regards,
James

On Jun 22, 2013, at 6:29 AM, lars hofhansl la...@apache.org wrote:

 Yep generally you should design your keys such that start/stopKey can 
 efficiently narrow the scope.
 
 If that really cannot be done (and you should try hard), the 2nd  best option 
 are skip scans.
 
 Filters in HBase allow for providing the scanner framework with hints where 
 to go next.
 They can skip to the next column (to avoid looking at many versions), to the 
 next row (to avoid looking at many columns), or they can provide a custom 
 seek hint to a specific key value. The latter is what FuzzyRowFilter does.
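 
 For example, a hedged sketch of a FuzzyRowFilter, under the assumption of
 fixed-width key parts (4-byte vid, 4-byte sid, 4-byte event), which this
 filter requires; in 0.94 the mask byte 0 means "must match" and 1 means
 "wildcard". The vid/logonEventId variables are just placeholders:
 
 byte[] fuzzyKey = new byte[12];
 Bytes.putInt(fuzzyKey, 0, vid);            // known vid in bytes 0..3
 // bytes 4..7 (the unknown sid) stay zero; the mask marks them as wildcards
 Bytes.putInt(fuzzyKey, 8, logonEventId);   // known event id in bytes 8..11
 byte[] mask = new byte[] {0,0,0,0, 1,1,1,1, 0,0,0,0};
 Scan scan = new Scan();
 scan.setFilter(new FuzzyRowFilter(
     Arrays.asList(new Pair<byte[], byte[]>(fuzzyKey, mask))));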
 
 
 -- Lars
 
 
 
 
 From: Anoop John anoop.hb...@gmail.com
 To: user@hbase.apache.org
 Sent: Friday, June 21, 2013 11:58 PM
 Subject: Re: Scan performance
 
 
 Have a look at FuzzyRowFilter
 
 -Anoop-
 
 On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean tony.d...@sas.com wrote:
 
 I understand more, but have additional questions about the internals...
 
 So, in this example I have 6000 rows X 40 columns in this table.  In this
 test my startRow and stopRow do not narrow the scan criteria, therefore all
 6000x40 KVs must be included in the search and thus read from disk and into
 memory.
 
 The first filter that I used was:
 Filter f2 = new SingleColumnValueFilter(cf, qualifier,
 CompareFilter.CompareOp.EQUAL, value);
 
 This means that HBase must look for the qualifier column on all 6000 rows.
 As you mention I could add certain columns to a different cf; but
 unfortunately, in my case there is no such small set of columns that will
 need to be compared (filtered on).  I could try to use indexes so that a
 complete row key can be calculated from a secondary index in order to
 perform a faster search against data in a primary table.  This requires
 additional tables and maintenance that I would like to avoid.
 
 I did try a row key filter with regex hoping that it would limit the
 number of rows that were read from disk.
 Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new
 RegexStringComparator(row_regexpr));
 
 My row keys are something like: vid,sid,event.  sid is not known at query
 time so I can use a regex similar to: vid,.*,Logon where Logon is the event
 that I am looking for in a particular visit.  In my test data this should
 have narrowed the scan to 1 row X 40 columns.  The best I could do for
 start/stop row is: vid,0 and vid,~ respectively.  I guess that is still
 going to cause all 6000 rows to be scanned, but the filtering should be
 more specific with the rowKey filter.  However, I did not see any
 performance improvement.  Anything obvious?
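 
 For reference, a sketch of that combination (assuming "vid" stands in for
 the actual visit id prefix; HBase 0.94 APIs):
 
 Scan buildLogonScan(String vid) {
   // Narrow the scan with start/stop rows on the known prefix, then apply
   // the row-key regex for the unknown sid component in the middle.
   Scan scan = new Scan();
   scan.setStartRow(Bytes.toBytes(vid + ","));
   scan.setStopRow(Bytes.toBytes(vid + ",~"));  // '~' sorts after the id characters
   scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
       new RegexStringComparator("^" + vid + ",.*,Logon$")));
   return scan;
 }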
 
 Do you have any other ideas to help out with performance when row key is:
 vid,sid,event and sid is not known at query time which leaves a gap in the
 start/stop row?  Too bad regex can't be used in start/stop row
 specification.  That's really what I need.
 
 Thanks again.
 -Tony
 
 -Original Message-
 From: Vladimir Rodionov [mailto:vrodio...@carrieriq.com]
 Sent: Friday, June 21, 2013 8:00 PM
 To: user@hbase.apache.org; lars hofhansl
 Subject: RE: Scan performance
 
 Lars,
 I thought that column family is the locality group and placement columns
 which are frequently accessed together into the same column family
 (locality group) is the obvious performance improvement tip. What are the
 essential column families for in this context?
 
 As for original question..  Unless you place your column into a separate
 column family in Table 2, you will need to scan (load from disk if not
 cached) ~ 40x more data for the second table (because you have 40 columns).
 This may explain why you see such a difference in execution time if all
 data needs to be loaded first from HDFS.
 
 Best regards,
 Vladimir Rodionov
 Principal Platform Engineer
 Carrier IQ, www.carrieriq.com
 e-mail: vrodio...@carrieriq.com
 
 
 From: lars hofhansl [la...@apache.org]
 Sent: Friday, June 21, 2013 3:37 PM
 To: user@hbase.apache.org
 Subject: Re: Scan performance
 
 HBase is a key value (KV) store. Each column is stored in its own KV, a
 row is just a set of KVs that happen to have the row key (which is the
 first part of the key).
 I tried to summarize this here:
 http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html)
 
 In the StoreFiles all KVs are sorted in row/column order, but HBase still
 needs to skip over many KVs in order to reach the next row. So more disk
 and memory IO is needed.
 
 If you using 0.94 there is a new feature essential column families. If
 you always search by the same column you can place that one in its own
 

Re: querying hbase

2013-06-01 Thread James Taylor
These approaches all sound somewhat brittle and unlikely to be relied on 
for a production system (more here: 
https://issues.apache.org/jira/browse/HBASE-8607). Sounds like a rolling 
restart is the best option in the near/medium term. Our pain points are 
more around how to get to the point where Phoenix can more easily be 
installed. Maybe https://issues.apache.org/jira/browse/HBASE-8400 would 
help?


I propose we move the discussion to those JIRAs.

On 06/01/2013 11:15 AM, Michael Segel wrote:

Well,

What happens when you restart the RS?

Suppose I'm running a scan on a completely different table and you restart the 
RS?
What happens to me?

I haven't thought through the whole problem, but you need to put each table's CP 
in to its own sandbox.
(There's more to it and would require some pizza, beer and a very large 
whiteboard)


On Jun 1, 2013, at 5:44 AM, Andrew Purtell apurt...@apache.org wrote:


Isn't the time to restart and the steps necessary more or less the same? Or
will the objects that hold the in memory state survive across the reload?
Will they still share a classloader (maintain equality tests)? What if the
implementation / bundle version changes? We are taking about an upgrade
scenario. Will we need to dump and restore in memory state to local disk,
pickle the state of an earlier version and have the latest version
unpickle, fixing up as needed? What happens if that fails midway?
The JITted code for the old bundle is unused and GCed now that the bundle
is upgraded, so we have to wait for runtime profiling and C2 to crunch the
bytecode again for the new bundle. Will all that need more time than just
restarting a JVM? Am I missing a simpler way?

On Saturday, June 1, 2013, Michel Segel wrote:


Is there a benefit to restarting a regionserver in an OSGi container

versus

restarting a Java process?

Was that rhetorical?

Absolutely.
Think of a production environment where you are using HBase to serve data
in real time.


Sent from a remote device. Please excuse any typos...

Mike Segel

On May 24, 2013, at 4:50 PM, Andrew Purtell apurt...@apache.org
wrote:


On Thu, May 23, 2013 at 5:10 PM, James Taylor 
jtay...@salesforce.com
wrote:


Has there been any discussions on running the HBase server in an OSGi
container?


I believe the only discussions have been on avoiding talk about

coprocessor

reloading, as it implies either a reimplementation of or taking on an

OSGi

runtime.

Is there a benefit to restarting a regionserver in an OSGi container

versus

restarting a Java process?

Or would that work otherwise like an update the coprocessor and filters

in

the container then trigger the embedded regionserver to do a quick close
and reopen of the regions?

--
Best regards,

  - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)


--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)




Re: querying hbase

2013-05-31 Thread James Taylor

On 05/24/2013 02:50 PM, Andrew Purtell wrote:

On Thu, May 23, 2013 at 5:10 PM, James Taylor jtay...@salesforce.com wrote:


Has there been any discussions on running the HBase server in an OSGi
container?


I believe the only discussions have been on avoiding talk about coprocessor
reloading, as it implies either a reimplementation of or taking on an OSGi
runtime.

Is there a benefit to restarting a regionserver in an OSGi container versus
restarting a Java process?

Or would that work otherwise like an update the coprocessor and filters in
the container then trigger the embedded regionserver to do a quick close
and reopen of the regions?

My thinking was that an OSGi container would allow a new version of a 
coprocessor (and/or custom filter) jar to be loaded. Class conflicts 
between the old jar and the new jar would no longer be a problem - you'd 
never need to unload the old jar. Instead, future HBase operations that 
invoke the coprocessor would cause the newly loaded jar to be used 
instead of the older one. I'm not sure if this is possible or not. The 
whole idea would be to prevent a rolling restart or region close/reopen.


Re: Couting number of records in a HBase table

2013-05-28 Thread James Taylor
Another option is Phoenix (https://github.com/forcedotcom/phoenix), 
where you'd do

SELECT count(*) FROM my_table
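
A minimal JDBC sketch of the same thing, assuming the Phoenix client jar is
on the classpath, "localhost" is your ZooKeeper quorum, and the driver class
name of that era (com.salesforce.phoenix.jdbc.PhoenixDriver) - adjust for
your install:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class CountRows {
  public static void main(String[] args) throws Exception {
    // driver class name assumed for the com.salesforce-era Phoenix releases
    Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver");
    Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
    ResultSet rs = conn.createStatement()
        .executeQuery("SELECT count(*) FROM my_table");
    rs.next();
    System.out.println("row count = " + rs.getLong(1));
    conn.close();
  }
}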

Regards,
James

On 05/28/2013 03:25 PM, Ted Yu wrote:

Take a look at http://hbase.apache.org/book.html#rowcounter

Cheers

On Tue, May 28, 2013 at 3:23 PM, Shahab Yunus shahab.yu...@gmail.com wrote:


Is there a faster way to get the count of rows in a HBase table
(potentially a huge one)? I am looking for ways other than the 'count'
shell command or any Pig script? Thanks.

Regards,
Shahab





Re: querying hbase

2013-05-23 Thread James Taylor
Actually, with the great work you guys have been doing and the 
resolution of HBASE-1936 by Jimmy Xiang, we'll be able to ease the 
installation of Phoenix in our next release. You'll still need to bounce 
the region servers to reload our custom filters and coprocessors, but 
you won't need to manually add the phoenix jar to the hbase classpath on 
each region server (as long as the installing user has permission to 
write into HDFS).


Has there been any discussions on running the HBase server in an OSGi 
container? That would potentially even alleviate the need to bounce the 
region servers. I didn't see a JIRA, so I created this one: 
https://issues.apache.org/jira/browse/HBASE-8607


Thanks,
James

On 05/23/2013 04:17 PM, Jean-Marc Spaggiari wrote:

Hi James,

Thanks for joining the thread to provide more feedback and valuable
information about Phoenix. I don't have deep knowledge of it, so it's good to
see you around.

The only thing I was referring is that applications I sent the links for
are simple jars that you can download locally and run without requiring any
specific rights to install/upload anything on any server. Just download,
click on it.

I might be wrong because I did not try Phoenix yet, but I think you need to
upload the JAR on all the region servers first, and then restart them,
right? People might not have the rights to do that. That's why I thought
Pheonix was overkill regarding the need to just list a table content on a
screen.

JM

2013/5/22 James Taylor jtay...@salesforce.com


Hey JM,
Can you expand on what you mean? Phoenix is a single jar, easily deployed
to any HBase cluster. It can map to existing HBase tables or create new
ones. It allows you to use SQL (a fairly popular language) to query your
data, and it surfaces its functionality as a JDBC driver so that it can
interop with the SQL ecosystem (which has been around for a while).
Thanks,
James


On 05/21/2013 08:41 PM, Jean-Marc Spaggiari wrote:


Using Phoenix for that is like trying to kill a mosquito with an atomic
bomb, no? ;)

Few easy to install and use tools which I already tried:
- http://sourceforge.net/projects/haredbhbaseclie/files/
- http://sourceforge.net/projects/hbasemanagergui/
- https://github.com/NiceSystems/hrider/wiki

There might be others, but those at least do the basic things you need to
look into your tables.

JM

2013/5/21 lars hofhansl la...@apache.org

  Maybe Phoenix 
(http://phoenix-hbase.blogspot.com/)

is what you are
looking for.

-- Lars

________________________________
From: Aji Janis aji1...@gmail.com
To: user user@hbase.apache.org
Sent: Tuesday, May 21, 2013 3:43 PM
Subject: Re: querying hbase


I haven't tried that because I don't know how to. Still I think I am
looking for a nice GUI interface that can take in HBase connection info
and
help me view the data something like pgadmin (or its php version), sql
developer, etc


On Tue, May 21, 2013 at 6:16 PM, Viral Bajaria viral.baja...@gmail.com


wrote:
The shell allows you to use filters just like the standard HBase API but
with jruby syntax. Have you tried that or that is too painful and you


want


a simpler tool ?

-Viral

On Tue, May 21, 2013 at 2:58 PM, Aji Janis aji1...@gmail.com wrote:

  are there any tools out there that can help in visualizing data stored
in
Hbase? I know the shell lets you do basic stuff. But if I don't know
what
rowid I am looking for or if I want to find rows with family say *name* (yes
SQL


like) are there any tools that can help with this? Not trying to use


this
on production (although that would be nice) just dev env for now. Thank
you


for any suggestions






Re: querying hbase

2013-05-22 Thread James Taylor

Hey JM,
Can you expand on what you mean? Phoenix is a single jar, easily 
deployed to any HBase cluster. It can map to existing HBase tables or 
create new ones. It allows you to use SQL (a fairly popular language) to 
query your data, and it surfaces its functionality as a JDBC driver so 
that it can interop with the SQL ecosystem (which has been around for a 
while).

Thanks,
James

On 05/21/2013 08:41 PM, Jean-Marc Spaggiari wrote:

Using Phoenix for that is like trying to kill a mosquito with an atomic
bomb, no? ;)

Few easy to install and use tools which I already tried:
- http://sourceforge.net/projects/haredbhbaseclie/files/
- http://sourceforge.net/projects/hbasemanagergui/
- https://github.com/NiceSystems/hrider/wiki

There might be others, but those at least do the basic things you need to
look into your tables.

JM

2013/5/21 lars hofhansl la...@apache.org


Maybe Phoenix (http://phoenix-hbase.blogspot.com/) is what you are
looking for.

-- Lars


From: Aji Janis aji1...@gmail.com
To: user user@hbase.apache.org
Sent: Tuesday, May 21, 2013 3:43 PM
Subject: Re: querying hbase


I haven't tried that because I don't know how to. Still I think I am
looking for a nice GUI interface that can take in HBase connection info and
help me view the data something like pgadmin (or its php version), sql
developer, etc


On Tue, May 21, 2013 at 6:16 PM, Viral Bajaria viral.baja...@gmail.com

wrote:
The shell allows you to use filters just like the standard HBase API but
with jruby syntax. Have you tried that or that is too painful and you

want

a simpler tool ?

-Viral

On Tue, May 21, 2013 at 2:58 PM, Aji Janis aji1...@gmail.com wrote:


are there any tools out there that can help in visualizing data stored

in

Hbase? I know the shell lets you do basic stuff. But if I don't know

what

rowid I am looking for or if I want to find rows with family say *name* (yes

SQL

like) are there any tools that can help with this? Not trying to use

this

on production (although that would be nice) just dev env for now. Thank

you

for any suggestions





Re: querying hbase

2013-05-22 Thread James Taylor

Hi Aji,
With Phoenix, you pass through the client port in your connection 
string, so this would not be an issue. If you're familiar with SQL 
Developer, then Phoenix supports something similar with SQuirrel: 
https://github.com/forcedotcom/phoenix#sql-client

Regards,
James


On 05/22/2013 07:42 AM, Aji Janis wrote:

These tools seem just like what I want! Thank you.
I am trying to play with it now but looks like in our Hbase
configuration HBASE_MANAGES_ZK is set to false in hbase-env and
hbase.zookeeper.property.clientPort is not set in hbase-site, and
therefore I can't use hbasemanager or hrider. I am new to HBase - can
anyone explain to me why these properties may not be set? Should we
have them set? What are the steps to set them - i.e. restart what things in
what order?

Thank you again for the feedback!




Re: [ANNOUNCE] Phoenix 1.2 is now available

2013-05-20 Thread James Taylor
Our coprocessors are all in com.salesforce.phoenix.coprocessor. The 
particular one that handles TopN is ScanRegionObserver.
The expression evaluation classes are in 
com.salesforce.phoenix.expression, with a base interface of Expression.
The type system is in com.salesforce.phoenix.schema. Take a look at 
PDataType - that's where the SQL types are defined.
The ORDER BY evaluation is mostly handled by the expression classes, but 
there's a little bit more in OrderByExpression.
Besides the expression evaluation, a lot of the runtime is handled 
through nested iterators. Take a look at com.salesforce.phoenix.iterate 
and in particular for TopN the OrderedResultIterator class. Also, if you 
want to see how we nest the iterators, take a look at the implementors 
of com.salesforce.phoenix.execute.QueryPlan - ScanPlan and AggregatePlan.


Might be useful to build the javadocs too - that'll give you a bit more 
detail.


Regards,
James

On 05/20/2013 04:07 AM, Azuryy Yu wrote:

Why off-list? It would be better to share here.

--Send from my Sony mobile.
On May 18, 2013 12:14 AM, James Taylor jtay...@salesforce.com wrote:


Anil,
Yes, everything is in the Phoenix GitHub repo. Will give you more detail
of specific packages and classes off-list.
Thanks,

James

On 05/16/2013 05:33 PM, anil gupta wrote:


Hi James,

Is this implementation present in the GitHub repo of Phoenix? If yes, can
you provide me the package name/classes?
I haven't got the opportunity to try out Phoenix yet but I would like to
have a look at the implementation.

Thanks,
Anil Gupta




On Thu, May 16, 2013 at 4:15 PM, James Taylor jtay...@salesforce.com

wrote:

  Hi Anil,

No HBase changes were required. We're already leveraging coprocessors in
HBase which is a key enabler. The other pieces needed are:
- a type system
- a means to evaluate an ORDER BY expression on the server
- memory tracking/throttling (the topN for each region are held in memory
until the client does a merge sort)
Phoenix has all these, so it was just a matter of packaging them up to
support this.

Thanks,
James


On 05/16/2013 02:02 PM, anil gupta wrote:

  Hi James,

You have mentioned support for TopN query. Can you provide me HBase Jira
ticket for that. I am also doing similar stuff in
https://issues.apache.org/jira/browse/HBASE-7474.
I am interested in
knowing the details about that implementation.

Thanks,
Anil Gupta


On Thu, May 16, 2013 at 12:29 PM, James Taylor jtay...@salesforce.com


wrote:


   We are pleased to announce the immediate availability of Phoenix 1.2 (


https://github.com/forcedotcom/phoenix/wiki/Download

).


Here are some of the release highlights:

* Improve performance of multi-point and multi-range queries (20x plus)
using new skip scan
* Support TopN queries (3-70x faster than Hive)
* Control row key order when defining primary key columns
* Salt tables declaratively to prevent hot spotting
* Specify columns dynamically at query time
* Write Phoenix-compliant HFiles from Pig scripts and Map/Reduce jobs
* Support SELECT DISTINCT
* Leverage essential column family feature
* Bundle command line terminal interface
* Specify scale and precision on decimal type
* Support fixed length binary type
* Add TO_CHAR, TO_NUMBER, COALESCE, UPPER, LOWER, and REVERSE built-in
functions

HBase 0.94.4 or above is required with HBase 0.94.7 being recommended.
For
more detail, please see our announcement:
http://phoenix-hbase.blogspot.com/2013/05/announcing-phoenix-12.html
Regards,

James
@JamesPlusPlus
http://phoenix-hbase.blogspot.com/






Re: Some Hbase questions

2013-05-19 Thread James Taylor

Hi Vivek,
Take a look at the SQL skin for HBase called Phoenix 
(https://github.com/forcedotcom/phoenix). Instead of using the native 
HBase client, you use regular JDBC and Phoenix takes care of making the 
native HBase calls for you.


We support composite row keys, so you could form your row key like this:
CREATE TABLE TimeSeries (
host VARCHAR NOT NULL,
date DATE NOT NULL,
value1 BIGINT,
value2 DECIMAL(10,4)
CONSTRAINT pk PRIMARY KEY (host, date)); // composite row key of 
host + date


Then to do aggregate queries, you use our built in AVG, SUM, COUNT, MIN, 
MAX:

SELECT AVG(value1), SUM(value2) * 123.45 / 678.9 FROM TimeSeries
WHERE host IN ('host1','host2')
GROUP BY TRUNC(date, 'DAY')  // group into day sized buckets

For debugging, you can use either the terminal command line we bundle: 
https://github.com/forcedotcom/phoenix#command-line
or you can install a SQL client like SQuirrel: 
https://github.com/forcedotcom/phoenix#sql-client

You'll see your integer, date, and decimal types as you'd expect.

We have integration with map/reduce and Pig, so you could use those 
tools in conjunction with Phoenix.


We also support TopN queries, select distinct, transparent salting for 
when your row key leads with a monotonically increasing value like time, 
and our performance 
(https://github.com/forcedotcom/phoenix/wiki/Performance) can't be beat. 
See our recent announcement for more detail: 
http://phoenix-hbase.blogspot.com/2013/05/announcing-phoenix-12.html


HTH.

Regards,

James
@JamesPlusPlus


On 05/19/2013 08:41 AM, Ted Yu wrote:

For #b, take a look
at 
src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java
in 0.94
It supports avg, max, min and sum operations through calling coprocessors.
Here is snippet from its javadoc:

  * This client class is for invoking the aggregate functions deployed on the
  * Region Server side via the AggregateProtocol. This class will implement
the
  * supporting functionality for summing/processing the individual results
  * obtained from the AggregateProtocol for each region.
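
As a hedged usage sketch (0.94 APIs; it assumes the AggregateImplementation
coprocessor is enabled for the table and that the column holds 8-byte longs,
which is what LongColumnInterpreter expects):

Configuration conf = HBaseConfiguration.create();
AggregationClient aggregationClient = new AggregationClient(conf);
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"));  // column to aggregate
// avg(...) declares 'throws Throwable'; sum/min/max work the same way
double avg = aggregationClient.avg(Bytes.toBytes("my_table"),
    new LongColumnInterpreter(), scan);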

For #c, running HBase and MR on the same cluster is acceptable. If you have
additional hardware, you can run mapreduce jobs on separate machines where
region server is not running.

Cheers

On Sun, May 19, 2013 at 8:29 AM, Vivek Padmanabhan
vpadmanab...@aryaka.com wrote:


Hi,
   I am pretty new to HBase so it would be great if someone could help me
out with my below queries;

(Ours is a time series data and all the queries will be range scan on
  composite row keys)

a) What is the usual practice of storing data types?

We have noticed that converting datatypes to bytes renders unreadable
data while debugging.
For ids or int values we see the byte representation. So for some
important columns
we converted datatype -> characters -> bytes, rather than datatype
-> bytes.
(Maybe we can write a wrapper over the hbase shell to solve this. But is
there a simpler way?)


b) What is the best way to achieve operations like AVG,SUM or some custom
formula for real time queries. Coprocessors or in-memory with query result?
(The formula that we apply might get changed at any time so storing
result is not an option)


c) We are planning to start off with a four node cluster, having both
HBase and MR jobs running.
I have heard that it is not recommended to have both HBase and MR on
the same cluster, but I would
like to understand what could be the possible bottlenecks.

   (We plan to run MR on HDFS and MR on Hbase. Most of our MR jobs are IO
bound rather than CPU bound)


Thanks
Vivek





Re: [ANNOUNCE] Phoenix 1.2 is now available

2013-05-17 Thread James Taylor

Anil,
Yes, everything is in the Phoenix GitHub repo. Will give you more detail 
of specific packages and classes off-list.

Thanks,

James

On 05/16/2013 05:33 PM, anil gupta wrote:

Hi James,

Is this implementation present in the GitHub repo of Phoenix? If yes, can
you provide me the package name/classes?
I haven't got the opportunity to try out Phoenix yet but I would like to
have a look at the implementation.

Thanks,
Anil Gupta




On Thu, May 16, 2013 at 4:15 PM, James Taylor jtay...@salesforce.com wrote:


Hi Anil,
No HBase changes were required. We're already leveraging coprocessors in
HBase which is a key enabler. The other pieces needed are:
- a type system
- a means to evaluate an ORDER BY expression on the server
- memory tracking/throttling (the topN for each region are held in memory
until the client does a merge sort)
Phoenix has all these, so it was just a matter of packaging them up to
support this.

Thanks,
James


On 05/16/2013 02:02 PM, anil gupta wrote:


Hi James,

You have mentioned support for TopN query. Can you provide me HBase Jira
ticket for that. I am also doing similar stuff in
https://issues.apache.org/jira/browse/HBASE-7474.
I am interested in
knowing the details about that implementation.

Thanks,
Anil Gupta


On Thu, May 16, 2013 at 12:29 PM, James Taylor jtay...@salesforce.com

wrote:

  We are pleased to announce the immediate availability of Phoenix 1.2 (

https://github.com/forcedotcom/phoenix/wiki/Download

).

Here are some of the release highlights:

* Improve performance of multi-point and multi-range queries (20x plus)
using new skip scan
* Support TopN queries (3-70x faster than Hive)
* Control row key order when defining primary key columns
* Salt tables declaratively to prevent hot spotting
* Specify columns dynamically at query time
* Write Phoenix-compliant HFiles from Pig scripts and Map/Reduce jobs
* Support SELECT DISTINCT
* Leverage essential column family feature
* Bundle command line terminal interface
* Specify scale and precision on decimal type
* Support fixed length binary type
* Add TO_CHAR, TO_NUMBER, COALESCE, UPPER, LOWER, and REVERSE built-in
functions

HBase 0.94.4 or above is required with HBase 0.94.7 being recommended.
For
more detail, please see our announcement:
http://phoenix-hbase.blogspot.com/2013/05/announcing-phoenix-12.html
Regards,

James
@JamesPlusPlus
http://phoenix-hbase.blogspot.com/










[ANNOUNCE] Phoenix 1.2 is now available

2013-05-16 Thread James Taylor
We are pleased to announce the immediate availability of Phoenix 1.2 
(https://github.com/forcedotcom/phoenix/wiki/Download). Here are some of 
the release highlights:


* Improve performance of multi-point and multi-range queries (20x plus) 
using new skip scan

* Support TopN queries (3-70x faster than Hive)
* Control row key order when defining primary key columns
* Salt tables declaratively to prevent hot spotting
* Specify columns dynamically at query time
* Write Phoenix-compliant HFiles from Pig scripts and Map/Reduce jobs
* Support SELECT DISTINCT
* Leverage essential column family feature
* Bundle command line terminal interface
* Specify scale and precision on decimal type
* Support fixed length binary type
* Add TO_CHAR, TO_NUMBER, COALESCE, UPPER, LOWER, and REVERSE built-in 
functions


HBase 0.94.4 or above is required with HBase 0.94.7 being recommended. 
For more detail, please see our announcement: 
http://phoenix-hbase.blogspot.com/2013/05/announcing-phoenix-12.html


Regards,

James
@JamesPlusPlus
http://phoenix-hbase.blogspot.com/


Re: [ANNOUNCE] Phoenix 1.2 is now available

2013-05-16 Thread James Taylor

Hi Anil,
No HBase changes were required. We're already leveraging coprocessors in 
HBase which is a key enabler. The other pieces needed are:

- a type system
- a means to evaluate an ORDER BY expression on the server
- memory tracking/throttling (the topN for each region are held in 
memory until the client does a merge sort)
Phoenix has all these, so it was just a matter of packaging them up to 
support this.


Thanks,
James

On 05/16/2013 02:02 PM, anil gupta wrote:

Hi James,

You have mentioned support for TopN query. Can you provide me HBase Jira
ticket for that. I am also doing similar stuff in
https://issues.apache.org/jira/browse/HBASE-7474. I am interested in
knowing the details about that implementation.

Thanks,
Anil Gupta


On Thu, May 16, 2013 at 12:29 PM, James Taylor jtay...@salesforce.com wrote:


We are pleased to announce the immediate availability of Phoenix 1.2 (
https://github.com/forcedotcom/phoenix/wiki/Download).
Here are some of the release highlights:

* Improve performance of multi-point and multi-range queries (20x plus)
using new skip scan
* Support TopN queries (3-70x faster than Hive)
* Control row key order when defining primary key columns
* Salt tables declaratively to prevent hot spotting
* Specify columns dynamically at query time
* Write Phoenix-compliant HFiles from Pig scripts and Map/Reduce jobs
* Support SELECT DISTINCT
* Leverage essential column family feature
* Bundle command line terminal interface
* Specify scale and precision on decimal type
* Support fixed length binary type
* Add TO_CHAR, TO_NUMBER, COALESCE, UPPER, LOWER, and REVERSE built-in
functions

HBase 0.94.4 or above is required with HBase 0.94.7 being recommended. For
more detail, please see our announcement:
http://phoenix-hbase.blogspot.com/2013/05/announcing-phoenix-12.html

Regards,

James
@JamesPlusPlus
http://phoenix-hbase.blogspot.com/








Re: Get all rows that DON'T have certain qualifiers

2013-05-14 Thread James Taylor
Hi Amit,
Using Phoenix, the SQL skin over HBase 
(https://github.com/forcedotcom/phoenix), you'd do this:

select * from myTable where value1 is null or value2 is null

Regards,
James
http://phoenix-hbase.blogspot.com
@JamesPlusPlus

On May 14, 2013, at 6:56 AM, samar.opensource samar.opensou...@gmail.com 
wrote:

 Hi,
 
 I will try to write some sample code and execute it, but what I gather
 from the blog and the Java apidoc is that you just need to do the opposite
 of what you are doing.
 
 So use
 CompareOp.EQUAL
 (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/CompareFilter.CompareOp.html#EQUAL)
 and then put in a value that never occurs in your column (this will
 filter out all the rows for which the column qualifier exists).
 
 secondly
 
 filter1.setFilterIfMissing(false)
 since: "If false, the row will pass if the column is not found. This is
 default." (taken from the apidoc)
 
 This way you should be able to get all the rows that don't have the
 certain qualifier.
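 
 As a rough sketch of that idea (hypothetical family/qualifier names, HBase
 0.94 APIs; not code from this thread):
 
 // EQUAL against a sentinel value that never occurs, with filterIfMissing
 // left false, so rows *without* the qualifier pass through.
 SingleColumnValueFilter f = new SingleColumnValueFilter(
     Bytes.toBytes("fam1"), Bytes.toBytes("qual1"),
     CompareFilter.CompareOp.EQUAL, Bytes.toBytes("__DOES_NOT_EXIST__"));
 f.setFilterIfMissing(false);   // the default: rows missing the column pass
 Scan scan = new Scan();
 scan.setFilter(f);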
 
 Regards,
 Samar
 
 On 08/05/13 8:17 PM, Ted Yu wrote:
 I think you can implement your own filter that overrides this method:
 
   public void filterRow(List<KeyValue> ignored) throws IOException {
 When certain qualifiers don't appear in the List, you can remove all the
 kvs from the passed List.
 
 Cheers
 
 On Wed, May 8, 2013 at 7:00 AM, Amit Sela am...@infolinks.com wrote:
 
 Forgot to mention: Hadoop 1.0.4  HBase 0.94.2
 
 
 On Wed, May 8, 2013 at 4:52 PM, Amit Sela am...@infolinks.com wrote:
 
 Hi all,
 
 I'm trying to scan my HBase table to get only rows that are missing some
 qualifiers.
 
 I read that for getting rows with specific qualifiers I should use
 something like:
 
 List<Filter> list = new ArrayList<Filter>(2);
 Filter filter1 = new SingleColumnValueFilter(Bytes.toBytes(fam1),
  Bytes.toBytes(VALUE1), CompareOp.NOT_EQUAL,
 Bytes.toBytes(DOESNOTEXIST));
 filter1.setFilterIfMissing(true);
 list.addFilter(filter1);
 Filter filter2 = new SingleColumnValueFilter(Bytes.toBytes(fam2),
  Bytes.toBytes(VALUE2), CompareOp.NOT_EQUAL,
 Bytes.toBytes(DOESNOTEXIST));
 filter2.setFilterIfMissing(true);
 list.addFilter(filter2);
 FilterList filterList = new FilterList(list);
 Scan scan = new Scan();
 scan.setFilter(filterList);
 
 (I found this here:
 
 http://mapredit.blogspot.co.il/2012/05/using-filters-in-hbase-to-match-two.html
 )
 And it works just fine.
 
 So as I thought that if I use SkipFilter(FilterList) I'll skip the rows
 returned by the filter list  causing a sort of NOT and getting all rows
 that don't have any of theses qualifiers.
 
 This doesn't seem to work... Anyone has a good suggestion how to get rows
 that are missing specific qualifiers ? Any idea why SkipFilter fails ?
 
 Thanks,
 
 Amit
 
 


Re: Coprocessors

2013-05-01 Thread James Taylor

Sudarshan,
Below are the results that Mujtaba put together. He put together two
version of your schema: one with the ATTRIBID as part of the row key
and one with it as a key value. He also benchmarked the query time both
when all of the data was in the cache versus when all of the data was
read off of disk.

Let us know if you have any questions/follow up.

Thanks,

James ( Mujtaba)

Compute Average over 250K random rows in 1B row table

ATTRIBID in row key
                     Data from HBase cache   Data loaded from disk
Phoenix Skip Scan    1.4 sec                 31 sec
HBase Batched Gets   3.8 sec                 58 sec
HBase Range Scan     -                       10+ min

ATTRIBID as key value
                     Data from HBase cache   Data loaded from disk
Phoenix Skip Scan    1.7 sec                 37 sec
HBase Batched Gets   4.0 sec                 82 sec
HBase Range Scan     -                       10+ min

Details
---
HBase 0.94.7, Hadoop 1.0.4
Total number of regions: 30 spread on 4 Region Servers (6 core W3680 Xeon 
3.3GHz) with 8GB heap.

Data:
20 FIELDTYPE, 50M OBJECTID for each FIELDTYPE, 10 ATTRIBID. VAL is random 
integer.

Query:
SELECT AVG(VAL) FROM T1
WHERE OBJECTID IN (250K RANDOM OBJECTIDs) AND FIELDTYPE = 'F1' AND ATTRIBID = 
'1'

Create table DDL:

1. CREATE TABLE IF NOT EXISTS T1 (
   OBJECTID INTEGER NOT NULL,
   FIELDTYPE CHAR(2) NOT NULL,
   ATTRIBID INTEGER NOT NULL,
   CF.VAL INTEGER
   CONSTRAINT PK PRIMARY KEY (OBJECTID,FIELDTYPE,ATTRIBID))
   COMPRESSION='GZ', BLOCKSIZE='4096'

2. CREATE TABLE IF NOT EXISTS T2 (
   OBJECTID INTEGER NOT NULL,
   FIELDTYPE CHAR(2) NOT NULL,
   CF.ATTRIBID INTEGER,
   CF.VAL INTEGER
   CONSTRAINT PK PRIMARY KEY (OBJECTID,FIELDTYPE))
   COMPRESSION='GZ', BLOCKSIZE='4096'

On 04/25/2013 04:19 PM, Sudarshan Kadambi (BLOOMBERG/ 731 LEXIN) wrote:


James: First of all, this looks quite promising.

The table schema outlined in your other message is correct except that 
attrib_id will not be in the primary key. Will that be a problem with respect 
to the skip-scan filter's performance? (it doesn't seem like it...)

Could you share any sort of benchmark numbers? I want to try this out right 
away, but I've to wait for my cluster administrator to upgrade us from HBase 
0.92 first!

- Original Message -
From: user@hbase.apache.org
To: user@hbase.apache.org
At: Apr 25 2013 18:45:14

On 04/25/2013 03:35 PM, Gary Helmling wrote:

I'm looking to write a service that runs alongside the region servers and
acts a proxy b/w my application and the region servers.

I plan to use the logic in HBase client's HConnectionManager, to segment
my request of 1M rowkeys into sub-requests per region-server. These are
sent over to the proxy which fetches the data from the region server,
aggregates locally and sends data back. Does this sound reasonable or even
a useful thing to pursue?



This is essentially what coprocessor endpoints (called through
HTable.coprocessorExec()) basically do.  (One difference is that there is a
parallel request per-region, not per-region server, though that is a
potential optimization that could be made as well).

The tricky part I see for the case you describe is splitting your full set
of row keys up correctly per region.  You could send the full set of row
keys to each endpoint invocation, and have the endpoint implementation
filter down to only those keys present in the current region.  But that
would be a lot of overhead on the request side.  You could split the row
keys into per-region sets on the client side, but I'm not sure we provide
sufficient context for the Batch.Callable instance you provide to
coprocessorExec() to determine which region it is being invoked against.

Sudarshan,
In our head branch of Phoenix (we're targeting this for a 1.2 release in
two weeks), we've implemented a skip scan filter that functions similar
to a batched get, except:
1) it's more flexible in that it can jump not only from a single key to
another single key, but also from range to range
2) it's faster, about 3-4x.
3) you can use it in combination with aggregation, since it's a filter

The scan is chunked up by region and only the keys in each region are
sent, along the lines as you and Gary have described. Then the results
are merged together by the client automatically.

How would you decompose your row key into columns? Is there a time
component? Let me walk you through an example where you might have a
LONG id value plus perhaps a timestamp (it work equally well if you only
had a single column in your PK). If you provide a bit more info on your
use case, I can tailor it more exactly.

Create a schema:
  CREATE TABLE t (key BIGINT NOT NULL, ts DATE NOT NULL, data VARCHAR
CONSTRAINT pk PRIMARY KEY (key, ts));

Populate your data using our UPSERT 

Re: HBase and Datawarehouse

2013-04-30 Thread James Taylor
Phoenix will succeed if HBase succeeds. Phoenix just makes it easier to 
drive HBase to its maximum capability. IMHO, if HBase is to make 
further gains in the OLAP space, scans need to be faster and new, more 
compressed columnar-store type block formats need to be developed.


Running inside HBase is what gives Phoenix most of its performance 
advantage. Have you seen our numbers against Impala: 
https://github.com/forcedotcom/phoenix/wiki/Performance? Drill will need 
something to efficiently execute a query plan against HBase and Phoenix 
is a good fit here.


Thanks,

James

On 04/29/2013 10:54 PM, Asaf Mesika wrote:

I think for Phoenix truly to succeed, it needs HBase to break the JVM
heap barrier of 12G, as I saw mentioned in a couple of posts. Since lots of
analytics queries utilize memory, and that memory is shared with
HBase, there's only so much you can do on a 12GB heap. On the other hand, if
Phoenix was implemented outside HBase on the same machine (like Drill or
Impala is doing), you could have 60GB for this process, running many OLAP
queries in parallel, utilizing the same data set.



On Mon, Apr 29, 2013 at 9:08 PM, Andrew Purtell apurt...@apache.org wrote:


HBase is not really intended for heavy data crunching

Yes it is. This is why we have first class MapReduce integration and
optimized scanners.

Recent versions, like 0.94, also do pretty well with the 'O' part of OLAP.

Urban Airship's Datacube is an example of a successful OLAP project
implemented on HBase: http://github.com/urbanairship/datacube

Urban Airship uses the datacube project to support its analytics stack for
mobile apps. We handle about ~10K events per second per node.


Also there is Adobe's SaasBase:
http://www.slideshare.net/clehene/hbase-and-hadoop-at-adobe

Etc.

Where an HBase OLAP application will differ tremendously from a traditional
data warehouse is of course in the interface to the datastore. You have to
design and speak in the language of the HBase API, though Phoenix (
https://github.com/forcedotcom/phoenix) is changing that.


On Sun, Apr 28, 2013 at 10:21 PM, anil gupta anilgupt...@gmail.com
wrote:


Hi Kiran,

In HBase the data is denormalized, but at its core HBase is a KeyValue-based
database meant for lookups or queries that expect a response in
milliseconds.

OLAP i.e. data warehouse usually involves heavy data crunching. HBase is
not really intended for heavy data crunching. If you want to just store
denoramlized data and do simple queries then HBase is good. For OLAP kind
of stuff, you can make HBase work but IMO you will be better off using

Hive

for  data warehousing.

HTH,
Anil Gupta


On Sun, Apr 28, 2013 at 8:39 PM, Kiran kiranvk2...@gmail.com wrote:


But in HBase data can be said to be in a denormalised state, as the
methodology used for storage is a (column family:column) based flexible
schema. Also, from Google's Bigtable paper it is evident that HBase is
capable of doing OLAP. So where does the difference lie?



--
View this message in context:


http://apache-hbase.679495.n3.nabble.com/HBase-and-Datawarehouse-tp4043172p4043216.html

Sent from the HBase User mailing list archive at Nabble.com.


--
Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)





Re: Read access pattern

2013-04-30 Thread James Taylor
bq. The downside that I see, is the bucket_number that we have to 
maintain both at time of reading/writing and update it in case of 
cluster restructuring.


I agree that this maintenance can be painful. However, Phoenix 
(https://github.com/forcedotcom/phoenix) now supports salting, 
automating this maintenance.  If you want to salt your table, just add a 
SALT_BUCKETS = n property at the end of your DDL statement, where n 
is the total number of buckets (up to a max of 256).  For example:


CREATE TABLE t (date_time DATE NOT NULL, event_id CHAR(15) NOT NULL
CONSTRAINT pk PRIMARY KEY (date_time, event_id))
SALT_BUCKETS=10;

This will add one byte at the beginning of your row key whose value is 
formed by hashing the row key and mod-ing with 10. This will 
automatically be done for any upsert and queries will automatically be 
distributed and the results combined as expected.
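
For intuition only, the salt byte amounts to something like the following
hypothetical sketch (not Phoenix's exact hash function):

// Prepend a bucket byte derived from a hash of the original row key.
static byte[] salt(byte[] rowKey, int saltBuckets) {
  int hash = java.util.Arrays.hashCode(rowKey) & Integer.MAX_VALUE;  // stand-in hash
  byte bucket = (byte) (hash % saltBuckets);
  byte[] salted = new byte[rowKey.length + 1];
  salted[0] = bucket;
  System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
  return salted;
}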


Thanks,

James
@JamesPlusPlus
http://phoenix-hbase.blogspot.com/

On 04/30/2013 09:17 AM, Shahab Yunus wrote:

Well those are *some* words :) Anyway, can you explain in a bit more detail
why you feel so strongly about this design/approach? The salting here is
not the only option mentioned and static hashing can be used as well. Plus
even in case of salting, wouldn't the distributed scan take care of it? The
downside that I see, is the bucket_number that we have to maintain both at
time of reading/writing and update it in case of cluster restructuring.

Thanks,
Shahab


On Tue, Apr 30, 2013 at 11:57 AM, Michael Segel
michael_se...@hotmail.com wrote:


Geez that's a bad article.
Never salt.

And yes there's a difference between using a salt and using the first 2-4
bytes from your MD5 hash.

(Hint: Salts are random. Your hash isn't. )

Sorry to be-itch but it's a bad idea and it shouldn't be propagated.

On Apr 29, 2013, at 10:17 AM, Shahab Yunus shahab.yu...@gmail.com wrote:


I think you cannot use the scanner simply to do a range scan here as your
keys are not monotonically increasing. You need to apply logic to
decode/reverse your mechanism that you have used to hash your keys at the
time of writing. You might want to check out the SemaText library which
does distributed scans and seem to handle the scenarios that you want to
implement.


http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/


On Mon, Apr 29, 2013 at 11:03 AM, ri...@laposte.net wrote:


Hi,

I have a rowkey defined by :
getMD5AsHex(Bytes.toBytes(myObjectId)) + String.format("%19d\n",
(Long.MAX_VALUE - changeDate.getTime()));

How could I get the previous and next row for a given rowkey ?
For instance, I have the following ordered keys :

3db1b6c1e7e7d2ece41ff2184f76*9223370673172227807
3db1b6c1e7e7d2ece41ff2184f76*9223370674468022807

3db1b6c1e7e7d2ece41ff2184f76*9223370674468862807

3db1b6c1e7e7d2ece41ff2184f76*9223370674984237807
3db1b6c1e7e7d2ece41ff2184f76*9223370674987271807

If I choose the rowkey :
3db1b6c1e7e7d2ece41ff2184f76*9223370674468862807, what would be the
correct scan to get the previous and next key ?
Result would be :
3db1b6c1e7e7d2ece41ff2184f76*9223370674468022807
3db1b6c1e7e7d2ece41ff2184f76*9223370674984237807

Thank you !
R.







