Re: Very poor read performance with composite keys in hbase

2013-04-30 Thread James Taylor
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? It'll use all of the parts of your row key and, depending on how much data you're returning to the client, will query over 10 million rows in seconds. James @JamesPlusPlus http://phoenix-hbase.blogspot.com On Apr 30,

Re: Coprocessors

2013-04-25 Thread James Taylor
On 04/25/2013 03:35 PM, Gary Helmling wrote: I'm looking to write a service that runs alongside the region servers and acts as a proxy b/w my application and the region servers. I plan to use the logic in HBase client's HConnectionManager to segment my request of 1M rowkeys into sub-requests per

Re: Coprocessors

2013-04-25 Thread James Taylor
Thanks for the additional info, Sudarshan. This would fit well with the implementation of Phoenix's skip scan. CREATE TABLE t ( object_id INTEGER NOT NULL, field_type INTEGER NOT NULL, attrib_id INTEGER NOT NULL, value BIGINT CONSTRAINT pk PRIMARY KEY (object_id, field_type,

Re: Coprocessors

2013-04-25 Thread James Taylor
Our performance engineer, Mujtaba Chohan, has agreed to put together a benchmark for you. We only have a four-node cluster of pretty average boxes, but it should give you an idea. No performance impact for the attrib_id not being part of the PK since you're not filtering on them (if I

Re: Speeding up the row count

2013-04-19 Thread James Taylor
Phoenix will parallelize within a region: SELECT count(1) FROM orders I agree with Ted, though: even serially, 100,000 rows shouldn't take anywhere near 6 mins. You say 100,000 rows. Can you tell us what it's ? Thanks, James On Apr 19, 2013, at 2:37 AM, Ted Yu yuzhih...@gmail.com wrote:

Re: Essential column family performance

2013-04-08 Thread James Taylor
. does your filter utilize hint? It would be easier for me and other people to reproduce the issue you experienced if you put your scenario in some test similar to TestJoinedScanners. Will take a closer look at the code Monday. Cheers On Sun, Apr 7, 2013 at 11:37 AM, James Taylor jtay

Re: Best way to query multiple sets of rows

2013-04-08 Thread James Taylor
Hi Greame, Are you familiar with Phoenix (https://github.com/forcedotcom/phoenix), a SQL skin over HBase? We've just introduced a new feature (still in the master branch) that'll do what you're looking for: transparently doing a skip scan over the chunks of your HBase data based on your SQL

Re: Essential column family performance

2013-04-08 Thread James Taylor
would be larger lazy CFs and/or a low percentage of values selected. Can you try to increase the 2nd CF values' size and rerun the test? On Mon, Apr 8, 2013 at 10:38 AM, James Taylor jtay...@salesforce.com wrote: In TestJoinedScanners.java, is the 40% randomly distributed or sequential? In our

Essential column family performance

2013-04-07 Thread James Taylor
Hello, We're doing some performance testing of the essential column family feature, and we're seeing some performance degradation when comparing with and without the feature enabled: performance of the scan, relative to not enabling the feature, by % of rows selected

Re: Essential column family performance

2013-04-07 Thread James Taylor
Max Lapan tried to address has the non-essential column family carrying considerably more data compared to the essential column family. Cheers On Sat, Apr 6, 2013 at 11:05 PM, James Taylor jtay...@salesforce.com wrote: Hello, We're doing some performance testing of the essential column family feature

Re: HBase Types: Explicit Null Support

2013-04-01 Thread James Taylor
From the SQL perspective, handling null is important. Phoenix supports null in the following way: - the absence of a key value - an empty value in a key value - an empty value in a multi part row key - for variable length types (VARCHAR and DECIMAL) a null byte separator would be used if not
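The null-byte-separator approach listed above can be illustrated with a minimal pure-Java sketch. The class and method names here are hypothetical, and Phoenix's real serialization differs in detail; the point is only that a trailing separator after each variable-length part lets a null occupy a position in a multi-part key as an empty slot:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class NullSeparatorKey {
    // Encode variable-length string parts into one row key, writing a 0x00
    // separator after each part. A null part contributes no bytes of its
    // own, so it shows up as an empty slot between separators.
    static byte[] encode(String... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String part : parts) {
            if (part != null) {
                byte[] b = part.getBytes(StandardCharsets.UTF_8);
                out.write(b, 0, b.length);
            }
            out.write(0); // separator; two adjacent separators mean NULL
        }
        return out.toByteArray();
    }
}
```

For example, `encode("a", null, "c")` yields the 5 bytes `'a', 0, 0, 'c', 0`, while `encode("a", "b", "c")` yields 6 bytes: the null middle part collapses to just its separator, yet keys still compare correctly part by part.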

Re: HBase Types: Explicit Null Support

2013-04-01 Thread James Taylor
On 04/01/2013 04:41 PM, Nick Dimiduk wrote: On Mon, Apr 1, 2013 at 4:31 PM, James Taylor jtay...@salesforce.com wrote: From the SQL perspective, handling null is important. From your perspective, it is critical to support NULLs, even at the expense of fixed-width encodings at all

Re: Understanding scan behaviour

2013-03-29 Thread James Taylor
Mohith, Are you wanting to reduce the amount of data you're scanning and bring down your query time when: - you have a row key has a multi-part row key of a string and time value and - you know the prefix of the string and a range of the time value? That's possible (but not easy) to do with
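The prefix-plus-time-range scan being discussed comes down to building lexicographically ordered start and stop row keys. A minimal pure-Java sketch, assuming a fixed-length string prefix followed by an 8-byte big-endian time (all names here are illustrative, not from any real schema):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RangeKeys {
    // Build a composite row key of (string prefix, 8-byte big-endian time).
    // Big-endian longs compare correctly as unsigned bytes for non-negative
    // values, which epoch-millisecond timestamps are.
    static byte[] key(String prefix, long time) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(p.length + 8).put(p).putLong(time).array();
    }

    public static void main(String[] args) {
        byte[] start = key("sensor42", 1000L);     // inclusive lower bound
        byte[] stop  = key("sensor42", 2000L + 1); // stop row is exclusive
        // With an HBase Scan you would then set these as the scan bounds.
    }
}
```

Because the prefix is fixed-length, every key for the same prefix sorts purely by time, so the scan touches only the requested range rather than the whole table.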

Re: HBase Client.

2013-03-20 Thread James Taylor
Another one to add to your list: 6. Phoenix (https://github.com/forcedotcom/phoenix) Thanks, James On Mar 20, 2013, at 2:50 AM, Vivek Mishra vivek.mis...@impetus.co.in wrote: I have used Kundera, persistence overhead on HBase API is minimal considering feature set available for use within

Re: HBase type support

2013-03-15 Thread James Taylor
Hi Nick, What do you mean by hashing algorithms? Thanks, James On 03/15/2013 10:11 AM, Nick Dimiduk wrote: Hi David, Native support for a handful of hashing algorithms has also been discussed. Do you think these should be supported directly, as opposed to using a fixed-length String or

Re: Rowkey design and presplit table

2013-03-07 Thread James Taylor
Another possible solution for you: use Phoenix: https://github.com/forcedotcom/phoenix Phoenix would allow you to model your scenario using SQL through JDBC, like this: Connection conn = DriverManager.connect(jdbc:phoenix:your zookeeper quorum); Statement stmt = conn.createStatement(

Re: endpoint coprocessor performance

2013-03-04 Thread James Taylor
Check your logs for whether your end-point coprocessor is hitting zookeeper on every invocation to figure out the region start key. Unfortunately (at least last time I checked), the default way of invoking an end point coprocessor doesn't use the meta cache. You can go through a combination of

Re: Announcing Phoenix v 1.1: Support for HBase v 0.94.4 and above

2013-02-26 Thread James Taylor
, Ted Yu yuzhih...@gmail.com wrote: I ran test suite and they passed: Tests run: 452, Failures: 0, Errors: 0, Skipped: 0 [INFO] [INFO] BUILD SUCCESS Good job. On Mon, Feb 25, 2013 at 9:35 AM, James Taylor jtay

Re: Announcing Phoenix v 1.1: Support for HBase v 0.94.4 and above

2013-02-26 Thread James Taylor
., but it illustrates the idea. On 02/26/2013 09:59 AM, Ted Yu wrote: In the first graph on the performance page, what does 'key filter' represent ? Thanks On Tue, Feb 26, 2013 at 9:53 AM, James Taylor jtay...@salesforce.comwrote: Both Phoenix and Impala provide SQL as a way to get at your data. Here

Re: Announcing Phoenix v 1.1: Support for HBase v 0.94.4 and above

2013-02-26 Thread James Taylor
You can query existing tables if the data is serialized in the way that Phoenix expects. For more detailed information and options, check out my response to this issue: https://github.com/forcedotcom/phoenix/issues/30 and check out our Data Type language reference here:

Announcing Phoenix v 1.1: Support for HBase v 0.94.4 and above

2013-02-25 Thread James Taylor
We are pleased to announce the immediate availability of Phoenix v 1.1, with support for HBase v 0.94.4 and above. Phoenix is a SQL layer on top of HBase. For details, see our announcement here: http://phoenix-hbase.blogspot.com/2013/02/annoucing-phoenix-v-11-support-for.html Thanks, James

Re: attributes - basic question

2013-02-22 Thread James Taylor
Same with us on Phoenix - we use setAttribute on the client side and getAttribute on the server side to pick up state on the Scan being executed. Works great. One thing to keep in mind, though: for a region observer coprocessor, the state you set on the client side will be sent to each

availability of 0.94.4 and 0.94.5 in maven repo?

2013-02-19 Thread James Taylor
Unless I'm doing something wrong, it looks like the Maven repository (http://mvnrepository.com/artifact/org.apache.hbase/hbase) only contains HBase up to 0.94.3. Is there a different repo I should use, or if not, any ETA on when it'll be updated? James

Re: Row Key Design in time-based application

2013-02-17 Thread James Taylor
Hello, Have you considered using Phoenix (https://github.com/forcedotcom/phoenix) for this use case? Phoenix is a SQL layer on top of HBase. For this use case, you'd connect to your cluster like this: Class.forName(com.salesforce.phoenix.jdbc.PhoenixDriver); // register driver Connection

Re: Row Key Design in time-based application

2013-02-17 Thread James Taylor
hotspotting when using time as the key. Or the problem with always adding data to the right of the last row. The same would apply with the project id, assuming that it too is a number that grows incrementally with each project. On Feb 17, 2013, at 4:50 PM, James Taylor jtay...@salesforce.com wrote
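The hotspotting concern in this thread (monotonically increasing keys always writing to the last region) is commonly mitigated by salting the key with a deterministic bucket byte. A minimal pure-Java sketch with illustrative names; this is not Phoenix's or HBase's actual salting code, and reads against salted data must fan out one scan per bucket:

```java
public class SaltedKey {
    // Prepend a single salt byte, derived deterministically from the key
    // bytes, so sequential (time- or id-ordered) writes spread across
    // N buckets instead of all landing on the last region.
    static byte[] salt(byte[] key, int buckets) {
        int h = 17;
        for (byte b : key) h = h * 31 + b;       // simple rolling hash
        byte[] salted = new byte[key.length + 1];
        salted[0] = (byte) Math.floorMod(h, buckets);
        System.arraycopy(key, 0, salted, 1, key.length);
        return salted;
    }
}
```

Because the salt is a pure function of the key, point lookups still work: recompute the salt from the unsalted key and prepend it before the Get.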

Re: Custom preCompact RegionObserver crashes entire cluster on OOME: Heap Space

2013-02-12 Thread James Taylor
IMO, I don't think it's safe to change the KV in-place. We always create a new KV in our coprocessors. James On Feb 12, 2013, at 6:41 AM, Mesika, Asaf asaf.mes...@gmail.com wrote: I'm seeing a very strange behavior: If I run a scan during major compaction, I can see both the modified Delta

jarFilePath for HTableDescriptor.addCoprocessor() with 0.94.2 vs 0.94.4

2013-02-11 Thread James Taylor
In 0.94.2, if the coprocessor class was on the HBase classpath, then the jarFilePath argument to HTableDescriptor.addCoprocessor seemed to essentially be ignored - it didn't matter if the jar could be found or not. In 0.94.4 we're getting an error if this is the case. Is there a way to

Re: independent scans to same region processed serially

2013-02-10 Thread James Taylor
Filed https://issues.apache.org/jira/browse/HBASE-7805. Test case attached. It occurs only if the table has a region observer coprocessor. James On 02/09/2013 11:04 AM, lars hofhansl wrote: If I execute in parallel multiple scans to different parts of the same region, they appear to be

Re: independent scans to same region processed serially

2013-02-09 Thread James Taylor
- Original Message - From: James Taylor jtay...@salesforce.com To: user@hbase.apache.org user@hbase.apache.org; lars hofhansl la...@apache.org Cc: Sent: Friday, February 8, 2013 9:52 PM Subject: Re: independent scans to same region processed serially All data is in the blockcache

independent scans to same region processed serially

2013-02-08 Thread James Taylor
Wanted to check with folks and see if they've seen an issue around this before digging in deeper. I'm on 0.94.2. If I execute in parallel multiple scans to different parts of the same region, they appear to be processed serially. It's actually faster from the client side to execute a single

Re: independent scans to same region processed serially

2013-02-08 Thread James Taylor
(https://issues.apache.org/jira/browse/HBASE-7336). Fixed in 0.94.4. I assume you have enough handlers, etc. (i.e. does the same happen if you issue multiple scan requests across different regions of the same region server?) -- Lars From: James Taylor jtay

Re: How would you model this in Hbase?

2013-02-06 Thread James Taylor
Another approach would be to use Phoenix (http://github.com/forcedotcom/phoenix). You can model your schema as you would in the relational world, but you get the horizontal scalability of HBase. James On 02/06/2013 01:49 PM, Michael Segel wrote: Overloading the time stamp aka the

Re: Announcing Phoenix: A SQL layer over HBase

2013-02-01 Thread James Taylor
...@mapbased.com wrote: Great tool, I will try it later. Thanks for sharing! 2013/1/31 Devaraj Das d...@hortonworks.com Congratulations, James. We will surely benefit from this tool. On Wed, Jan 30, 2013 at 1:04 PM, James Taylor jtay...@salesforce.com wrote: We are pleased to announce the immediate

Re: Parallel scan in HBase

2013-02-01 Thread James Taylor
If you run a SQL query that does aggregation (i.e. uses a built-in aggregation function like COUNT or does a GROUP BY), Phoenix will orchestrate the running of a set of queries in parallel, segmented along your row key (driven by the start/stop key plus region boundaries). We take advantage of

Announcing Phoenix: A SQL layer over HBase

2013-01-30 Thread James Taylor
roadmap: https://github.com/forcedotcom/phoenix/wiki#wiki-roadmap We welcome feedback and contributions from the community to Phoenix and look forward to working together. Regards, James Taylor @JamesPlusPlus

Re: HBase aggregate query

2012-09-13 Thread James Taylor
No, there's no sorted dimension. This would be a full table scan over 40M rows. This assumes the following: 1) your regions are evenly distributed across a four node cluster 2) unique combinations of month * scene are small enough to fit into memory 3) you chunk it up on the client side and run

Re: HBase aggregate query

2012-09-11 Thread James Taylor
iwannaplay games funnlearnforkids@... writes: Hi, I want to run a query like select month(eventdate),scene,count(1),sum(timespent) from eventlog group by month(eventdate),scene in HBase. Through Hive it's taking a lot of time for 40 million records. Do we have any syntax in HBase to find

Re: aggregation performance

2012-05-03 Thread James Taylor
We've seen reasonable performance, with the caveat that you need to parallelize the scan doing the aggregation. In our benchmarking, we have the client scan each region in parallel and have a coprocessor aggregate the row count and return a single row back (with the client then totaling the
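The pattern described here (one aggregate per region, scanned in parallel, with the client totaling the partial results) can be sketched in plain Java. Lists stand in for regions and a task's return value stands in for the single row a counting coprocessor would emit; all names are illustrative and no cluster is involved:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCount {
    // Submit one counting task per "region", then sum the partial counts
    // on the client, mirroring the coprocessor-based row count pattern.
    static long countAll(List<List<String>> regions) {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<String> region : regions) {
                // Each task returns one partial count, like a coprocessor
                // returning a single aggregated row per region.
                partials.add(pool.submit(() -> (long) region.size()));
            }
            long total = 0;
            for (Future<Long> f : partials) {
                try {
                    total += f.get();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

The win comes from each region returning one small aggregated result rather than streaming every row back to the client.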
