Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? It'll
use all the parts of your row key and, depending on how much data you're
returning to the client, can query over 10 million rows in seconds.
James
@JamesPlusPlus
http://phoenix-hbase.blogspot.com
On Apr 30,
On 04/25/2013 03:35 PM, Gary Helmling wrote:
I'm looking to write a service that runs alongside the region servers and
acts as a proxy between my application and the region servers.
I plan to use the logic in the HBase client's HConnectionManager to segment
my request of 1M rowkeys into sub-requests per
Thanks for the additional info, Sudarshan. This would fit well with the
implementation of Phoenix's skip scan.
CREATE TABLE t (
object_id INTEGER NOT NULL,
field_type INTEGER NOT NULL,
attrib_id INTEGER NOT NULL,
value BIGINT
CONSTRAINT pk PRIMARY KEY (object_id, field_type,
Our performance engineer, Mujtaba Chohan, has agreed to put together a
benchmark for you. We only have a four-node cluster of pretty average
boxes, but it should give you an idea.
There's no performance impact from attrib_id not being part of the PK, since
you're not filtering on it (if I
Phoenix will parallelize within a region:
SELECT count(1) FROM orders
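As a sketch of the skip scan described above (column names taken from the table t defined earlier; the IN-list values are hypothetical):

```sql
-- Hypothetical skip-scan query over table t: Phoenix combines the
-- IN-list values on the leading row key columns into a skip scan,
-- seeking directly to the matching key ranges instead of scanning
-- every row in the table.
SELECT object_id, attrib_id, value
FROM t
WHERE object_id IN (1, 2, 3)
  AND field_type IN (10, 20);
```

In the 1M-rowkey case above, the client would supply the rowkeys as one large IN list on the leading PK column and let Phoenix break it into per-region chunks.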
I agree with Ted, though: even serially, 100,000 rows shouldn't take anywhere
near 6 minutes. You say 100,000 rows. Can you tell us what it's ?
Thanks,
James
On Apr 19, 2013, at 2:37 AM, Ted Yu yuzhih...@gmail.com wrote:
Does your filter utilize hints?
It would be easier for me and other people to reproduce the issue you
experienced if you put your scenario in some test similar to
TestJoinedScanners.
Will take a closer look at the code Monday.
Cheers
On Sun, Apr 7, 2013 at 11:37 AM, James Taylor jtay
Hi Greame,
Are you familiar with Phoenix (https://github.com/forcedotcom/phoenix),
a SQL skin over HBase? We've just introduced a new feature (still in the
master branch) that'll do what you're looking for: transparently doing a
skip scan over the chunks of your HBase data based on your SQL
would be larger lazy CFs and/or a low percentage of values
selected.
Can you try to increase the 2nd CF values' size and rerun the test?
On Mon, Apr 8, 2013 at 10:38 AM, James Taylor jtay...@salesforce.com wrote:
In TestJoinedScanners.java, is the 40% randomly distributed or
sequential?
In our
Hello,
We're doing some performance testing of the essential column family
feature, and we're seeing some performance degradation when comparing
with and without the feature enabled:
% of rows selected    Performance of scan relative to not enabling the feature
The case Max Lapan tried to address has the non-essential column family
carrying considerably more data compared to the essential column family.
Cheers
On Sat, Apr 6, 2013 at 11:05 PM, James Taylor jtay...@salesforce.com wrote:
Hello,
We're doing some performance testing of the essential column family
feature
From the SQL perspective, handling null is important. Phoenix supports
null in the following way:
- the absence of a key value
- an empty value in a key value
- an empty value in a multi part row key
- for variable length types (VARCHAR and DECIMAL) a null byte
separator would be used if not
On 04/01/2013 04:41 PM, Nick Dimiduk wrote:
On Mon, Apr 1, 2013 at 4:31 PM, James Taylor jtay...@salesforce.com wrote:
From the SQL perspective, handling null is important.
From your perspective, it is critical to support NULLs, even at the expense
of fixed-width encodings at all
Mohith,
Are you wanting to reduce the amount of data you're scanning and bring
down your query time when:
- you have a multi-part row key of a string and a time value, and
- you know the prefix of the string and a range of the time value?
That's possible (but not easy) to do with
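In Phoenix, a query of that shape could be expressed directly in SQL (the table and column names below are hypothetical, chosen only to match the key structure described above):

```sql
-- Hypothetical table with a multi-part row key (name, event_time):
-- the known string prefix plus the time range lets the scan be
-- bounded to a narrow slice of the key space rather than requiring
-- a full table scan.
SELECT *
FROM events
WHERE name LIKE 'sensor-42%'
  AND event_time BETWEEN TO_DATE('2013-01-01') AND TO_DATE('2013-02-01');
```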
Another one to add to your list:
6. Phoenix (https://github.com/forcedotcom/phoenix)
Thanks,
James
On Mar 20, 2013, at 2:50 AM, Vivek Mishra vivek.mis...@impetus.co.in wrote:
I have used Kundera; the persistence overhead on top of the HBase API is
minimal considering the feature set available for use within
Hi Nick,
What do you mean by hashing algorithms?
Thanks,
James
On 03/15/2013 10:11 AM, Nick Dimiduk wrote:
Hi David,
Native support for a handful of hashing algorithms has also been discussed.
Do you think these should be supported directly, as opposed to using a
fixed-length String or
Another possible solution for you: use Phoenix:
https://github.com/forcedotcom/phoenix
Phoenix would allow you to model your scenario using SQL through JDBC,
like this:
Connection conn = DriverManager.getConnection("jdbc:phoenix:your zookeeper
quorum");
Statement stmt = conn.createStatement(
Check your logs for whether your end-point coprocessor is hitting
zookeeper on every invocation to figure out the region start key.
Unfortunately (at least last time I checked), the default way of
invoking an end point coprocessor doesn't use the meta cache. You can go
through a combination of
, Ted Yu yuzhih...@gmail.com wrote:
I ran test suite and they passed:
Tests run: 452, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] BUILD SUCCESS
Good job.
On Mon, Feb 25, 2013 at 9:35 AM, James Taylor jtay
., but it
illustrates the idea.
On 02/26/2013 09:59 AM, Ted Yu wrote:
In the first graph on the performance page, what does 'key filter'
represent ?
Thanks
On Tue, Feb 26, 2013 at 9:53 AM, James Taylor jtay...@salesforce.com wrote:
Both Phoenix and Impala provide SQL as a way to get at your data. Here
You can query existing tables if the data is serialized in the way that
Phoenix expects. For more detailed information and options, check out
my response to this issue:
https://github.com/forcedotcom/phoenix/issues/30 and check out our Data
Type language reference here:
We are pleased to announce the immediate availability of Phoenix v 1.1,
with support for HBase v 0.94.4 and above. Phoenix is a SQL layer on top
of HBase. For details, see our announcement here:
http://phoenix-hbase.blogspot.com/2013/02/annoucing-phoenix-v-11-support-for.html
Thanks,
James
Same with us on Phoenix - we use the setAttribute on the client side and
the getAttribute on the server side to pick up state on the Scan being
executed. Works great. One thing to keep in mind, though: for a region
observer coprocessor, the state you set on the client side will be sent
to each
Unless I'm doing something wrong, it looks like the Maven repository
(http://mvnrepository.com/artifact/org.apache.hbase/hbase) only contains
HBase up to 0.94.3. Is there a different repo I should use, or if not,
any ETA on when it'll be updated?
James
Hello,
Have you considered using Phoenix
(https://github.com/forcedotcom/phoenix) for this use case? Phoenix is a
SQL layer on top of HBase. For this use case, you'd connect to your
cluster like this:
Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register
driver
Connection
hotspotting when using time as the key. Or the problem with always
adding data to the right of the last row.
The same would apply with the project id, assuming that it too is a number that
grows incrementally with each project.
On Feb 17, 2013, at 4:50 PM, James Taylor jtay...@salesforce.com wrote
IMO, I don't think it's safe to change the KV in-place. We always create a new
KV in our coprocessors.
James
On Feb 12, 2013, at 6:41 AM, Mesika, Asaf asaf.mes...@gmail.com wrote:
I'm seeing a very strange behavior:
If I run a scan during major compaction, I can see both the modified Delta
In 0.94.2, if the coprocessor class was on the HBase classpath, then the
jarFilePath argument to HTableDescriptor.addCoprocessor seemed to
essentially be ignored - it didn't matter if the jar could be found or
not. In 0.94.4 we're getting an error if this is the case. Is there a
way to
Filed https://issues.apache.org/jira/browse/HBASE-7805
Test case attached
It occurs only if the table has a region observer coprocessor.
James
On 02/09/2013 11:04 AM, lars hofhansl wrote:
If I execute in parallel multiple scans to different parts of the same region,
they appear to be
- Original Message -
From: James Taylor jtay...@salesforce.com
To: user@hbase.apache.org user@hbase.apache.org; lars hofhansl
la...@apache.org
Cc:
Sent: Friday, February 8, 2013 9:52 PM
Subject: Re: independent scans to same region processed serially
All data is in the blockcache.
Wanted to check with folks and see if they've seen an issue around this
before digging in deeper. I'm on 0.94.2. If I execute in parallel
multiple scans to different parts of the same region, they appear to be
processed serially. It's actually faster from the client side to execute
a single
(https://issues.apache.org/jira/browse/HBASE-7336). Fixed in 0.94.4.
I assume you have enough handlers, etc. (i.e., does the same happen if you
issue multiple scan requests across different regions of the same region server?)
-- Lars
From: James Taylor jtay
Another approach would be to use Phoenix
(http://github.com/forcedotcom/phoenix). You can model your schema as
you would in the relational world, but you get the horizontal
scalability of HBase.
James
On 02/06/2013 01:49 PM, Michael Segel wrote:
Overloading the time stamp aka the
...@mapbased.com wrote:
Great tool, I will try it later. Thanks for sharing!
2013/1/31 Devaraj Das d...@hortonworks.com
Congratulations, James. We will surely benefit from this tool.
On Wed, Jan 30, 2013 at 1:04 PM, James Taylor jtay...@salesforce.com
wrote:
We are pleased to announce the immediate
If you run a SQL query that does aggregation (i.e. uses a built-in
aggregation function like COUNT or does a GROUP BY), Phoenix will
orchestrate the running of a set of queries in parallel, segmented along
your row key (driven by the start/stop key plus region boundaries). We
take advantage of
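As a minimal sketch of such an aggregate (the orders table is borrowed from the earlier count example; the customer_id and total columns are hypothetical):

```sql
-- Hypothetical aggregate query: Phoenix splits this into parallel
-- scans segmented along the row key (start/stop key plus region
-- boundaries), aggregates each chunk server-side in a coprocessor,
-- and merges the partial results on the client.
SELECT customer_id, COUNT(1), SUM(total)
FROM orders
GROUP BY customer_id;
```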
roadmap:
https://github.com/forcedotcom/phoenix/wiki#wiki-roadmap
We welcome feedback and contributions from the community to Phoenix and
look forward to working together.
Regards,
James Taylor
@JamesPlusPlus
No, there's no sorted dimension. This would be a full table scan over
40M rows. This assumes the following:
1) your regions are evenly distributed across a four node cluster
2) unique combinations of month * scene are small enough to fit into memory
3) you chunk it up on the client side and run
iwannaplay games funnlearnforkids@... writes:
Hi ,
I want to run a query like
select month(eventdate), scene, count(1), sum(timespent) from eventlog
group by month(eventdate), scene
in HBase. Through Hive it's taking a lot of time for 40 million
records. Do we have any syntax in HBase to find
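As a sketch, the same aggregate could be written as a Phoenix query (this assumes an eventlog table with eventdate, scene, and timespent columns, and assumes a date-truncation built-in such as TRUNC standing in for month()):

```sql
-- Hypothetical Phoenix rewrite of the Hive query above: TRUNC to
-- month granularity plays the role of month(eventdate), and the
-- GROUP BY is parallelized across regions rather than run as a
-- MapReduce job.
SELECT TRUNC(eventdate, 'MONTH'), scene, COUNT(1), SUM(timespent)
FROM eventlog
GROUP BY TRUNC(eventdate, 'MONTH'), scene;
```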
We've seen reasonable performance, with the caveat that you need to
parallelize the scan doing the aggregation. In our benchmarking, we have
the client scan each region in parallel and have a coprocessor aggregate
the row count and return a single row back (with the client then
totaling the