You do realize that in the general case you want to return the result set in
sort order.
So:
1) You can still query in sorted order, in which case N scans are
beneficial. (In our tests: ~25% faster for N=2, going up to about ~50%
faster for N=16.)
2) Many times you would issue a scan without necessarily caring about
individual record order. (e.g.: let me perform some operation on all rows in a range.)
You have n different scans, and you then have to put the rows from
each scan, in sort order, into a single result set.
While in each scan the RS is in sort order, the overall set of RSs needs to be
merged into one RS, and that's where you start to have issues.
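That client-side merge step can be sketched in a few lines (a minimal sketch: the `NN|key` salt layout and the sample rows are invented for illustration, and `heapq.merge` stands in for whatever the client library actually does):

```python
import heapq

# Simulated results of n parallel scans: each bucket's rows arrive
# already sorted within that bucket, as HBase guarantees per scan.
scan_results = [
    [("0|row03", "a"), ("0|row10", "b")],
    [("1|row01", "c"), ("1|row07", "d")],
    [("2|row05", "e")],
]

def strip_salt(item):
    # Order by the logical key, i.e. the part after the salt prefix.
    return item[0].split("|", 1)[1]

# k-way merge: cheap per comparison, but every row from every scan
# must flow through this single client-side merge point.
merged = list(heapq.merge(*scan_results, key=strip_salt))
print([k for k, _ in merged])
```

The merge itself is not expensive; the issue Michael raises is that the client becomes the rendezvous point for all n streams before a single sorted RS can be returned.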
Again YMMV…
And again… depending
On Mon, May 19, 2014 at 8:53 AM, Michael Segel
michael_se...@hotmail.com wrote:
While in each scan the RS is in sort order, the overall set of RSs needs to
be merged into one RS, and that's where you start to have issues.
What issues? As I said, in multiple tests we saw performance improvements.
You run n scans in parallel.
You want a single result set in sort order.
How do you do that?
(Rhetorical)
That's the extra work that you don't have when you have a single result set.
This goes into why the work being done to associate secondary indexes with
the base table won't scale.
I think I should dust off my schema design talk… clearly the talks given by
some of the vendors don’t really explain things …
(Hmmm. Strata London?)
See my reply below… Note I used SHA-1; MD5 should also give you roughly the
same results.
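A minimal sketch of the hash-derived salt being discussed (SHA-1, as above; MD5 hashes the key about as uniformly). The bucket count, separator, and key layout here are illustrative assumptions, not from the thread:

```python
import hashlib

N_BUCKETS = 16  # illustrative bucket count

def salted_key(row_key: bytes, n_buckets: int = N_BUCKETS) -> bytes:
    # Derive the salt from a hash of the key itself. Because the salt
    # is a pure function of the key, a reader can recompute it and
    # reach the exact row with a single get().
    salt = hashlib.sha1(row_key).digest()[0] % n_buckets
    return b"%02d|" % salt + row_key

k = salted_key(b"sensor42|2014-05-18T04:28")
# Deterministic: the same logical key always salts to the same bucket.
assert k == salted_key(b"sensor42|2014-05-18T04:28")
```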
On May 18, 2014, at 4:28 AM, Software Dev wrote:
You may be missing the point. The primary reason for the salt prefix
pattern is to avoid hotspotting when inserting time series data AND at
the same time provide a way to perform range scans.
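To make the hotspotting point concrete, a toy simulation (the key format and bucket count are invented for illustration): monotonically increasing time-series keys all sort into one key range, while a hash-derived salt stripes the same writes across buckets:

```python
import hashlib
from collections import Counter

N_BUCKETS = 8  # illustrative

# Sequential time-series keys, as written by a single feed.
keys = [b"2014-05-19T08:53:%02d" % s for s in range(60)]

# Without a salt, every key shares the same leading bytes, so all
# writes land on the one region hosting that key range (the hotspot).
unsalted_prefixes = Counter(k[:10] for k in keys)

# With a hash-derived salt, the same writes are striped over
# N_BUCKETS key ranges, one per bucket.
salted_buckets = Counter(hashlib.sha1(k).digest()[0] % N_BUCKETS for k in keys)

print(len(unsalted_prefixes), "distinct unsalted prefixes")  # 1
print(len(salted_buckets), "buckets written to")
```

Within each bucket the timestamp order is preserved, which is what keeps range scans possible (one scan per bucket over the same time range).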
…on this subject and realized my second question may
not be appropriate since this prefix salting pattern assumes that the
prefix is random. I thought it was actually based off a hash that
could be predetermined so you could always, if needed, get to the
exact row key with one get. Would
No, you're missing the point.
It's not a good idea or design.
Is your data mutable or static?
To your point: every time you want to do a simple get() you have to issue n
get() statements. On your range scans you will have to do n range scans, then
join and sort the result sets. The fact that
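Michael's get() cost can be sketched against a plain dict standing in for the table (the names, key layout, and bucket count are all hypothetical): a random salt forces a probe per bucket for a single logical get, while a hash-derived salt lets the reader recompute the prefix and issue one:

```python
import hashlib

N_BUCKETS = 4  # illustrative

def bucket_of(key: bytes) -> int:
    return hashlib.sha1(key).digest()[0] % N_BUCKETS

# Toy "table": salted key -> value.
table = {b"%d|user9" % bucket_of(b"user9"): b"v"}

# Random salt: the reader cannot know the prefix, so one logical
# get() fans out into N_BUCKETS probes.
def get_random_salt(key: bytes):
    probes = [b"%d|" % b + key for b in range(N_BUCKETS)]
    return [table[p] for p in probes if p in table], len(probes)

# Deterministic salt: recompute the prefix, one probe.
def get_hashed_salt(key: bytes):
    p = b"%d|" % bucket_of(key) + key
    return table.get(p), 1

vals, n_probes = get_random_salt(b"user9")
assert n_probes == N_BUCKETS
val, n = get_hashed_salt(b"user9")
assert val == b"v" and n == 1
```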
In our measurements, scanning is improved by performing against n
range scans rather than 1 (since you are effectively striping the
reads). This is even better when you don't necessarily care about the
order of every row, but want every row in a given range (then you can
just get whatever row is available from a buffer in the client).
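Mike's striped-read point, as a stand-in sketch with no real HBase client (the bucket contents and helper names are invented): n range scans run in parallel, and when global order doesn't matter the client simply drains whichever buffers have rows ready:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy per-bucket stores standing in for n salted key ranges,
# each typically served by a different region server.
buckets = [
    {"0|a": 1, "0|d": 4},
    {"1|b": 2, "1|e": 5},
    {"2|c": 3},
]

def scan_bucket(store):
    # Each scan returns its own bucket's rows in key order.
    return sorted(store.items())

# n scans issued concurrently: reads are striped across buckets.
with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
    results = list(pool.map(scan_bucket, buckets))

# Order across buckets is irrelevant here; just flatten.
rows = [kv for rs in results for kv in rs]
assert len(rows) == 5
```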
@Software Dev - it might be feasible to implement a Thrift client that speaks
Phoenix JDBC. I believe this is similar to what Hive has done.
Thanks,
James
On Sun, May 18, 2014 at 1:19 PM, Mike Axiak m...@axiak.net wrote:
In our measurements, scanning is improved by performing against n
range scans rather than 1.
I recently came across the pattern of adding a salting prefix to the
row keys to prevent hotspotting. Still trying to wrap my head around
it and I have a few questions.
- Is there ever a reason to salt to more buckets than there are region
servers? The only reason why I think that may be
Well, kept reading on this subject and realized my second question may
not be appropriate since this prefix salting pattern assumes that the
prefix is random. I thought it was actually based off a hash that
could be predetermined so you could always, if needed, get to the
exact row key with one get. Would there be something wrong