Stack,
Just because FB does something doesn't mean it's necessarily a good idea for
others to do the same. FB designs specifically for their own needs, and their
use cases may not match those of others.
To your point though, I agree that Ted's number of 3 is more of a rule of thumb
and not a
For the record, the refGuide mentions potential issues of CF lumpiness
that you mentioned:
http://hbase.apache.org/book.html#number.of.cfs
6.2.1. Cardinality of ColumnFamilies
Where multiple ColumnFamilies exist in a single table, be aware of the
cardinality (i.e., number of rows).
If
Thanks all for your suggestions. Since I am using HBase-0.94, it looks like the
master does not persist the balance state.
Since I am still benchmarking my cluster, I chose to bump up the value of the
hbase.balancer.period property to a very big number.
Thanks,
Akshay
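For reference, Akshay's workaround corresponds to an hbase-site.xml override like the one below. hbase.balancer.period is a real HBase property (in milliseconds); the value shown is just an illustratively large number, not a recommendation:

```xml
<!-- hbase-site.xml: effectively disable periodic balancing by setting a very
     large balancer period (milliseconds). The value here is illustrative. -->
<property>
  <name>hbase.balancer.period</name>
  <value>31536000000</value> <!-- roughly one year -->
</property>
```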
I made the following change in TestJoinedScanners.java:
- int flag_percent = 1;
+ int flag_percent = 40;
The test took longer but still favors joined scanner.
I got some new results:
2013-04-08 07:46:06,959 INFO [main] regionserver.TestJoinedScanners(157):
Slow scanner finished in
Agree here. The effectiveness depends on what % of data satisfies the
condition and how it is distributed across HFile blocks. We will get a
performance gain when we are able to skip some HFile blocks (from
non-essential CFs). Can you test with a different (lower) HFile block size?
-Anoop-
Something I'm not getting: why not use separate tables instead of
CFs in a single table? Simply name your table tablename_cfname and
you get rid of the CF# limitation?
Or are there big pros to having CFs?
JM
2013/4/8 Anoop John anoop.hb...@gmail.com:
Agree here. The effectiveness depends on
In the TestJoinedScanners.java, is the 40% randomly distributed or
sequential?
In our test, the % is randomly distributed. Also, our custom filter does
the same thing that SingleColumnValueFilter does. On the client-side,
we'd execute the query in parallel, through multiple scans along the
bq. is the 40% randomly distributed or sequential?
Looks like the distribution is striped:
if (i % 100 <= flag_percent) {
put.add(cf_essential, col_name, flag_yes);
In each stripe, it is sequential.
Let me try simulating random distribution.
On Mon, Apr 8, 2013 at 10:38 AM,
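The striping described above can be sketched in plain Java. The `i % 100 <= flag_percent` condition and the `flag_percent = 40` value come from the test; the loop bound and the `countFlagged` helper are illustrative, not the test's actual code:

```java
// Stand-alone sketch of the striping in TestJoinedScanners: the
// "i % 100 <= flag_percent" check flags rows 0..flag_percent of every
// block of 100 sequentially; the remaining rows in each block are not flagged.
public class StripeDemo {
    static int countFlagged(int flagPercent, int rows) {
        int flagged = 0;
        for (int i = 0; i < rows; i++) {
            if (i % 100 <= flagPercent) {
                flagged++;  // the test would do put.add(cf_essential, col_name, flag_yes) here
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // 41 of every 100 rows match (remainders 0 through 40 inclusive)
        System.out.println(countFlagged(40, 1000));  // 410
    }
}
```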
bq. through multiple scans along the region boundaries
Sorry am not able to get what you are saying. Could you elaborate on this?
I think the validity of this essential CF feature is best tested in real
use cases as that in Phoenix.
Regards
Ram
On Mon, Apr 8, 2013 at 11:12 PM, Ted Yu
I adopted random distribution for 30% of the rows which were selected.
I still saw meaningful improvement from joined scanners:
2013-04-08 10:54:13,819 INFO [main] regionserver.TestJoinedScanners(158):
Slow scanner finished in 6.20723 seconds, got 1552 rows
...
2013-04-08 10:54:18,801 INFO
I think that JM brings up a good point.
Keep in mind that RLL (row-level locking) in HBase is not the same as row-level
locking in transactional systems.
Depending on the use case, you can keep things in separate tables and not
worry about the issues with CFs.
So when you think about your
Hi,
Maybe there is an obvious way but I'm not seeing it.
I have a need to query HBase for multiple chunks of data, that is something
equivalent to
select columns
from table
where rowid between A and B
or rowid between C and D
or rowid between E and F
etc.
in SQL.
What's the best way to go
I thought a Scan could only cope with one start row and an end row ?
On Mon, Apr 8, 2013 at 1:27 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org
wrote:
Hi Graeme,
The scans are the right way to do that.
They will give you back all the data you need, chunk by chunk. Then
you have to
That's correct.
In your situation, you will have to create 3 scans:
One with startRow A and stopRow B
One with startRow C and stopRow D
One with startRow E and stopRow F
You can even run them in parallel if you want.
JM
2013/4/8 Graeme Wallace
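The three-scans-in-parallel pattern JM describes can be sketched with a plain ExecutorService. `fetchRange` below is a placeholder for iterating a real ResultScanner (e.g. from `HTable.getScanner(new Scan(startRow, stopRow))`), so this sketch compiles without HBase on the classpath:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: one scan per [start, stop) range, run in parallel from the client.
public class ParallelRanges {
    // Placeholder for a real HBase scan over one range; here it just
    // returns a label so the structure is runnable stand-alone.
    static String fetchRange(String start, String stop) {
        return start + "-" + stop;
    }

    public static List<String> scanAll(String[][] ranges) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(ranges.length);
        List<Future<String>> futures = new ArrayList<>();
        for (String[] r : ranges) {
            final String start = r[0], stop = r[1];
            futures.add(pool.submit(() -> fetchRange(start, stop)));
        }
        // Collect results in submission order.
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            results.add(f.get());
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(scanAll(new String[][]{{"A", "B"}, {"C", "D"}, {"E", "F"}}));
    }
}
```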
For Scan:
* To add a filter, execute {@link
#setFilter(org.apache.hadoop.hbase.filter.Filter) setFilter}.
Take a look at RowFilter:
* This filter is used to filter based on the key. It takes an operator
* (equal, greater, not equal, etc) and a byte [] comparator for the row,
You can
Hi Graeme,
Are you familiar with Phoenix (https://github.com/forcedotcom/phoenix),
a SQL skin over HBase? We've just introduced a new feature (still in the
master branch) that'll do what you're looking for: transparently doing a
skip scan over the chunks of your HBase data based on your SQL
Right - but is there a way of not tying up calling threads on the client
side - and pushing the information to the region servers so that they know
what rows to examine ?
Would this be possible in a co-processor? (I admit I haven't read up on
them yet)
On Mon, Apr 8, 2013 at 1:36 PM,
Hi Graeme,
Each time filterRowKey will return true, the entire row will be
skipped, so the data related to this row will not be read. However,
there might still be some disk access if everything is not in memory,
but not more than if you are doing a regular scan without any
filter.
I still think
IntegrationTestLazyCfLoading uses randomly distributed keys with the
following condition for filtering:
1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
is the hex string of an MD5 key.
Then, there are 2 lazy CFs, each of which has a value of 4-64k.
This test also showed
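The filter condition above can be sketched stand-alone (dropping the HBase Bytes helper; the sample keys are made up): a row is kept when the first 4 hex characters of its MD5-hex row key parse to an odd value, which selects roughly half of uniformly random keys.

```java
// Stand-alone sketch of the IntegrationTestLazyCfLoading filter condition.
public class LazyCfFilterDemo {
    static boolean accept(String hexRowKey) {
        // Parse the first 4 hex characters and keep the row if the value is odd.
        return 1 == (Long.parseLong(hexRowKey.substring(0, 4), 16) & 1);
    }

    public static void main(String[] args) {
        System.out.println(accept("0001abcd"));  // true: 0x0001 is odd
        System.out.println(accept("0002abcd"));  // false: 0x0002 is even
    }
}
```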
I forgot to mention that, assuming A is the smallest row key, you can use
the following method (of Scan) to narrow the rows scanned:
public Scan setStartRow(byte [] startRow) {
Another related feature is HBASE-6509: Implement fast-forwarding
FuzzyRowFilter to allow filtering rows e.g. by
I have a needle-in-the-haystack type scan. I have tried to read all the
issues with ScannerTimeoutException and LeaseException, but do have not
seen anyone report what I am seeing.
Running 0.92.1-cdh4.1.1. All config wrt timeouts and periods is at the
default: 60s.
When I run a scanner that
bq. additional cost of seeks/merging the results from two CFs outweighs
the benefit of lazy loading on such small values
This was my thinking as well.
HRegion#nextInternal() operation is local to the underlying region. This
makes it difficult for this method to adjust scanning behavior
0.92.1 is pretty old. Are you able to deploy newer release, e.g. 0.94.6.1
and see if the problem can be reproduced ?
Otherwise we have two choices:
1. write a unit / integration test that shows this bug
2. see more of the region server / client logs so that further analysis can
be performed.
In this case it is handled all at the server, and if doing scans you still get
the benefits of the sequential access pattern (rather than doing a lot of seeks
for point Gets).
-- Lars
From: Jean-Marc Spaggiari jean-m...@spaggiari.org
To: user@hbase.apache.org
We've had some discussions about turning a set of Gets into a (smaller) set of
Scans. That is only partially applicable here, though.
In your case I think you have two options:
1. Fire off multiple scans. You can do that in parallel from the client. Each
one will hone in to the start row with
One of James' motivations was to always be able to enable scanners to make use
of essential column families (and thus avoid depending on the HBase API
version: essential column families were added only in 0.94.5+).
Sounds like the general answer to this is: no, you shouldn't. It should still
be a per-query option, or
Hi,
If I perform a full row Delete using the Delete API for a row and then,
after a few milliseconds, issue a Put(row, map of columns, values) - will
that go through, assuming that timestamps are applied in increasing order?
Thanks
Varun
Yes, since you say it is after a few milliseconds, and assuming you did not
specify a timestamp in the Put request which is earlier than the one the row
had before the Delete was issued.
I have been bitten by this in my unit tests, doing a delete followed by a
put quickly. But in my case timestamp was within same
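The same-timestamp gotcha described above can be modeled with a toy single-cell sketch. This is a deliberate simplification, not HBase's actual implementation: a full-row Delete writes a tombstone at timestamp T, and a later Put is only visible if its timestamp is strictly greater than T (until a major compaction removes the tombstone).

```java
import java.util.TreeMap;

// Toy model of delete-tombstone masking for a single cell.
public class TombstoneDemo {
    long deleteTs = Long.MIN_VALUE;            // newest tombstone timestamp
    TreeMap<Long, String> versions = new TreeMap<>();  // ts -> value

    void delete(long ts) { deleteTs = Math.max(deleteTs, ts); }
    void put(long ts, String value) { versions.put(ts, value); }

    // Return the newest version not masked by the tombstone, or null.
    String get() {
        for (Long ts : versions.descendingKeySet()) {
            if (ts > deleteTs) {
                return versions.get(ts);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TombstoneDemo cell = new TombstoneDemo();
        cell.put(100, "v1");
        cell.delete(100);
        cell.put(100, "v2");  // same-timestamp Put: masked, the unit-test gotcha
        System.out.println(cell.get());  // null
        cell.put(101, "v3");  // later timestamp: visible
        System.out.println(cell.get());  // v3
    }
}
```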
Good idea, Sergey. We'll rerun with larger non essential column family
values and see if there's a crossover point. One other difference for us
is that we're using FAST_DIFF encoding. We'll try with no encoding too.
Our table has 20 million rows across four regions servers.
Regarding the
Hi Jean,
I guess Graeme may be asking whether HBase can support a parallel
scan in the client API, because currently the client API doesn't
provide concurrent access if a range query crosses multiple regions.
Now, if we want to support parallel scans, we have to use a
coprocessor to implement them,
Randy
As Ted suggested, can you look at the client logs closely (RS side also)? Are
there next() call retries happening from the client side because of RPC
timeouts? In such a case this kind of issue can happen. I suspect he hit
HBASE-5974
-Anoop-
From: Ted