Re: schema design: rows vs wide columns

2013-04-08 Thread Michael Segel
StAck, Just because FB does something doesn't mean it's necessarily a good idea for others to do the same. FB designs specifically for its own needs, and its use cases may not match those of others. To your point, though, I agree that Ted's number of 3 is more of a rule of thumb and not a

Re: schema design: rows vs wide columns

2013-04-08 Thread Doug Meil
For the record, the refGuide mentions potential issues of CF lumpiness that you mentioned: http://hbase.apache.org/book.html#number.of.cfs 6.2.1. Cardinality of ColumnFamilies Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). If

Re: Disabling balancer permanently in HBase

2013-04-08 Thread Akshay Singh
Thanks all for your suggestions. Since I am using HBase 0.94, it looks like the master does not persist the balancer state. Since I am still benchmarking my cluster... I chose to bump the value of the hbase.balancer.period property to a very big number. Thanks, Akshay
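A minimal sketch of the non-persistent alternative discussed in this thread, assuming a 0.94-era client; the class name is illustrative. HBaseAdmin.balanceSwitch(false) turns the balancer off, but in 0.94 the flag lives only in master memory, which is why it does not survive a master restart and why bumping hbase.balancer.period is the more durable workaround:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class BalancerOff {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // Turn the balancer off; returns the previous state.
            // Note: in 0.94 this is in-memory only, so a master
            // restart re-enables balancing.
            boolean previous = admin.balanceSwitch(false);
            System.out.println("Balancer was previously: " + previous);
            admin.close();
        }
    }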

Re: Essential column family performance

2013-04-08 Thread Ted Yu
I made the following change in TestJoinedScanners.java: - int flag_percent = 1; + int flag_percent = 40; The test took longer but still favored the joined scanner. I got some new results: 2013-04-08 07:46:06,959 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in

Re: Essential column family performance

2013-04-08 Thread Anoop John
Agree here. The effectiveness depends on what % of the data satisfies the condition and how it is distributed across HFile blocks. We will get a performance gain when we are able to skip some HFile blocks (from the non-essential CFs). Can you test with different HFile block sizes (lower values)? -Anoop-
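For readers following along, a minimal sketch of the kind of scan this feature targets, assuming HBase 0.94.5+ (where setLoadColumnFamiliesOnDemand was added); the family, qualifier, and value names are illustrative:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LazyCfScan {
        public static Scan build() {
            // The filter only touches the small, eagerly-read "essential" CF.
            SingleColumnValueFilter flag = new SingleColumnValueFilter(
                Bytes.toBytes("essential"),
                Bytes.toBytes("col"),
                CompareFilter.CompareOp.EQUAL,
                Bytes.toBytes("Y"));
            flag.setFilterIfMissing(true);  // rows without the column are skipped
            Scan scan = new Scan();
            scan.setFilter(flag);
            // Families the filter does not need are read lazily, block by block,
            // only for rows that pass it; that is where skipped HFile blocks
            // (and the performance gain) come from.
            scan.setLoadColumnFamiliesOnDemand(true);
            return scan;
        }
    }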

Re: Essential column family performance

2013-04-08 Thread Jean-Marc Spaggiari
Something I'm not getting: why not use separate tables instead of CFs in a single table? Simply name your table tablename_cfname and you get rid of the limit on the number of CFs. Or are there big pros to having CFs? JM 2013/4/8 Anoop John anoop.hb...@gmail.com: Agree here. The effectiveness depends on

Re: Essential column family performance

2013-04-08 Thread James Taylor
In TestJoinedScanners.java, is the 40% randomly distributed or sequential? In our test, the % is randomly distributed. Also, our custom filter does the same thing that SingleColumnValueFilter does. On the client side, we'd execute the query in parallel, through multiple scans along the

Re: Essential column family performance

2013-04-08 Thread Ted Yu
bq. is the 40% randomly distributed or sequential? Looks like the distribution is striped: if (i % 100 <= flag_percent) { put.add(cf_essential, col_name, flag_yes); In each stripe, it is sequential. Let me try simulating random distribution. On Mon, Apr 8, 2013 at 10:38 AM,
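A minimal sketch of one way to simulate that, reusing the flag_percent, put, cf_essential, col_name, and flag_yes variables from TestJoinedScanners; the Random instance is an addition for illustration:

    java.util.Random rnd = new java.util.Random();
    // Flag roughly flag_percent% of the rows, scattered uniformly
    // instead of in stripes of consecutive keys:
    if (rnd.nextInt(100) < flag_percent) {
        put.add(cf_essential, col_name, flag_yes);
    }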

Re: Essential column family performance

2013-04-08 Thread ramkrishna vasudevan
bq. through multiple scans along the region boundaries Sorry, I am not able to follow what you are saying. Could you elaborate on this? I think the validity of this essential-CF feature is best tested in real use cases such as Phoenix's. Regards Ram On Mon, Apr 8, 2013 at 11:12 PM, Ted Yu

Re: Essential column family performance

2013-04-08 Thread Ted Yu
I adopted a random distribution, with 30% of the rows selected. I still saw a meaningful improvement from joined scanners: 2013-04-08 10:54:13,819 INFO [main] regionserver.TestJoinedScanners(158): Slow scanner finished in 6.20723 seconds, got 1552 rows ... 2013-04-08 10:54:18,801 INFO

Re: Essential column family performance

2013-04-08 Thread Michael Segel
I think JM brings up a good point. Keep in mind that row-level locking (RLL) in HBase is not the same as row-level locking in transactional systems. Depending on the use case... you can keep things in separate tables and not worry about the issues with CFs. So when you think about your

Best way to query multiple sets of rows

2013-04-08 Thread Graeme Wallace
Hi, Maybe there is an obvious way but I'm not seeing it. I need to query HBase for multiple chunks of data, that is, something equivalent to select columns from table where rowid between A and B or rowid between C and D or rowid between E and F etc. in SQL. What's the best way to go

Re: Best way to query multiple sets of rows

2013-04-08 Thread Graeme Wallace
I thought a Scan could only cope with one start row and one end row? On Mon, Apr 8, 2013 at 1:27 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Graeme, Scans are the right way to do that. They will give you back all the data you need, chunk by chunk. Then you have to

Re: Best way to query multiple sets of rows

2013-04-08 Thread Jean-Marc Spaggiari
That's correct. In your situation, you will have to create 3 scans: one with startRow A and endRow B, one with startRow C and endRow D, and one with startRow E and endRow F. You can even run them in parallel if you want. JM 2013/4/8 Graeme Wallace
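A minimal sketch of those three scans run in parallel, assuming a 0.94-era client; the table name and row keys are illustrative:

    import java.util.concurrent.*;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ParallelRangeScans {
        public static void main(String[] args) throws Exception {
            final Configuration conf = HBaseConfiguration.create();
            byte[][][] ranges = {
                {Bytes.toBytes("A"), Bytes.toBytes("B")},
                {Bytes.toBytes("C"), Bytes.toBytes("D")},
                {Bytes.toBytes("E"), Bytes.toBytes("F")},
            };
            ExecutorService pool = Executors.newFixedThreadPool(ranges.length);
            for (final byte[][] range : ranges) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // One HTable per thread: HTable is not thread-safe.
                            HTable table = new HTable(conf, "mytable");
                            ResultScanner scanner =
                                table.getScanner(new Scan(range[0], range[1]));
                            for (Result r : scanner) {
                                // process r
                            }
                            scanner.close();
                            table.close();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }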

Re: Best way to query multiple sets of rows

2013-04-08 Thread Ted Yu
For Scan: * To add a filter, execute {@link #setFilter(org.apache.hadoop.hbase.filter.Filter) setFilter}. Take a look at RowFilter: * This filter is used to filter based on the key. It takes an operator * (equal, greater, not equal, etc) and a byte [] comparator for the row, You can
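A sketch of how those pieces could be combined for the multi-range case, under the assumption that a filtered full-table scan is acceptable; the class name and range keys are illustrative:

    import java.util.Arrays;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.Filter;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.RowFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MultiRangeFilterScan {
        // start <= row AND row < stop
        static Filter range(byte[] start, byte[] stop) {
            return new FilterList(FilterList.Operator.MUST_PASS_ALL,
                Arrays.<Filter>asList(
                    new RowFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL,
                        new BinaryComparator(start)),
                    new RowFilter(CompareFilter.CompareOp.LESS,
                        new BinaryComparator(stop))));
        }

        public static Scan build() {
            // OR the row ranges together:
            FilterList ranges = new FilterList(FilterList.Operator.MUST_PASS_ONE);
            ranges.addFilter(range(Bytes.toBytes("A"), Bytes.toBytes("B")));
            ranges.addFilter(range(Bytes.toBytes("C"), Bytes.toBytes("D")));
            ranges.addFilter(range(Bytes.toBytes("E"), Bytes.toBytes("F")));
            Scan scan = new Scan();
            scan.setFilter(ranges);
            return scan;
        }
    }

Note that this still walks every row server-side, so it saves network transfer rather than I/O; the fast-forwarding FuzzyRowFilter mentioned below can seek ahead instead.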

Re: Best way to query multiple sets of rows

2013-04-08 Thread James Taylor
Hi Graeme, Are you familiar with Phoenix (https://github.com/forcedotcom/phoenix), a SQL skin over HBase? We've just introduced a new feature (still on the master branch) that'll do what you're looking for: transparently doing a skip scan over the chunks of your HBase data based on your SQL

Re: Best way to query multiple sets of rows

2013-04-08 Thread Graeme Wallace
Right - but is there a way of not tying up calling threads on the client side, and instead pushing the information to the region servers so that they know which rows to examine? Would this be possible in a coprocessor? (I admit I haven't read up on them yet.) On Mon, Apr 8, 2013 at 1:36 PM,

Re: Best way to query multiple sets of rows

2013-04-08 Thread Jean-Marc Spaggiari
Hi Graeme, Each time filterRowKey returns true, the entire row is skipped, so the data related to that row is not read. However, there might still be some disk access if everything is not in memory, but no more than with a regular scan without any filter. I still think
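For concreteness, a sketch of the kind of custom filter being described, assuming HBase 0.94 where filters are Writable and must be present on every region server's classpath; the hard-coded range stands in for state a real filter would serialize:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.hbase.filter.FilterBase;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowRangeSkipFilter extends FilterBase {
        private byte[] start = Bytes.toBytes("C");  // illustrative range
        private byte[] stop = Bytes.toBytes("D");

        @Override
        public boolean filterRowKey(byte[] buffer, int offset, int length) {
            // Returning true skips the entire row; keep rows in [start, stop).
            return Bytes.compareTo(buffer, offset, length,
                       start, 0, start.length) < 0
                || Bytes.compareTo(buffer, offset, length,
                       stop, 0, stop.length) >= 0;
        }

        // A deployable 0.94 filter would serialize its ranges here.
        public void write(DataOutput out) throws IOException {}
        public void readFields(DataInput in) throws IOException {}
    }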

Re: Essential column family performance

2013-04-08 Thread Sergey Shelukhin
IntegrationTestLazyCfLoading uses randomly distributed keys with the following condition for filtering: 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey is the hex string of an MD5 key. Then there are 2 lazy CFs, each of which has a value of 4-64k. This test also showed

Re: Best way to query multiple sets of rows

2013-04-08 Thread Ted Yu
I forgot to mention that, assuming A is the smallest row key, you can use the following method (of Scan) to narrow the rows scanned: public Scan setStartRow(byte [] startRow) { Another related feature is HBASE-6509: Implement fast-forwarding FuzzyRowFilter to allow filtering rows e.g. by
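A sketch of FuzzyRowFilter usage, assuming fixed-width row keys where a varying prefix precedes a fixed suffix; the key layout is illustrative (mask byte 0 = position must match, 1 = position may vary):

    import java.util.Arrays;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
    import org.apache.hadoop.hbase.util.Pair;

    // Row keys assumed to be 9 bytes: 4 varying bytes, then the fixed "_0001".
    byte[] fuzzyKey = {0, 0, 0, 0, '_', '0', '0', '0', '1'};
    byte[] mask     = {1, 1, 1, 1,   0,   0,   0,   0,   0};
    Scan scan = new Scan();
    scan.setFilter(new FuzzyRowFilter(
        Arrays.asList(new Pair<byte[], byte[]>(fuzzyKey, mask))));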

Scanner returning subset of data

2013-04-08 Thread Randy Fox
I have a needle-in-the-haystack type scan. I have tried to read all the issues with ScannerTimeoutException and LeaseException, but have not seen anyone report what I am seeing. Running 0.92.1-cdh4.1.1. All config wrt timeouts and periods is at the default: 60s. When I run a scanner that

Re: Essential column family performance

2013-04-08 Thread Ted Yu
bq. additional cost of seeks/merging the results from two CFs outweighs the benefit of lazy loading on such small values This was my thinking as well. The HRegion#nextInternal() operation is local to the underlying region. This makes it difficult for this method to adjust scanning behavior

Re: Scanner returning subset of data

2013-04-08 Thread Ted Yu
0.92.1 is pretty old. Are you able to deploy a newer release, e.g. 0.94.6.1, and see if the problem can be reproduced? Otherwise we have two choices: 1. write a unit / integration test that shows this bug 2. see more of the region server / client logs so that further analysis can be performed.

Re: Essential column family performance

2013-04-08 Thread lars hofhansl
In this case it is handled entirely at the server, and if doing scans you still get the benefits of the sequential access pattern (rather than doing a lot of seeks for point Gets). -- Lars From: Jean-Marc Spaggiari jean-m...@spaggiari.org To: user@hbase.apache.org

Re: Best way to query multiple sets of rows

2013-04-08 Thread lars hofhansl
We've had some discussions about turning a set of Gets into a (smaller) set of Scans. That is only partially applicable here, though. In your case I think you have two options: 1. Fire off multiple scans. You can do that in parallel from the client. Each one will home in on the start row with

Re: Essential column family performance

2013-04-08 Thread lars hofhansl
One of James' motivations was to always be able to enable scanners to make use of essential column families (and thus avoid an HBase API version check - essential column families were added only in 0.94.5+). Sounds like the general answer to this is: no, you shouldn't. It should still be a per-query option, or

Full row delete followed by Put

2013-04-08 Thread Varun Sharma
Hi, If I perform a full-row Delete using the Delete API for a row and then, after a few milliseconds, issue a Put(row, map of columns, values) - will that go through, assuming that timestamps are applied in increasing order? Thanks Varun

Re: Full row delete followed by Put

2013-04-08 Thread Shrijeet Paliwal
Yes, since you say after a few milliseconds - assuming you did not specify a timestamp in the Put request that is earlier than the one the row had before the delete was issued. I have been bitten by this in my unit tests, doing a delete followed by a put too quickly. But in my case the timestamp was within the same
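A sketch of the pitfall and the explicit-timestamp workaround, assuming the server stamps both the Delete and the Put when no timestamp is supplied; the table, family, and qualifier names are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeleteThenPut {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "t");
            byte[] row = Bytes.toBytes("r1");

            table.delete(new Delete(row)); // row tombstone stamped at server time T

            // If the Put lands with a timestamp <= T (possible when client and
            // server clocks differ, or both fall in the same millisecond), the
            // tombstone masks it until the next major compaction. An explicit,
            // strictly later timestamp sidesteps that:
            Put put = new Put(row);
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"),
                System.currentTimeMillis() + 1, Bytes.toBytes("v"));
            table.put(put);
            table.close();
        }
    }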

Re: Essential column family performance

2013-04-08 Thread James Taylor
Good idea, Sergey. We'll rerun with larger non-essential column family values and see if there's a crossover point. One other difference for us is that we're using FAST_DIFF encoding. We'll try with no encoding too. Our table has 20 million rows across four region servers. Regarding the

Re: Best way to query multiple sets of rows

2013-04-08 Thread Shixiaolong
Hi Jean, I guess Graeme may have been asking whether HBase can support a parallel scan through the client API, because currently the client API doesn't provide concurrent access when a range query crosses multiple regions. For now, if we want to support parallel scans, we have to implement them with a coprocessor,

RE: Scanner returning subset of data

2013-04-08 Thread Anoop Sam John
Randy, as Ted suggested, can you look closely at the client logs (RS side also)? Are there next() call retries happening from the client side because of RPC timeouts? In such a case this kind of issue can happen. I suspect you hit HBASE-5974. -Anoop- From: Ted
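If retried next() calls do turn out to be the cause (the HBASE-5974 pattern, where a retry silently skips a batch), one client-side mitigation is to shrink the work done per next() call; the caching value below is illustrative:

    import org.apache.hadoop.hbase.client.Scan;

    Scan scan = new Scan();
    // With a sparse filter and a large caching value, the server may scan
    // for longer than the 60s lease/RPC timeout to fill one batch; a small
    // caching value bounds the time spent per next() RPC.
    scan.setCaching(10);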