Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread James Taylor
Sure, no problem. One addition: depending on the cardinality of your priority column, you may want to salt your table to prevent hotspotting, since you'll have a monotonically increasing date in the key. To do that, just add SALT_BUCKETS=n to your query, where n is the number of machines in
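
For illustration, a minimal sketch of what that advice could look like through Phoenix's JDBC driver. SALT_BUCKETS is the real Phoenix table property; the table, columns, and bucket count below are invented for the example:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  public class SaltedCrawlTable {
      public static void main(String[] args) throws Exception {
          // Phoenix is reached through JDBC; "localhost" stands in for the ZooKeeper quorum.
          try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
               Statement stmt = conn.createStatement()) {
              // SALT_BUCKETS prefixes each row key with a one-byte hash and pre-splits the
              // table into that many buckets, so the monotonically increasing date no longer
              // funnels all writes into a single hot region.
              stmt.execute(
                  "CREATE TABLE crawl_queue (" +
                  "  priority INTEGER NOT NULL, " +
                  "  added_date DATE NOT NULL, " +
                  "  url VARCHAR NOT NULL, " +
                  "  CONSTRAINT pk PRIMARY KEY (priority, added_date, url)" +
                  ") SALT_BUCKETS = 8"); // n = 8, i.e. the number of machines in the cluster
          }
      }
  }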

Re: secondary index feature

2014-01-03 Thread Henning Blohm
Jesse, James, Lars, after looking around a bit and in particular looking into Phoenix (which I find very interesting), assuming that you want secondary indexing on HBase without adding other infrastructure, there really isn't a lot of choice: either go with a region-level (and

Re: secondary index feature

2014-01-03 Thread Anoop John
Is there any data on how RLI (or in particular Phoenix) query throughput correlates with the number of region servers, assuming homogeneously distributed data? Phoenix has yet to add RLI; for now it has global indexing only. Correct, James? The RLI impl from Huawei (HIndex) has some numbers wrt

RE: secondary index feature

2014-01-03 Thread rajeshbabu chintaguntla
Here are some performance numbers with RLI.

No. of region servers: 4
Data per region: 2 GB

Regions/RS | Total regions | Block size (KB) | Rows matching values | Time taken (sec)
50         | 200           | 64              | 199                  | 102
50         | 200           | 8               | 199                  | 35
100        | 400           | 8               | 350                  | 95
200        | 800           | 8               | 353                  | 153

Without secondary index scan is

Re: secondary index feature

2014-01-03 Thread Anoop John
What would be of most interest is how the time taken varies with an increase in # RSs, keeping the number of rows matching values constant. -Anoop- On Fri, Jan 3, 2014 at 3:49 PM, rajeshbabu chintaguntla rajeshbabu.chintagun...@huawei.com wrote: Here are some performance numbers with

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Jean-Marc Spaggiari
Interesting. This is exactly what I'm doing ;) I'm using 3 tables to achieve this. One table with the URL already crawled (80 millions), one URL with the URL to crawl (2 billions) and one URL with the URLs being processed. I'm not running any SQL requests against my dataset but I have MR jobs
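
For illustration, a minimal sketch of creating those three tables with the 0.96-era Java client; the table and column family names are invented here:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CrawlTables {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HBaseAdmin admin = new HBaseAdmin(conf);
          // One table per URL state, as described above: already crawled, to crawl, in process.
          for (String name : new String[] {"url_crawled", "url_to_crawl", "url_processing"}) {
              HTableDescriptor desc = new HTableDescriptor(TableName.valueOf(name));
              desc.addFamily(new HColumnDescriptor("u")); // short family name keeps keys small
              admin.createTable(desc);
          }
          admin.close();
      }
  }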

Re: secondary index feature

2014-01-03 Thread ramkrishna vasudevan
What is generally of interest? RLI or global level? I know it is based on the use case but is there a common need? On Fri, Jan 3, 2014 at 4:31 PM, Anoop John anoop.hb...@gmail.com wrote: A proportional difference in time taken, wrt increase in # RSs (keeping No#rows matching values constant),

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Ted Yu
bq. One URL ... I guess you mean one table ... Cheers On Jan 3, 2014, at 4:19 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Interesting. This is exactly what I'm doing ;) I'm using 3 tables to achieve this. One table with the URL already crawled (80 millions), one URL with the

Re: secondary index feature

2014-01-03 Thread Asaf Mesika
Are the regions scanned in parallel? On Friday, January 3, 2014, rajeshbabu chintaguntla wrote: Here are some performance numbers with RLI. No. of region servers: 4 Data per region: 2 GB Regions/RS | Total regions | Block size (KB) | Rows matching values | Time taken (sec) | 50 | 200 |

Re: secondary index feature

2014-01-03 Thread Ted Yu
I think both approaches should be provided to HBase users. These are new features that would both find proper usage scenarios. Cheers On Jan 3, 2014, at 5:48 AM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote: What is generally of interest? RLI or global level. I know it is

RE: secondary index feature

2014-01-03 Thread rajeshbabu chintaguntla
No, the regions are scanned sequentially. From: Asaf Mesika [asaf.mes...@gmail.com] Sent: Friday, January 03, 2014 7:26 PM To: user@hbase.apache.org Subject: Re: secondary index feature Are the regions scanned in parallel? On Friday, January 3, 2014,

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Jean-Marc Spaggiari
Yes, sorry ;) Thanks for the correction. Should have been: One table with the URL already crawled (80 millions), one table with the URL to crawl (2 billions) and one table with the URLs being processed. I'm not running any SQL requests against my dataset but I have MR jobs doing many different

Re: Snappy compression question

2014-01-03 Thread Ted Yu
See this thread: http://search-hadoop.com/m/LviZD1WPToG/Snappy+libhadoopsubj=RE+Setting+up+Snappy+compression+in+Hadoop On Jan 3, 2014, at 3:20 AM, 张玉雪 zhangyuxue123...@163.com wrote: Hi: When I used hadoop 2.2.0 and hbase 0.96.1.1 to use snappy compression I followed

Re: Snappy compression question

2014-01-03 Thread Jean-Marc Spaggiari
Shameless plug ;) http://www.spaggiari.org/index.php/hbase/how-to-install-snappy-with-1 Keep us posted. 2014/1/3 Ted Yu yuzhih...@gmail.com See this thread: http://search-hadoop.com/m/LviZD1WPToG/Snappy+libhadoopsubj=RE+Setting+up+Snappy+compression+in+Hadoop On Jan 3, 2014, at 3:20 AM,
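
Once the native libraries are in place, enabling snappy is a per-column-family setting. A minimal sketch with the 0.96 Java API (table and family names invented):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.io.compress.Compression;

  public class SnappyTable {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HBaseAdmin admin = new HBaseAdmin(conf);
          HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t1"));
          HColumnDescriptor family = new HColumnDescriptor("f1");
          // Opening the regions will fail if the region servers cannot load the native
          // snappy libraries, which is what the linked guides help set up.
          family.setCompressionType(Compression.Algorithm.SNAPPY);
          desc.addFamily(family);
          admin.createTable(desc);
          admin.close();
      }
  }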

table.jsp not found in 0.96.1.1?

2014-01-03 Thread Jean-Marc Spaggiari
Clicking on a specific table name in the Master WebUI in 0.96.1.1 gives me: HTTP ERROR 404 Problem accessing /table.jsp. Reason: /table.jsp. Only me? Or should I open a JIRA? JM

Re: table.jsp not found in 0.96.1.1?

2014-01-03 Thread Ted Yu
Which tarball did you expand - hadoop1 or hadoop2? Cheers On Fri, Jan 3, 2014 at 10:15 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Clicking on a specific table name in the Master WebUI in 0.96.1.1 give me. HTTP ERROR 404 Problem accessing /table.jsp. Reason: /table.jsp

Re: table.jsp not found in 0.96.1.1?

2014-01-03 Thread Jean-Marc Spaggiari
Hadoop 2. 2014/1/3 Ted Yu yuzhih...@gmail.com Which tar ball did you expand - hadoop1 or hadoop2 ? Cheers On Fri, Jan 3, 2014 at 10:15 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Clicking on a specific table name in the Master WebUI in 0.96.1.1 give me. HTTP ERROR 404

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Asaf Mesika
Couple of notes: 1. When updating the status you essentially add a new rowkey into HBase; I would give it up altogether. The essential requirement seems to point at retrieving a list of urls in a certain order. 2. Wouldn't salting ruin the required sort order (priority, date added)? On Friday,

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread James Taylor
On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika asaf.mes...@gmail.com wrote: Couple of notes: 1. When updating to status you essentially add a new rowkey into HBase, I would give it up all together. The essential requirement seems to point at retrieving a list of urls in a certain order. Not
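
On point 2: within each salt bucket the rows remain sorted by the unsalted key, so ordered reads are still possible if the per-bucket scans are merged (Phoenix handles this transparently). A sketch of the idea in plain Java, with rows simplified to strings:

  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;
  import java.util.PriorityQueue;

  public class SaltBucketMerge {
      // Merge per-bucket scans, each already sorted by the unsalted key, into one global order.
      public static List<String> mergeSorted(List<Iterator<String>> buckets) {
          PriorityQueue<BucketHead> heads = new PriorityQueue<BucketHead>();
          for (Iterator<String> it : buckets) {
              if (it.hasNext()) heads.add(new BucketHead(it.next(), it));
          }
          List<String> out = new ArrayList<String>();
          while (!heads.isEmpty()) {
              BucketHead h = heads.poll(); // smallest current head across all buckets
              out.add(h.row);
              if (h.rest.hasNext()) heads.add(new BucketHead(h.rest.next(), h.rest));
          }
          return out;
      }

      private static final class BucketHead implements Comparable<BucketHead> {
          final String row;
          final Iterator<String> rest;
          BucketHead(String row, Iterator<String> rest) { this.row = row; this.rest = rest; }
          public int compareTo(BucketHead o) { return row.compareTo(o.row); }
      }
  }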

Re: secondary index feature

2014-01-03 Thread Henning Blohm
When scanning in the order of an index using RLI, it seems there is no alternative but to involve all regions - and essentially this should happen in parallel, as otherwise you might not get what you wanted. Also, for a single Get, it seems (as Lars pointed out in

Re: secondary index feature

2014-01-03 Thread James Taylor
Hi Henning, Phoenix maintains a global index. It is essentially maintaining another HBase table for you with a different row key (and a subset of your data table columns that are covered). When an index is used by Phoenix, it is *exactly* like querying a data table (that's what Phoenix does - it
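
A sketch of what that looks like from the client side; the index, table, and column names below are invented:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  public class CoveredIndexExample {
      public static void main(String[] args) throws Exception {
          try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
               Statement stmt = conn.createStatement()) {
              // Phoenix maintains the index as a separate HBase table whose row key leads
              // with last_name; the INCLUDE column is copied into ("covered" by) the index,
              // so a query touching only these columns never visits the data table.
              stmt.execute(
                  "CREATE INDEX idx_person_name ON person (last_name) INCLUDE (first_name)");
          }
      }
  }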

Re: secondary index feature

2014-01-03 Thread Henning Blohm
Hi James, this is a little embarrassing... I even browsed through the code and read it as implementing a region-level index. But now at least I get the restrictions mentioned for using the covered indexes. Thanks for clarifying. Guess I need to browse the code a little harder ;-) Henning

Re: Snappy compression question

2014-01-03 Thread Rural Hunter
The document is far from complete. It doesn't mention that the default hadoop binary package is compiled without snappy support and that you need to compile it with the snappy option yourself. Actually it doesn't work with any native libs on a 64-bit OS, as the libhadoop.so in the binary package is only for 32
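
A quick way to see what the running JVM actually picked up, assuming the Hadoop 2.x jars are on the classpath (NativeCodeLoader is Hadoop's own loader for libhadoop.so):

  import org.apache.hadoop.util.NativeCodeLoader;

  public class NativeSnappyCheck {
      public static void main(String[] args) {
          // false here means libhadoop.so was missing or skipped, e.g. the bundled
          // 32-bit build on a 64-bit OS, as described above
          boolean nativeLoaded = NativeCodeLoader.isNativeCodeLoaded();
          System.out.println("native hadoop loaded: " + nativeLoaded);
          if (nativeLoaded) {
              System.out.println("built with snappy: " + NativeCodeLoader.buildSupportsSnappy());
          }
      }
  }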

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Li Li
hi James, phoenix seems great but it's now only an experimental project. I want to use only hbase. Could you tell me the difference between Phoenix and HBase? If I use hbase only, how should I design the schema and what else do I need for my goal? thank you On Sat, Jan 4, 2014 at 3:41 AM, James

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread James Taylor
Hi LiLi, Phoenix isn't an experimental project. We're on our 2.2 release, and many companies (including the company for which I'm employed, Salesforce.com) use it in production today. Thanks, James On Fri, Jan 3, 2014 at 11:39 PM, Li Li fancye...@gmail.com wrote: hi James, phoenix seems

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Li Li
so what's the relationship between Phoenix and HBase? Something like Hadoop and Hive? On Sat, Jan 4, 2014 at 3:43 PM, James Taylor jtay...@salesforce.com wrote: Hi LiLi, Phoenix isn't an experimental project. We're on our 2.2 release, and many companies (including the company for which I'm