Re: use hbase as distributed crawl's scheduler

2014-01-04 Thread James Taylor
Please take a look at our Apache incubator proposal, as I think that may answer your questions: https://wiki.apache.org/incubator/PhoenixProposal On Fri, Jan 3, 2014 at 11:47 PM, Li Li fancye...@gmail.com wrote: so what's the relationship of Phoenix and HBase? something like hadoop and hive?

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread James Taylor
Sure, no problem. One addition: depending on the cardinality of your priority column, you may want to salt your table to prevent hotspotting, since you'll have a monotonically increasing date in the key. To do that, just add SALT_BUCKETS=n on to your query, where n is the number of machines in
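Editor's sketch of the salting idea mentioned above, not Phoenix's actual implementation. In Phoenix, SALT_BUCKETS is a table option set when the table is created. The general technique is to prepend one byte, derived from a hash of the row key, so that monotonically increasing keys spread over n key ranges instead of piling onto one region. The function name `salted_key` and the use of CRC32 as the hash are assumptions made for illustration only.

```python
import zlib


def salted_key(row_key: bytes, salt_buckets: int) -> bytes:
    """Prefix row_key with a one-byte bucket id in [0, salt_buckets).

    Hypothetical helper: CRC32 stands in for whatever hash the real
    system uses. The salt is a single byte, so this sketch only
    accepts between 1 and 256 buckets.
    """
    if not 1 <= salt_buckets <= 256:
        raise ValueError("salt_buckets must be between 1 and 256")
    salt = zlib.crc32(row_key) % salt_buckets
    return bytes([salt]) + row_key


if __name__ == "__main__":
    # Keys that sort next to each other usually land in different buckets.
    for ts in (1000, 1001, 1002, 1003):
        key = ts.to_bytes(8, "big")
        print(ts, salted_key(key, 4)[0])
```

Because the salt is computed from the key itself, a reader who knows the full key can recompute the bucket and still do a point lookup.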

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Jean-Marc Spaggiari
Interesting. This is exactly what I'm doing ;) I'm using 3 tables to achieve this. One table with the URL already crawled (80 million), one URL with the URL to crawl (2 billion) and one URL with the URLs being processed. I'm not running any SQL requests against my dataset but I have MR jobs

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Ted Yu
bq. One URL ... I guess you mean one table ... Cheers On Jan 3, 2014, at 4:19 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Interesting. This is exactly what I'm doing ;) I'm using 3 tables to achieve this. One table with the URL already crawled (80 million), one URL with the

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Jean-Marc Spaggiari
Yes, sorry ;) Thanks for the correction. Should have been: One table with the URL already crawled (80 million), one table with the URL to crawl (2 billion) and one table with the URLs being processed. I'm not running any SQL requests against my dataset but I have MR jobs doing many different

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Asaf Mesika
Couple of notes: 1. When updating to status you essentially add a new rowkey into HBase; I would give it up altogether. The essential requirement seems to point at retrieving a list of urls in a certain order. 2. Wouldn't salting ruin the sort order required? Priority, date added? On Friday,
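Editor's sketch relating to note 2 above. One general way to keep a global order over salted keys is to scan each salt bucket separately (each bucket is already sorted, since all its keys share the same first byte) and merge the results on the client. This illustrates the technique only; it does not claim to show how any particular system does it. The names `strip_salt` and `top_n_across_buckets` are hypothetical, and each bucket is assumed to hold keys of the form one salt byte followed by the original key.

```python
import heapq
from itertools import islice


def strip_salt(bucket):
    """Yield the keys of one bucket with the leading salt byte removed."""
    for key in bucket:
        yield key[1:]


def top_n_across_buckets(buckets, n):
    """Merge per-bucket sorted scans and return the first n unsalted keys.

    buckets: list of lists of salted keys, each list already sorted,
    as a range scan over a single bucket would return them.
    """
    merged = heapq.merge(*(strip_salt(b) for b in buckets))
    return list(islice(merged, n))


if __name__ == "__main__":
    buckets = [
        [b"\x00b", b"\x00d"],  # bucket 0
        [b"\x01a", b"\x01c"],  # bucket 1
    ]
    print(top_n_across_buckets(buckets, 3))
```

The merge only pulls as many rows from each bucket as it needs, so a top-N query stays cheap even when the buckets are large.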

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread James Taylor
On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika asaf.mes...@gmail.com wrote: Couple of notes: 1. When updating to status you essentially add a new rowkey into HBase; I would give it up altogether. The essential requirement seems to point at retrieving a list of urls in a certain order. Not

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Li Li
hi James, phoenix seems great but it's now only an experimental project. I want to use only hbase. could you tell me the difference between Phoenix and hbase? If I use hbase only, how should I design the schema and some extra things for my goal? thank you On Sat, Jan 4, 2014 at 3:41 AM, James

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread James Taylor
Hi LiLi, Phoenix isn't an experimental project. We're on our 2.2 release, and many companies (including the company for which I'm employed, Salesforce.com) use it in production today. Thanks, James On Fri, Jan 3, 2014 at 11:39 PM, Li Li fancye...@gmail.com wrote: hi James, phoenix seems

Re: use hbase as distributed crawl's scheduler

2014-01-03 Thread Li Li
so what's the relationship of Phoenix and HBase? something like hadoop and hive? On Sat, Jan 4, 2014 at 3:43 PM, James Taylor jtay...@salesforce.com wrote: Hi LiLi, Phoenix isn't an experimental project. We're on our 2.2 release, and many companies (including the company for which I'm

use hbase as distributed crawl's scheduler

2014-01-02 Thread Li Li
hi all, I want to use hbase to store all urls (crawled or not crawled). Each url will have a column named priority which represents the priority of the url. I want to get the top N urls ordered by priority (if the priority is the same, then the url whose timestamp is earlier is preferred). in using
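Editor's sketch of one possible row key layout for the ordering described above. HBase sorts rows by the raw bytes of the row key, so a key that encodes the priority (inverted) followed by the timestamp makes a plain scan return the highest priority first and, within a priority, the earliest timestamp first. Everything here is an assumption for illustration: the function name `make_row_key`, that a larger number means a higher priority, that priorities fit in an unsigned 32-bit integer, and that timestamps are non-negative milliseconds.

```python
import struct

MAX_PRIORITY = 0xFFFFFFFF  # assumed upper bound: unsigned 32-bit priority


def make_row_key(priority: int, timestamp_ms: int, url: str) -> bytes:
    """Hypothetical layout: inverted priority | timestamp | url.

    Big-endian fixed-width integers compare byte-wise in the same
    order as their numeric values, which is what a lexicographically
    sorted store needs.
    """
    inverted = MAX_PRIORITY - priority  # higher priority sorts first
    return struct.pack(">IQ", inverted, timestamp_ms) + url.encode("utf-8")


def top_n(keys, n):
    """Simulate a scan: sort keys as raw bytes and take the first n."""
    return sorted(keys)[:n]


if __name__ == "__main__":
    keys = [
        make_row_key(5, 200, "http://example.com/a"),
        make_row_key(5, 100, "http://example.com/b"),
        make_row_key(9, 300, "http://example.com/c"),
    ]
    for key in top_n(keys, 2):
        print(key[12:].decode("utf-8"))  # skip the 12 header bytes
```

Including the url in the key keeps rows unique when two urls share a priority and a timestamp. The trade-off is that changing a url's priority means deleting the old row and writing a new one.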

Re: use hbase as distributed crawl's scheduler

2014-01-02 Thread Otis Gospodnetic
Hi, Have a look at http://nutch.apache.org . Version 2.x uses HBase under the hood. Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ On Fri, Jan 3, 2014 at 1:12 AM, Li Li fancye...@gmail.com wrote: hi all, I want to

Re: use hbase as distributed crawl's scheduler

2014-01-02 Thread Li Li
thank you. But I can't use nutch. could you tell me how hbase is used in nutch? or is hbase only used to store web pages? On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Have a look at http://nutch.apache.org . Version 2.x uses HBase under the hood.

Re: use hbase as distributed crawl's scheduler

2014-01-02 Thread James Taylor
Otis, I didn't realize Nutch uses HBase underneath. Might be interesting if you serialized data in a Phoenix-compliant manner, as you could run SQL queries directly on top of it. Thanks, James On Thu, Jan 2, 2014 at 10:17 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Have a

Re: use hbase as distributed crawl's scheduler

2014-01-02 Thread Otis Gospodnetic
Hi, Yes. I'm sure that would be a welcome addition. Topic for user@nutch.a.o... Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ On Fri, Jan 3, 2014 at 1:23 AM, James Taylor jtay...@salesforce.com wrote: Otis, I didn't

Re: use hbase as distributed crawl's scheduler

2014-01-02 Thread James Taylor
Hi LiLi, Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL skin on top of HBase. You can model your schema and issue your queries just like you would with MySQL. Something like this: // Create table that optimizes for your most common query // (i.e. the PRIMARY KEY

Re: use hbase as distributed crawl's scheduler

2014-01-02 Thread Li Li
thank you. it's great. On Fri, Jan 3, 2014 at 3:15 PM, James Taylor jtay...@salesforce.com wrote: Hi LiLi, Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL skin on top of HBase. You can model your schema and issue your queries just like you would with MySQL. Something