Re: use hbase as distributed crawl's scheduler

Otis Gospodnetic Thu, 02 Jan 2014 22:19:26 -0800

Hi,

Have a look at http://nutch.apache.org .  Version 2.x uses HBase under the
hood.


Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Jan 3, 2014 at 1:12 AM, Li Li <fancye...@gmail.com> wrote:

> Hi all,
>      I want to use HBase to store all URLs (crawled or not crawled).
> Each URL will have a column named priority, which represents the
> priority of the URL. I want to get the top N URLs ordered by priority (if
> the priority is the same, the URL with the earlier timestamp is preferred).
>      With something like MySQL, my client application might look like:
>      while true:
>          select url from url_db where status='not_crawled'
>              order by priority, addedTime limit 1000;
>          do something with these URLs;
>          extract more URLs and insert them into url_db;
>      How should I design the HBase schema for this application? Is HBase
> suitable for me?
>      I found in this article
> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
> that they use Redis to store URLs. I think HBase originated from
> Bigtable, and Google uses Bigtable to store web pages, so for a huge number
> of URLs I prefer a distributed system like HBase.
>
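
The query in the quoted message (uncrawled URLs ordered by priority, then by
added time) can be approximated in HBase by encoding the sort order into the
row key, since HBase returns rows in lexicographic byte order of their keys.
What follows is only a sketch under stated assumptions, not Nutch's schema or
anything proposed in this thread: it assumes a dedicated table holding only
uncrawled URLs (rows are deleted once crawled), that a smaller priority number
means "crawl sooner" as in the SQL above, and the names make_row_key,
parse_row_key and top_n are hypothetical. A sorted Python list stands in for
an HBase scan with a row limit.

```python
import struct

# Assumed key layout: 4-byte priority | 8-byte added time (ms) | URL bytes.
PREFIX = ">IQ"                        # big-endian, no padding
PREFIX_LEN = struct.calcsize(PREFIX)  # 12 bytes


def make_row_key(priority: int, added_ms: int, url: str) -> bytes:
    """Build a composite row key that sorts by priority, then added time."""
    # Fixed-width big-endian unsigned integers compare byte-wise in the same
    # order as their numeric values, which matches lexicographic row ordering.
    return struct.pack(PREFIX, priority, added_ms) + url.encode("utf-8")


def parse_row_key(key: bytes):
    """Recover (priority, added_ms, url) from a composite row key."""
    priority, added_ms = struct.unpack(PREFIX, key[:PREFIX_LEN])
    return priority, added_ms, key[PREFIX_LEN:].decode("utf-8")


def top_n(keys, n):
    """Stand-in for a scan from the start of the table limited to n rows."""
    return [parse_row_key(k) for k in sorted(keys)[:n]]


keys = [
    make_row_key(2, 100, "http://a.example/"),
    make_row_key(1, 200, "http://b.example/"),
    make_row_key(1, 150, "http://c.example/"),
]
batch = top_n(keys, 2)
```

With this layout the "top N" read is a plain scan from the first row, and
marking a URL as crawled means deleting its row from the queue table. One
trade-off to keep in mind: keys that share a leading priority value land next
to each other, so reads and writes for the most urgent URLs concentrate on
the same region.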
