Re: use hbase as distributed crawl's scheduler

Li Li Thu, 02 Jan 2014 22:24:26 -0800

Thank you, but I can't use Nutch. Could you tell me how HBase is used
in Nutch? Or is HBase only used to store web pages?


On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
<otis.gospodne...@gmail.com> wrote:
> Hi,
>
> Have a look at http://nutch.apache.org .  Version 2.x uses HBase under the
> hood.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Fri, Jan 3, 2014 at 1:12 AM, Li Li <fancye...@gmail.com> wrote:
>
>> Hi all,
>>      I want to use HBase to store all URLs (crawled or not crawled).
>> Each URL will have a column named priority, which represents the
>> priority of the URL. I want to get the top N URLs ordered by priority (if
>> the priority is the same, the URL with the earlier timestamp is preferred).
>>      With something like MySQL, my client application might look like:
>>      while true:
>>          select url from url_db where status='not_crawled'
>>              order by priority, addedTime limit 1000;
>>          do something with these urls;
>>          extract more urls and insert them into url_db;
>>      How should I design the HBase schema for this application? Is HBase
>> suitable for me?
>>      I found this article:
>> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
>> They use Redis to store URLs. HBase originated from Bigtable, and
>> Google uses Bigtable to store web pages, so for a huge number of URLs
>> I prefer a distributed system like HBase.
>>
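
One possible way to approach the schema question above is to encode the sort
order into the row key, since HBase returns rows in byte-wise row-key order.
The sketch below is only an illustration under assumptions that are not in the
original message: the key layout (status, priority, addedTime, URL hash), the
names make_row_key and top_n, and the rule that a lower priority number is
served first are all hypothetical. A plain dict stands in for the table so the
ordering logic can be checked without a cluster.

```python
import hashlib
import struct

# Hypothetical one-byte status prefixes (not from the original post).
NOT_CRAWLED = b"0"
CRAWLED = b"1"


def make_row_key(status: bytes, priority: int, added_ms: int, url: str) -> bytes:
    """Build a row key whose byte order matches 'ORDER BY priority, addedTime'.

    Big-endian unsigned integers compare byte-wise in numeric order, so a
    lexicographic scan sees lower priority numbers first, then earlier times.
    The MD5 suffix only keeps keys unique when priority and time collide.
    """
    suffix = hashlib.md5(url.encode("utf-8")).digest()
    return status + struct.pack(">IQ", priority, added_ms) + suffix


def top_n(table: dict, n: int) -> list:
    """Stand-in for a prefix scan over NOT_CRAWLED rows, limited to n rows."""
    keys = sorted(k for k in table if k.startswith(NOT_CRAWLED))
    return [table[k] for k in keys[:n]]


# Assumed sample data: (url, priority, addedTime in milliseconds).
table = {}
for url, priority, added_ms in [
    ("http://a.example/", 2, 1000),
    ("http://b.example/", 1, 2000),
    ("http://c.example/", 1, 1500),
]:
    table[make_row_key(NOT_CRAWLED, priority, added_ms, url)] = url

print(top_n(table, 2))  # ['http://c.example/', 'http://b.example/']
```

If a higher number should mean more urgent, the key could store
(2**32 - 1 - priority) instead. Two caveats with this layout: marking a URL as
crawled means deleting one row and writing another, which HBase does not do
atomically across rows, and every scheduler reading from the start of the same
key range can turn that region into a hot spot.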
