Thank you, but I can't use Nutch. Could you tell me how HBase is used in Nutch? Or is HBase only used to store web pages?

On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

> Hi,
>
> Have a look at http://nutch.apache.org . Version 2.x uses HBase under the
> hood.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Fri, Jan 3, 2014 at 1:12 AM, Li Li <fancye...@gmail.com> wrote:
>
>> hi all,
>> I want to use HBase to store all urls (crawled or not crawled).
>> Each url will have a column named priority which represents the
>> priority of the url. I want to get the top N urls ordered by priority (if
>> the priority is the same, then the url whose timestamp is earlier is preferred).
>> With something like MySQL, my client application might look like:
>> while true:
>>     select url from url_db where status='not_crawled'
>>         order by priority, addedTime limit 1000;
>>     do something with these urls;
>>     extract more urls and insert them into url_db;
>> How should I design the HBase schema for this application? Is HBase
>> suitable for me?
>> I found that in this article
>> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
>> they use Redis to store urls. I think HBase originated from
>> Bigtable, and Google uses Bigtable to store web pages, so for a huge number
>> of urls I prefer a distributed system like HBase.
>>
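
For reference, here is a rough sketch of one way the "order by priority, addedTime" requirement from the original question might map onto HBase, which keeps rows sorted by the raw bytes of the row key. The idea is to encode priority and added time as fixed-width big-endian integers at the front of the key, so a plain scan returns rows in that order. This is only an illustration in plain Python with no HBase client involved. The function names, the field widths, and the assumption that a smaller priority number is served first (as in the ascending SQL above) are all hypothetical.

```python
import struct

PREFIX_LEN = 12  # 4-byte priority + 8-byte timestamp (assumed widths)

def make_row_key(priority: int, added_ms: int, url: str) -> bytes:
    """Build a key whose byte order matches (priority, added_ms) order.

    Big-endian unsigned integers compare bytewise in numeric order, so a
    store that sorts rows by key bytes returns lower priority values first,
    and earlier timestamps first within the same priority.
    """
    return struct.pack(">IQ", priority, added_ms) + url.encode("utf-8")

def parse_row_key(key: bytes):
    """Split a key back into (priority, added_ms, url)."""
    priority, added_ms = struct.unpack(">IQ", key[:PREFIX_LEN])
    return priority, added_ms, key[PREFIX_LEN:].decode("utf-8")

def top_n(keys, n):
    """Stand-in for a scan limited to n rows: sort by raw bytes, take n."""
    return [parse_row_key(k) for k in sorted(keys)[:n]]

if __name__ == "__main__":
    keys = [
        make_row_key(2, 100, "http://example.com/a"),
        make_row_key(1, 200, "http://example.com/b"),
        make_row_key(1, 50, "http://example.com/c"),
    ]
    for row in top_n(keys, 2):
        print(row)
```

In this sketch the status='not_crawled' filter is not modeled. One possible approach, again an assumption rather than anything stated in this thread, is to keep only uncrawled urls in such a queue table and delete a row once its url has been fetched.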