Hi, have a look at Apache Nutch (http://nutch.apache.org). Version 2.x can use HBase under the hood as its storage backend.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Fri, Jan 3, 2014 at 1:12 AM, Li Li <fancye...@gmail.com> wrote:

> Hi all,
> I want to use HBase to store all URLs (crawled or not crawled).
> Each URL will have a column named priority, which represents the
> priority of the URL. I want to get the top N URLs ordered by priority (if
> the priority is the same, the URL with the earlier timestamp is preferred).
> With something like MySQL, my client application might look like:
> while true:
>     select url from url_db where status='not_crawled'
>         order by priority, addedTime limit 1000;
>     do something with these urls;
>     extract more urls and insert them into url_db;
> How should I design the HBase schema for this application? Is HBase
> suitable for me?
> I found this article,
> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/ ,
> in which they use Redis to store URLs. I think HBase originated from
> Bigtable, and Google uses Bigtable to store web pages, so for a huge number
> of URLs I prefer a distributed system like HBase.
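
For the schema question in the quoted message, one common approach is to encode the sort order into the row key, because HBase returns rows in lexicographic byte order of the key. The following is only a minimal sketch of that idea in plain Python, not something proposed in this thread and not how Nutch stores its data. The key layout, the names `make_row_key`, `top_n` and `url_of`, the one-byte status prefixes, and the assumption that a larger priority number means "more urgent" are all hypothetical. `sorted()` stands in for an HBase scan so the example runs without a cluster.

```python
import struct

# Assumed upper bound on priority values (hypothetical).
MAX_PRIORITY = 2**31 - 1

# Hypothetical one-byte status prefixes; each status gets its own key range.
NOT_CRAWLED = b"0"
CRAWLED = b"1"

# Length of the fixed-width part of the key: 1 (status) + 4 (priority) + 8 (time).
HEADER_LEN = 1 + 4 + 8


def make_row_key(status: bytes, priority: int, added_ms: int, url: str) -> bytes:
    """Build a key whose byte order is: status, priority descending, added time ascending."""
    if not 0 <= priority <= MAX_PRIORITY:
        raise ValueError("priority out of range")
    # Invert the priority so that a higher priority yields a smaller number
    # and therefore sorts first. Drop the inversion if smaller means more urgent.
    inverted = MAX_PRIORITY - priority
    # Big-endian unsigned integers compare the same way as bytes and as numbers.
    return status + struct.pack(">IQ", inverted, added_ms) + url.encode("utf-8")


def top_n(keys, n):
    """Stand-in for a prefix scan over NOT_CRAWLED rows with a limit of n."""
    return sorted(k for k in keys if k.startswith(NOT_CRAWLED))[:n]


def url_of(key: bytes) -> str:
    """Recover the URL from the tail of the row key."""
    return key[HEADER_LEN:].decode("utf-8")


if __name__ == "__main__":
    keys = [
        make_row_key(NOT_CRAWLED, 5, 2000, "http://a.example/"),
        make_row_key(NOT_CRAWLED, 9, 3000, "http://b.example/"),
        make_row_key(NOT_CRAWLED, 9, 1000, "http://c.example/"),
        make_row_key(CRAWLED, 10, 500, "http://d.example/"),
    ]
    # Prints http://c.example/ then http://b.example/
    for key in top_n(keys, 2):
        print(url_of(key))
```

Two caveats with this layout: a row key cannot be updated in place, so marking a URL as crawled means deleting the old row and writing a new one under the other prefix, and because all uncrawled URLs share one key range, reads and writes may concentrate on a few regions.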