use hbase as distributed crawl's scheduler

Li Li Thu, 02 Jan 2014 22:14:26 -0800

hi all,
     I want to use HBase to store all urls (crawled or not crawled).
Each url will have a column named priority which represents the
priority of the url. I want to get the top N urls ordered by priority (if
the priority is the same, the url with the earlier timestamp is preferred).
     With something like MySQL, my client application might look like:
     while true:
         select url from url_db where status='not_crawled'
             order by priority, addedTime limit 1000;
         do something with these urls;
         extract more urls and insert them into url_db;
     How should I design the HBase schema for this application? Is HBase
suitable for this use case?
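     To make the ordering requirement concrete, here is a minimal sketch
of one row key layout I could imagine. It relies on the fact that HBase
keeps rows sorted by row key bytes, so a plain scan with a limit would
return urls in the wanted order. Everything else is an assumption of
mine: the names make_row_key and top_n are hypothetical, a smaller
priority number is assumed to sort first (as in the SQL above), and the
dict is only an in-memory stand-in for a table of not-yet-crawled urls,
not the HBase client API.

```python
import struct


def make_row_key(priority: int, added_time_ms: int, url: str) -> bytes:
    """Build a key whose byte order matches (priority, addedTime) order.

    Assumption: both values are non-negative. Big-endian unsigned
    integers compare byte by byte in the same order as the numbers.
    The url is appended only to keep keys unique.
    """
    return struct.pack(">IQ", priority, added_time_ms) + url.encode("utf-8")


def top_n(rows: dict, n: int) -> list:
    """Stand-in for a scan with a limit: take the first n keys in byte order."""
    return [rows[key] for key in sorted(rows)[:n]]


# In-memory stand-in for the table of urls that are not crawled yet.
pending = {}
for priority, added_time_ms, url in [
    (2, 1000, "http://a.example/"),
    (1, 3000, "http://b.example/"),
    (1, 2000, "http://c.example/"),
]:
    pending[make_row_key(priority, added_time_ms, url)] = url

# Lowest priority number first, earlier timestamp first on ties.
print(top_n(pending, 2))
```

     With this layout the status column would not be needed for the
scan if a row is deleted once its url has been crawled, but I have not
tried this on a real cluster.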
     In this article,
http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/,
they use Redis to store urls. HBase is modeled on Bigtable, and Google
uses Bigtable to store web pages, so for a huge number of urls I would
prefer a distributed system like HBase.
