Sure, no problem. One addition: depending on the cardinality of your priority column, you may want to salt your table to prevent hotspotting, since you'll have a monotonically increasing date in the key. To do that, just add " SALT_BUCKETS=<n>" on to the end of your CREATE TABLE statement, where <n> is the number of machines in your cluster. You can read more about salting here: http://phoenix.incubator.apache.org/salted.html
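For example, here is a sketch of the url_db table from the thread below created with salting. The bucket count of 4 is only an assumed cluster size; pick a value that matches your own region servers:

CREATE TABLE url_db (
    status TINYINT,
    priority INTEGER NOT NULL,
    added_time DATE,
    url VARCHAR NOT NULL
    CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url))
    SALT_BUCKETS=4;  -- assumption: 4-node cluster, so writes spread over 4 buckets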
On Thu, Jan 2, 2014 at 11:36 PM, Li Li <fancye...@gmail.com> wrote:
> thank you. it's great.
>
> On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <jtay...@salesforce.com> wrote:
> > Hi LiLi,
> > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
> > skin on top of HBase. You can model your schema and issue your queries just
> > like you would with MySQL. Something like this:
> >
> > // Create a table that optimizes for your most common query
> > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want your
> > // rows ordered)
> > CREATE TABLE url_db (
> >     status TINYINT,
> >     priority INTEGER NOT NULL,
> >     added_time DATE,
> >     url VARCHAR NOT NULL
> >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
> >
> > int lastStatus = 0;
> > int lastPriority = 0;
> > Date lastAddedTime = new Date(0);
> > String lastUrl = "";
> >
> > while (true) {
> >     // Use a row value constructor to page through results in batches of 1000
> >     String query =
> >         "SELECT * FROM url_db" +
> >         " WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?)" +
> >         " ORDER BY status, priority, added_time, url" +
> >         " LIMIT 1000";
> >     PreparedStatement stmt = connection.prepareStatement(query);
> >
> >     // Bind parameters
> >     stmt.setInt(1, lastStatus);
> >     stmt.setInt(2, lastPriority);
> >     stmt.setDate(3, lastAddedTime);
> >     stmt.setString(4, lastUrl);
> >     ResultSet resultSet = stmt.executeQuery();
> >
> >     while (resultSet.next()) {
> >         // Remember the last row processed so that you can start after it
> >         // for the next batch
> >         lastStatus = resultSet.getInt(1);
> >         lastPriority = resultSet.getInt(2);
> >         lastAddedTime = resultSet.getDate(3);
> >         lastUrl = resultSet.getString(4);
> >
> >         doSomethingWithUrls();
> >
> >         UPSERT INTO url_db(status, priority, added_time, url)
> >         VALUES (1, ?, CURRENT_DATE(), ?);
> >     }
> > }
> >
> > If you need to efficiently query on url, add a secondary index like this:
> >
> > CREATE INDEX url_index ON url_db (url);
> >
> > Please let me know if you have questions.
> >
> > Thanks,
> > James
> >
> > On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fancye...@gmail.com> wrote:
> >
> >> thank you. But I can't use nutch. could you tell me how hbase is used
> >> in nutch? or is hbase only used to store web pages?
> >>
> >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
> >> <otis.gospodne...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Have a look at http://nutch.apache.org . Version 2.x uses HBase under
> >> > the hood.
> >> >
> >> > Otis
> >> > --
> >> > Performance Monitoring * Log Analytics * Search Analytics
> >> > Solr & Elasticsearch Support * http://sematext.com/
> >> >
> >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <fancye...@gmail.com> wrote:
> >> >
> >> >> hi all,
> >> >>     I want to use hbase to store all urls (crawled or not crawled).
> >> >> Each url will have a column named priority which represents the
> >> >> priority of the url. I want to get the top N urls ordered by priority
> >> >> (if the priority is the same, then the url whose timestamp is earlier
> >> >> is preferred).
> >> >>     If using something like mysql, my client application would look like:
> >> >>     while true:
> >> >>         select url from url_db where status='not_crawled'
> >> >>             order by priority, addedTime limit 1000;
> >> >>         do something with these urls;
> >> >>         extract more urls and insert them into url_db;
> >> >>     How should I design an hbase schema for this application? Is hbase
> >> >> suitable for me?
> >> >>     I found in this article
> >> >> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
> >> >> that they use redis to store urls. I think hbase originated from
> >> >> bigtable and google uses bigtable to store web pages, so for a huge
> >> >> number of urls I prefer a distributed system like hbase.
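For readers who want to try James's paging pattern end to end, here is a minimal, self-contained JDBC sketch of the consumer loop. It assumes the url_db table above has already been created, that the Phoenix client jar is on the classpath, and that ZooKeeper is reachable at localhost:2181 (adjust the quorum for your cluster); doSomethingWithUrl() is a hypothetical stand-in for the crawler's own fetch-and-parse logic.

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class UrlQueueWorker {

    // Hypothetical stand-in for the crawler's fetch/parse logic.
    static void doSomethingWithUrl(String url) {
        System.out.println("crawling " + url);
    }

    public static void main(String[] args) throws Exception {
        // Assumed connection string: ZooKeeper quorum on localhost:2181.
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");

        String select =
            "SELECT status, priority, added_time, url FROM url_db" +
            " WHERE status = 0 AND (status, priority, added_time, url) > (?, ?, ?, ?)" +
            " ORDER BY status, priority, added_time, url" +
            " LIMIT 1000";
        String markCrawled =
            "UPSERT INTO url_db (status, priority, added_time, url)" +
            " VALUES (1, ?, CURRENT_DATE(), ?)";

        int lastStatus = 0;
        int lastPriority = 0;
        Date lastAddedTime = new Date(0);
        String lastUrl = "";

        boolean more = true;
        while (more) {
            more = false;  // stop when a batch comes back empty
            try (PreparedStatement stmt = conn.prepareStatement(select)) {
                stmt.setInt(1, lastStatus);
                stmt.setInt(2, lastPriority);
                stmt.setDate(3, lastAddedTime);
                stmt.setString(4, lastUrl);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        more = true;
                        // Remember the last row processed so the next batch starts after it.
                        lastStatus = rs.getInt(1);
                        lastPriority = rs.getInt(2);
                        lastAddedTime = rs.getDate(3);
                        lastUrl = rs.getString(4);

                        doSomethingWithUrl(lastUrl);

                        // Mark the url as crawled. Since status is part of the row key,
                        // this writes a new row; a real crawler would also delete the
                        // old status = 0 row.
                        try (PreparedStatement upsert = conn.prepareStatement(markCrawled)) {
                            upsert.setInt(1, lastPriority);
                            upsert.setString(2, lastUrl);
                            upsert.executeUpdate();
                        }
                    }
                }
            }
            conn.commit();  // Phoenix connections are not auto-commit by default
        }
        conn.close();
    }
}

The explicit conn.commit() per batch is there because Phoenix buffers upserts on the client until commit; batching the commit once per 1000-row page keeps the write path efficient.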