Re: use hbase as distributed crawl's scheduler

Asaf Mesika Fri, 03 Jan 2014 10:51:17 -0800

Couple of notes:
1. When updating the status you essentially add a new rowkey into HBase,
since status is part of the key. I would give it up altogether. The
essential requirement seems to point at retrieving a list of urls in a
certain order.
2. Wouldn't salting ruin the sort order required? Priority, date added?
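
To make point 1 concrete, here is a minimal in-memory sketch (not HBase or Phoenix code) of why changing the status adds a row instead of updating one. It assumes the row key is simply the primary key columns (status, priority, added_time, url) from the schema quoted below, concatenated in order. The class name, the "|" separator and the TreeMap standing in for the table are all made up for illustration.

```java
import java.util.TreeMap;

public class StatusKeySketch {
    // Builds a row key by concatenating the primary key columns in order.
    // The "|" separator is an assumption made for readability only.
    static String rowKey(int status, int priority, String addedTime, String url) {
        return status + "|" + priority + "|" + addedTime + "|" + url;
    }

    // An upsert writes the key it is given; it never touches other keys.
    static void upsert(TreeMap<String, String> table,
                       int status, int priority, String addedTime, String url) {
        table.put(rowKey(status, priority, addedTime, url), url);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the table: a sorted map from row key to value.
        TreeMap<String, String> table = new TreeMap<>();

        upsert(table, 0, 5, "2014-01-03", "http://example.com/");
        // "Updating" status to 1 produces a different key, so a second row appears.
        upsert(table, 1, 5, "2014-01-03", "http://example.com/");

        System.out.println(table.size() + " rows: " + table.keySet());
    }
}
```

In this sketch the status=0 entry is still present after the second upsert, so it would keep matching a status=0 filter unless it were deleted explicitly.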


On Friday, January 3, 2014, James Taylor wrote:

> Sure, no problem. One addition: depending on the cardinality of your
> priority column, you may want to salt your table to prevent hotspotting,
> since you'll have a monotonically increasing date in the key. To do that,
> just add " SALT_BUCKETS=<n>" on to your CREATE TABLE statement, where <n>
> is the number of machines in your cluster. You can read more about salting
> here:
> http://phoenix.incubator.apache.org/salted.html
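
On the sort-order question, here is a rough in-memory sketch of what a salt prefix does to key order. The salt function is a made-up hash-mod-buckets, not Phoenix's actual one, and the final re-sort only stands in for a merge across buckets, which is how I understand a globally ordered result has to be produced. Keys are plain strings, and all class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

public class SaltOrderSketch {
    // Hypothetical salt: hash of the key modulo the bucket count
    // (an assumption for illustration, not Phoenix's real hash).
    static int saltOf(String key, int buckets) {
        return Math.floorMod(key.hashCode(), buckets);
    }

    // Order in which a plain scan would see the keys: sorted by (salt, key).
    static List<String> physicalOrder(List<String> keys, int buckets) {
        TreeMap<String, String> stored = new TreeMap<>();
        for (String key : keys) {
            // Zero-padded salt prefix so string order matches numeric bucket order.
            stored.put(String.format("%03d|%s", saltOf(key, buckets), key), key);
        }
        return new ArrayList<>(stored.values());
    }

    // Order by the unsalted key, recovered here by re-sorting the scan result.
    static List<String> logicalOrder(List<String> keys, int buckets) {
        List<String> result = physicalOrder(keys, buckets);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "0|1|2014-01-01|http://a.example/",
            "0|1|2014-01-03|http://c.example/",
            "0|2|2014-01-02|http://b.example/",
            "0|3|2014-01-02|http://d.example/");
        System.out.println("scan order:    " + physicalOrder(keys, 4));
        System.out.println("logical order: " + logicalOrder(keys, 4));
    }
}
```

Within a single bucket the keys stay sorted, so the cost of salting in this sketch is the extra merge step rather than a loss of order altogether.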
>
>
> On Thu, Jan 2, 2014 at 11:36 PM, Li Li <fancye...@gmail.com> wrote:
>
> > thank you. it's great.
> >
> > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <jtay...@salesforce.com>
> > wrote:
> > > Hi LiLi,
> > > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a
> > > SQL skin on top of HBase. You can model your schema and issue your
> > > queries just like you would with MySQL. Something like this:
> > >
> > > // Create table that optimizes for your most common query
> > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
> > > // your rows ordered)
> > > CREATE TABLE url_db (
> > >     status TINYINT,
> > >     priority INTEGER NOT NULL,
> > >     added_time DATE,
> > >     url VARCHAR NOT NULL
> > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
> > >
> > > int lastStatus = 0;
> > > int lastPriority = 0;
> > > Date lastAddedTime = new Date(0);
> > > String lastUrl = "";
> > >
> > > while (true) {
> > >     // Use row value constructor to page through results in batches
> > >     // of 1000
> > >     String query =
> > >         "SELECT * FROM url_db " +
> > >         "WHERE status=0 " +
> > >         "AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
> > >         "ORDER BY status, priority, added_time, url " +
> > >         "LIMIT 1000";
> > >     PreparedStatement stmt = connection.prepareStatement(query);
> > >
> > >     // Bind parameters
> > >     stmt.setInt(1, lastStatus);
> > >     stmt.setInt(2, lastPriority);
> > >     stmt.setDate(3, lastAddedTime);
> > >     stmt.setString(4, lastUrl);
> > >     ResultSet resultSet = stmt.executeQuery();
> > >
> > >     while (resultSet.next()) {
> > >         // Remember last row processed so that you can start after
> > >         // that for next batch
> > >         lastStatus = resultSet.getInt(1);
> > >         lastPriority = resultSet.getInt(2);
> > >         lastAddedTime = resultSet.getDate(3);
> > >         lastUrl = resultSet.getString(4);
> > >
> > >         doSomethingWithUrls();
> > >
> > >         // Then record the new status with a separate statement:
> > >         //   UPSERT INTO url_db(status, priority, added_time, url)
> > >         //   VALUES (1, ?, CURRENT_DATE(), ?);
> > >     }
> > > }
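
The row value constructor comparison in the loop above is a lexicographic "resume after the last key seen" filter. Here is a minimal in-memory sketch of the same paging pattern, with a two-column key instead of four and a sorted list instead of a table. Row, nextPage and the sample data are invented for illustration, and nothing here talks to Phoenix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class KeysetPagingSketch {
    // Hypothetical two-column key: (priority, url).
    static final class Row {
        final int priority;
        final String url;
        Row(int priority, String url) { this.priority = priority; this.url = url; }
        @Override public String toString() { return priority + ":" + url; }
    }

    // Lexicographic order: priority first, then url. Mirrors (a, b) > (?, ?).
    static final Comparator<Row> KEY_ORDER =
        Comparator.<Row>comparingInt(r -> r.priority).thenComparing(r -> r.url);

    // Returns up to 'limit' rows strictly after 'last', in key order.
    // Assumes 'sortedRows' is already sorted by KEY_ORDER.
    static List<Row> nextPage(List<Row> sortedRows, Row last, int limit) {
        List<Row> page = new ArrayList<>();
        for (Row row : sortedRows) {
            if (KEY_ORDER.compare(row, last) > 0) {
                page.add(row);
                if (page.size() == limit) break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>(Arrays.asList(
            new Row(2, "b"), new Row(1, "a"), new Row(1, "c"),
            new Row(3, "a"), new Row(2, "a")));
        rows.sort(KEY_ORDER);

        // Start key that sorts before every row in this sample.
        Row last = new Row(Integer.MIN_VALUE, "");
        while (true) {
            List<Row> page = nextPage(rows, last, 2);
            if (page.isEmpty()) break;          // no more rows: stop paging
            System.out.println(page);
            last = page.get(page.size() - 1);   // resume after the last row seen
        }
    }
}
```

This sketch stops when a page comes back empty. A crawler that polls forever would presumably sleep and retry at that point instead.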
> > >
> > > If you need to efficiently query on url, add a secondary index like
> > > this:
> > >
> > > CREATE INDEX url_index ON url_db (url);
> > >
> > > Please let me know if you have questions.
> > >
> > > Thanks,
> > > James
> > >
> > >
> > >
> > >
> > > On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fancye...@gmail.com> wrote:
> > >
> > > >> Thank you, but I can't use Nutch. Could you tell me how HBase is
> > > >> used in Nutch? Or is HBase only used to store web pages?
> > >>
> > >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
> > >> <otis.gospodne...@gmail.com> wrote:
> > >> > Hi,
> > >> >
> > > >> > Have a look at http://nutch.apache.org . Version 2.x uses HBase
> > > >> > under the hood.
> > >> >
> > >> > Otis
> > >> > --
> > >> > Performance Monitoring * Log Analytics * Search Analytics
> > >> > Solr & Elasticsearch Support * http://sematext.com/
> > >> >
> > >> >
> > >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <
