A couple of notes:

1. Status is part of the row key, so updating the status essentially adds a new row key to HBase. I would drop status from the key altogether. The essential requirement seems to be retrieving a list of URLs in a certain order.
2. Wouldn't salting ruin the required sort order (priority, then date added)?
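To make point 2 concrete, here is a rough sketch in plain Java. It does not use HBase or Phoenix at all. It assumes salting means "prefix each row key with hash(key) mod N", and the class name, the hash function, and the "priority|date|url" key layout are all made up for illustration. It shows that reading the buckets one after another loses the global order, while each bucket stays sorted internally, so a k-way merge across the buckets can restore it. Whether and when Phoenix does such a merge for a salted table is something I would check against the salting docs rather than assume.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class SaltedOrderSketch {

    // Hypothetical salt function: hash of the row key modulo the bucket count.
    static int saltBucket(String rowKey, int buckets) {
        return Math.floorMod(rowKey.hashCode(), buckets);
    }

    // Put each key into its salt bucket; every bucket is kept sorted,
    // the way a store keeps rows sorted within one key range.
    static List<List<String>> distribute(List<String> rowKeys, int buckets) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            result.add(new ArrayList<>());
        }
        for (String key : rowKeys) {
            result.get(saltBucket(key, buckets)).add(key);
        }
        for (List<String> bucket : result) {
            Collections.sort(bucket);
        }
        return result;
    }

    // Physical order: bucket 0 first, then bucket 1, and so on.
    static List<String> rawScan(List<List<String>> buckets) {
        List<String> out = new ArrayList<>();
        for (List<String> bucket : buckets) {
            out.addAll(bucket);
        }
        return out;
    }

    // k-way merge over the sorted buckets, restoring the unsalted key order.
    // Heap entries are {bucketIndex, positionInBucket}.
    static List<String> mergeSorted(List<List<String>> buckets) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> buckets.get(a[0]).get(a[1]).compareTo(buckets.get(b[0]).get(b[1])));
        for (int i = 0; i < buckets.size(); i++) {
            if (!buckets.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<String> bucket = buckets.get(top[0]);
            out.add(bucket.get(top[1]));
            if (top[1] + 1 < bucket.size()) {
                heap.add(new int[] {top[0], top[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up keys: zero-padded priority, then date added, then url.
        List<String> keys = Arrays.asList(
            "0002|2014-01-03|http://c.example",
            "0001|2014-01-02|http://a.example",
            "0001|2014-01-03|http://b.example",
            "0003|2014-01-01|http://d.example");
        List<List<String>> buckets = distribute(keys, 3);
        System.out.println("bucket-by-bucket: " + rawScan(buckets));
        System.out.println("merged:           " + mergeSorted(buckets));
    }
}
```

So the order is not lost for good, but getting it back costs a merge across all buckets on every read, which is the trade-off I am asking about.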

On Friday, January 3, 2014, James Taylor wrote:

> Sure, no problem. One addition: depending on the cardinality of your
> priority column, you may want to salt your table to prevent hotspotting,
> since you'll have a monotonically increasing date in the key. To do that,
> just add " SALT_BUCKETS=<n>" on to your query, where <n> is the number of
> machines in your cluster. You can read more about salting here:
> http://phoenix.incubator.apache.org/salted.html
>
>
> On Thu, Jan 2, 2014 at 11:36 PM, Li Li <fancye...@gmail.com> wrote:
>
> > thank you. it's great.
> >
> > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <jtay...@salesforce.com>
> > wrote:
> > > Hi LiLi,
> > > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a
> > > SQL skin on top of HBase. You can model your schema and issue your
> > > queries just like you would with MySQL. Something like this:
> > >
> > > // Create table that optimizes for your most common query
> > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
> > > // your rows ordered)
> > > CREATE TABLE url_db (
> > >     status TINYINT,
> > >     priority INTEGER NOT NULL,
> > >     added_time DATE,
> > >     url VARCHAR NOT NULL
> > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
> > >
> > > int lastStatus = 0;
> > > int lastPriority = 0;
> > > Date lastAddedTime = new Date(0);
> > > String lastUrl = "";
> > >
> > > while (true) {
> > >     // Use row value constructor to page through results in batches
> > >     // of 1000
> > >     String query = "
> > >         SELECT * FROM url_db
> > >         WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?)
> > >         ORDER BY status, priority, added_time, url
> > >         LIMIT 1000"
> > >     PreparedStatement stmt = connection.prepareStatement(query);
> > >
> > >     // Bind parameters
> > >     stmt.setInt(1, lastStatus);
> > >     stmt.setInt(2, lastPriority);
> > >     stmt.setDate(3, lastAddedTime);
> > >     stmt.setString(4, lastUrl);
> > >     ResultSet resultSet = stmt.executeQuery();
> > >
> > >     while (resultSet.next()) {
> > >         // Remember last row processed so that you can start after
> > >         // that for next batch
> > >         lastStatus = resultSet.getInt(1);
> > >         lastPriority = resultSet.getInt(2);
> > >         lastAddedTime = resultSet.getDate(3);
> > >         lastUrl = resultSet.getString(4);
> > >
> > >         doSomethingWithUrls();
> > >
> > >         UPSERT INTO url_db(status, priority, added_time, url)
> > >         VALUES (1, ?, CURRENT_DATE(), ?);
> > >
> > >     }
> > > }
> > >
> > > If you need to efficiently query on url, add a secondary index like
> > > this:
> > >
> > > CREATE INDEX url_index ON url_db (url);
> > >
> > > Please let me know if you have questions.
> > >
> > > Thanks,
> > > James
> > >
> > >
> > > On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fancye...@gmail.com> wrote:
> > >
> > >> thank you. But I can't use nutch. could you tell me how hbase is used
> > >> in nutch? or hbase is only used to store webpage.
> > >>
> > >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
> > >> <otis.gospodne...@gmail.com> wrote:
> > >> > Hi,
> > >> >
> > >> > Have a look at http://nutch.apache.org . Version 2.x uses HBase
> > >> > under the hood.
> > >> >
> > >> > Otis
> > >> > --
> > >> > Performance Monitoring * Log Analytics * Search Analytics
> > >> > Solr & Elasticsearch Support * http://sematext.com/
> > >> >
> > >> >
> > >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <