Sure, no problem. One addition: depending on the cardinality of your priority column, you may want to salt your table to prevent hotspotting, since you'll have a monotonically increasing date in the key. To do that, just add " SALT_BUCKETS=<n>" on to the end of your CREATE TABLE statement, where <n> is the number of machines in your cluster. You can read more about salting here: http://phoenix.incubator.apache.org/salted.html
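For example, here is a sketch of the url_db table from the thread below created with salting. The bucket count of 4 is only an assumed cluster size; pick a value that matches your own region servers:

CREATE TABLE url_db (
    status TINYINT,
    priority INTEGER NOT NULL,
    added_time DATE,
    url VARCHAR NOT NULL
    CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url))
    SALT_BUCKETS=4;  -- assumption: 4-node cluster, so writes spread over 4 buckets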
On Thu, Jan 2, 2014 at 11:36 PM, Li Li <fancye...@gmail.com> wrote:
> thank you. it's great.
>
> On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <jtay...@salesforce.com> wrote:
> > Hi LiLi,
> > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
> > skin on top of HBase. You can model your schema and issue your queries just
> > like you would with MySQL. Something like this:
> >
> > // Create a table that optimizes for your most common query
> > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want your
> > // rows ordered)
> > CREATE TABLE url_db (
> >     status TINYINT,
> >     priority INTEGER NOT NULL,
> >     added_time DATE,
> >     url VARCHAR NOT NULL
> >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
> >
> > int lastStatus = 0;
> > int lastPriority = 0;
> > Date lastAddedTime = new Date(0);
> > String lastUrl = "";
> >
> > while (true) {
> >     // Use a row value constructor to page through results in batches of 1000
> >     String query =
> >         "SELECT * FROM url_db" +
> >         " WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?)" +
> >         " ORDER BY status, priority, added_time, url" +
> >         " LIMIT 1000";
> >     PreparedStatement stmt = connection.prepareStatement(query);
> >
> >     // Bind parameters
> >     stmt.setInt(1, lastStatus);
> >     stmt.setInt(2, lastPriority);
> >     stmt.setDate(3, lastAddedTime);
> >     stmt.setString(4, lastUrl);
> >     ResultSet resultSet = stmt.executeQuery();
> >
> >     while (resultSet.next()) {
> >         // Remember the last row processed so that you can start after it
> >         // for the next batch
> >         lastStatus = resultSet.getInt(1);
> >         lastPriority = resultSet.getInt(2);
> >         lastAddedTime = resultSet.getDate(3);
> >         lastUrl = resultSet.getString(4);
> >
> >         doSomethingWithUrls();
> >
> >         UPSERT INTO url_db(status, priority, added_time, url)
> >         VALUES (1, ?, CURRENT_DATE(), ?);
> >     }
> > }
> >
> > If you need to efficiently query on url, add a secondary index like this:
> >
> > CREATE INDEX url_index ON url_db (url);
> >
> > Please let me know if you have questions.
> >
> > Thanks,
> > James
> >
> > On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fancye...@gmail.com> wrote:
> >
> >> thank you. But I can't use nutch. could you tell me how hbase is used
> >> in nutch? or is hbase only used to store web pages?
> >>
> >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
> >> <otis.gospodne...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Have a look at http://nutch.apache.org . Version 2.x uses HBase under
> >> > the hood.
> >> >
> >> > Otis
> >> > --
> >> > Performance Monitoring * Log Analytics * Search Analytics
> >> > Solr & Elasticsearch Support * http://sematext.com/
> >> >
> >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <fancye...@gmail.com> wrote:
> >> >
> >> >> hi all,
> >> >>     I want to use hbase to store all urls (crawled or not crawled).
> >> >> Each url will have a column named priority which represents the
> >> >> priority of the url. I want to get the top N urls ordered by priority
> >> >> (if the priority is the same, then the url whose timestamp is earlier
> >> >> is preferred).
> >> >>     If using something like mysql, my client application would look like:
> >> >>     while true:
> >> >>         select url from url_db where status='not_crawled'
> >> >>             order by priority, addedTime limit 1000;
> >> >>         do something with these urls;
> >> >>         extract more urls and insert them into url_db;
> >> >>     How should I design an hbase schema for this application? Is hbase
> >> >> suitable for me?
> >> >>     I found in this article
> >> >> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
> >> >> that they use redis to store urls. I think hbase originated from
> >> >> bigtable and google uses bigtable to store web pages, so for a huge
> >> >> number of urls I prefer a distributed system like hbase.
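For readers who want to try James's paging pattern end to end, here is a minimal, self-contained JDBC sketch of the consumer loop. It assumes the url_db table above has already been created, that the Phoenix client jar is on the classpath, and that ZooKeeper is reachable at localhost:2181 (adjust the quorum for your cluster); doSomethingWithUrl() is a hypothetical stand-in for the crawler's own fetch-and-parse logic.

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class UrlQueueWorker {

    // Hypothetical stand-in for the crawler's fetch/parse logic.
    static void doSomethingWithUrl(String url) {
        System.out.println("crawling " + url);
    }

    public static void main(String[] args) throws Exception {
        // Assumed connection string: ZooKeeper quorum on localhost:2181.
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");

        String select =
            "SELECT status, priority, added_time, url FROM url_db" +
            " WHERE status = 0 AND (status, priority, added_time, url) > (?, ?, ?, ?)" +
            " ORDER BY status, priority, added_time, url" +
            " LIMIT 1000";
        String markCrawled =
            "UPSERT INTO url_db (status, priority, added_time, url)" +
            " VALUES (1, ?, CURRENT_DATE(), ?)";

        int lastStatus = 0;
        int lastPriority = 0;
        Date lastAddedTime = new Date(0);
        String lastUrl = "";

        boolean more = true;
        while (more) {
            more = false;  // stop when a batch comes back empty
            try (PreparedStatement stmt = conn.prepareStatement(select)) {
                stmt.setInt(1, lastStatus);
                stmt.setInt(2, lastPriority);
                stmt.setDate(3, lastAddedTime);
                stmt.setString(4, lastUrl);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        more = true;
                        // Remember the last row processed so the next batch starts after it.
                        lastStatus = rs.getInt(1);
                        lastPriority = rs.getInt(2);
                        lastAddedTime = rs.getDate(3);
                        lastUrl = rs.getString(4);

                        doSomethingWithUrl(lastUrl);

                        // Mark the url as crawled. Since status is part of the row key,
                        // this writes a new row; a real crawler would also delete the
                        // old status = 0 row.
                        try (PreparedStatement upsert = conn.prepareStatement(markCrawled)) {
                            upsert.setInt(1, lastPriority);
                            upsert.setString(2, lastUrl);
                            upsert.executeUpdate();
                        }
                    }
                }
            }
            conn.commit();  // Phoenix connections are not auto-commit by default
        }
        conn.close();
    }
}

The explicit conn.commit() per batch is there because Phoenix buffers upserts on the client until commit; batching the commit once per 1000-row page keeps the write path efficient.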