Hi James, Phoenix seems great, but it's currently only an experimental project, so I want to use only HBase. Could you tell me the difference between Phoenix and HBase? If I use HBase alone, how should I design the schema, and what else would I need to handle to reach my goal? Thank you.
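For reference, the salting behaviour James describes in the quoted thread below (parallel scans per salt bucket, then a merge sort on the client) can be pictured with a minimal Java sketch. This is only an illustration under stated assumptions, not Phoenix's actual implementation: the class name `SaltedScanMerge`, the method `mergeSorted`, and the string row keys are hypothetical, and each bucket's result is assumed to arrive already sorted by its unsalted row key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: not Phoenix's actual implementation.
class SaltedScanMerge {

    // One cursor per salt bucket: the current head row plus the rest of that bucket's scan.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> rest) {
            this.rest = rest;
            this.head = rest.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    // Merges per-bucket results (each assumed sorted by unsalted row key)
    // into a single list in global row key order.
    static List<String> mergeSorted(List<List<String>> buckets) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> bucket : buckets) {
            Iterator<String> it = bucket.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();      // bucket whose head row sorts first
            merged.add(c.head);
            if (c.rest.hasNext()) {      // advance that bucket and re-queue it
                c.head = c.rest.next();
                heap.add(c);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up row keys in "status|priority|added_time|url" form, one list per bucket.
        List<List<String>> buckets = Arrays.asList(
            Arrays.asList("0|5|2014-01-01|http://a.example", "0|7|2014-01-03|http://d.example"),
            Arrays.asList("0|5|2014-01-02|http://b.example"),
            Arrays.asList("0|6|2014-01-01|http://c.example"));
        for (String row : mergeSorted(buckets)) {
            System.out.println(row);
        }
    }
}
```

With plain HBase the same idea would apply: scan each salt prefix separately, strip the prefix, and merge on the client, which is the work Phoenix is described as doing automatically.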
On Sat, Jan 4, 2014 at 3:41 AM, James Taylor <jtay...@salesforce.com> wrote:

> On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:
>
>> Couple of notes:
>> 1. When updating to status you essentially add a new rowkey into HBase, I
>> would give it up all together. The essential requirement seems to point at
>> retrieving a list of urls in a certain order.
>>
> Not sure on this, but seemed to me that setting the status field is forcing
> the urls that have been processed to be at the end of the sort order.
>
>> 2. Wouldn't salting ruin the sort order required? Priority, date added?
>>
> No, as Phoenix maintains returning rows in row key order even when they're
> salted. We do parallel scans for each bucket and do a merge sort on the
> client, so the cost is pretty low for this (we also provide a way of
> turning this off if your use case doesn't need it).
>
> Two years, JM? Now you're really going to have to start using Phoenix :-)
>
>
>> On Friday, January 3, 2014, James Taylor wrote:
>>
>> > Sure, no problem. One addition: depending on the cardinality of your
>> > priority column, you may want to salt your table to prevent hotspotting,
>> > since you'll have a monotonically increasing date in the key. To do that,
>> > just add " SALT_BUCKETS=<n>" on to your query, where <n> is the number of
>> > machines in your cluster. You can read more about salting here:
>> > http://phoenix.incubator.apache.org/salted.html
>> >
>> >
>> > On Thu, Jan 2, 2014 at 11:36 PM, Li Li <fancye...@gmail.com> wrote:
>> >
>> > > thank you. it's great.
>> > >
>> > > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <jtay...@salesforce.com>
>> > > wrote:
>> > > > Hi LiLi,
>> > > > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
>> > > > skin on top of HBase. You can model your schema and issue your queries just
>> > > > like you would with MySQL.
>> > > > Something like this:
>> > > >
>> > > > // Create table that optimizes for your most common query
>> > > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
>> > > > // your rows ordered)
>> > > > CREATE TABLE url_db (
>> > > >     status TINYINT,
>> > > >     priority INTEGER NOT NULL,
>> > > >     added_time DATE,
>> > > >     url VARCHAR NOT NULL
>> > > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>> > > >
>> > > > int lastStatus = 0;
>> > > > int lastPriority = 0;
>> > > > Date lastAddedTime = new Date(0);  // java.sql.Date
>> > > > String lastUrl = "";
>> > > >
>> > > > while (true) {
>> > > >     // Use row value constructor to page through results in batches of 1000
>> > > >     String query =
>> > > >         "SELECT * FROM url_db " +
>> > > >         "WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
>> > > >         "ORDER BY status, priority, added_time, url " +
>> > > >         "LIMIT 1000";
>> > > >     PreparedStatement stmt = connection.prepareStatement(query);
>> > > >
>> > > >     // Bind parameters
>> > > >     stmt.setInt(1, lastStatus);
>> > > >     stmt.setInt(2, lastPriority);
>> > > >     stmt.setDate(3, lastAddedTime);
>> > > >     stmt.setString(4, lastUrl);
>> > > >     ResultSet resultSet = stmt.executeQuery();
>> > > >
>> > > >     int rows = 0;
>> > > >     while (resultSet.next()) {
>> > > >         rows++;
>> > > >         // Remember last row processed so that you can start after that
>> > > >         // for next batch
>> > > >         lastStatus = resultSet.getInt(1);
>> > > >         lastPriority = resultSet.getInt(2);
>> > > >         lastAddedTime = resultSet.getDate(3);
>> > > >         lastUrl = resultSet.getString(4);
>> > > >
>> > > >         doSomethingWithUrls();
>> > > >
>> > > >         // Write the url back with status 1 to mark it as processed
>> > > >         PreparedStatement upsert = connection.prepareStatement(
>> > > >             "UPSERT INTO url_db(status, priority, added_time, url) " +
>> > > >             "VALUES (1, ?, CURRENT_DATE(), ?)");
>> > > >         upsert.setInt(1, lastPriority);
>> > > >         upsert.setString(2, lastUrl);
>> > > >         upsert.executeUpdate();
>> > > >     }
>> > > >     // Upserts are only sent on commit unless auto-commit is enabled
>> > > >     connection.commit();
>> > > >     if (rows == 0) break;  // no more unprocessed urls
>> > > > }
>> > > >
>> > > > If you need to efficiently query on url, add a secondary index like this:
>> > > >
>> > > > CREATE INDEX url_index ON url_db (url);
>> > > >
>> > > > Please let me know if you have questions.
>> > > >
>> > > > Thanks,
>> > > > James
>> > > >
>> > > > On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fancye...@gmail.com> wrote:
>> > > >
>> > > >> thank you. But I can't use nutch. could you tell me how hbase is used
>> > > >> in nutch? or hbase is only used to store webpage.
>> > > >>
>> > > >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
>> > > >> <otis.gospodne...@gmail.com> wrote:
>> > > >> > Hi,
>> > > >> >
>> > > >> > Have a look at http://nutch.apache.org . Version 2.x uses HBase under
>> > > >> > the hood.
>> > > >> >
>> > > >> > Otis
>> > > >> > --
>> > > >> > Performance Monitoring * Log Analytics * Search Analytics
>> > > >> > Solr & Elasticsearch Support * http://sematext.com/
>> > > >> >
>> > > >> >
>> > > >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <
>>