So what's the relationship between Phoenix and HBase? Something like Hadoop and Hive?
On Sat, Jan 4, 2014 at 3:43 PM, James Taylor <jtay...@salesforce.com> wrote:
> Hi LiLi,
> Phoenix isn't an experimental project. We're on our 2.2 release, and many
> companies (including the company for which I'm employed, Salesforce.com)
> use it in production today.
> Thanks,
> James
>
> On Fri, Jan 3, 2014 at 11:39 PM, Li Li <fancye...@gmail.com> wrote:
>> hi James,
>> Phoenix seems great, but it's now only an experimental project. I want
>> to use only HBase. Could you tell me the difference between Phoenix and
>> HBase? If I use HBase only, how should I design the schema and the
>> other things needed for my goal? Thank you.
>>
>> On Sat, Jan 4, 2014 at 3:41 AM, James Taylor <jtay...@salesforce.com> wrote:
>>> On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>>
>>>> Couple of notes:
>>>> 1. When updating the status you essentially add a new row key into
>>>> HBase; I would give it up altogether. The essential requirement seems
>>>> to point at retrieving a list of URLs in a certain order.
>>>
>>> Not sure on this, but it seemed to me that setting the status field is
>>> forcing the URLs that have been processed to the end of the sort order.
>>>
>>>> 2. Wouldn't salting ruin the required sort order (priority, date added)?
>>>
>>> No, as Phoenix maintains returning rows in row key order even when
>>> they're salted. We do parallel scans for each bucket and a merge sort on
>>> the client, so the cost is pretty low for this (we also provide a way of
>>> turning this off if your use case doesn't need it).
>>>
>>> Two years, JM? Now you're really going to have to start using Phoenix :-)
>>>
>>>> On Friday, January 3, 2014, James Taylor wrote:
>>>>
>>>>> Sure, no problem. One addition: depending on the cardinality of your
>>>>> priority column, you may want to salt your table to prevent
>>>>> hotspotting, since you'll have a monotonically increasing date in the
>>>>> key. To do that, just add "SALT_BUCKETS=<n>" to your CREATE TABLE
>>>>> statement, where <n> is the number of machines in your cluster. You
>>>>> can read more about salting here:
>>>>> http://phoenix.incubator.apache.org/salted.html
>>>>>
>>>>> On Thu, Jan 2, 2014 at 11:36 PM, Li Li <fancye...@gmail.com> wrote:
>>>>>
>>>>>> thank you. it's great.
>>>>>>
>>>>>> On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <jtay...@salesforce.com> wrote:
>>>>>>> Hi LiLi,
>>>>>>> Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's
>>>>>>> a SQL skin on top of HBase. You can model your schema and issue your
>>>>>>> queries just like you would with MySQL. Something like this:
>>>>>>>
>>>>>>> // Create a table that optimizes for your most common query
>>>>>>> // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
>>>>>>> // your rows ordered)
>>>>>>> CREATE TABLE url_db (
>>>>>>>     status TINYINT,
>>>>>>>     priority INTEGER NOT NULL,
>>>>>>>     added_time DATE,
>>>>>>>     url VARCHAR NOT NULL
>>>>>>>     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>>>>>>>
>>>>>>> int lastStatus = 0;
>>>>>>> int lastPriority = 0;
>>>>>>> Date lastAddedTime = new Date(0);
>>>>>>> String lastUrl = "";
>>>>>>>
>>>>>>> while (true) {
>>>>>>>     // Use a row value constructor to page through results in
>>>>>>>     // batches of 1000
>>>>>>>     String query =
>>>>>>>         "SELECT * FROM url_db " +
>>>>>>>         "WHERE status = 0 " +
>>>>>>>         "AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
>>>>>>>         "ORDER BY status, priority, added_time, url " +
>>>>>>>         "LIMIT 1000";
>>>>>>>     PreparedStatement stmt = connection.prepareStatement(query);
>>>>>>>
>>>>>>>     // Bind parameters
>>>>>>>     stmt.setInt(1, lastStatus);
>>>>>>>     stmt.setInt(2, lastPriority);
>>>>>>>     stmt.setDate(3, lastAddedTime);
>>>>>>>     stmt.setString(4, lastUrl);
>>>>>>>     ResultSet resultSet = stmt.executeQuery();
>>>>>>>
>>>>>>>     while (resultSet.next()) {
>>>>>>>         // Remember the last row processed so that you can start
>>>>>>>         // after it for the next batch
>>>>>>>         lastStatus = resultSet.getInt(1);
>>>>>>>         lastPriority = resultSet.getInt(2);
>>>>>>>         lastAddedTime = resultSet.getDate(3);
>>>>>>>         lastUrl = resultSet.getString(4);
>>>>>>>
>>>>>>>         doSomethingWithUrls();
>>>>>>>
>>>>>>>         // Mark the url as processed
>>>>>>>         PreparedStatement upsert = connection.prepareStatement(
>>>>>>>             "UPSERT INTO url_db(status, priority, added_time, url) " +
>>>>>>>             "VALUES (1, ?, CURRENT_DATE(), ?)");
>>>>>>>         upsert.setInt(1, lastPriority);
>>>>>>>         upsert.setString(2, lastUrl);
>>>>>>>         upsert.executeUpdate();
>>>>>>>     }
>>>>>>>     // Phoenix connections don't auto-commit by default
>>>>>>>     connection.commit();
>>>>>>> }
>>>>>>>
>>>>>>> If you need to efficiently query on url, add a secondary index like
>>>>>>> this:
>>>>>>>
>>>>>>> CREATE INDEX url_index ON url_db (url);
>>>>>>>
>>>>>>> Please let me know if you have questions.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> James
>>>>>>>
>>>>>>> On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fancye...@gmail.com> wrote:
>>>>>>>
>>>>>>>> thank you. But I can't use Nutch. Could you tell me how HBase is
>>>>>>>> used in Nutch? Or is HBase only used to store web pages?
>>>>>>>>
>>>>>>>> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
>>>>>>>> <otis.gospodne...@gmail.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Have a look at http://nutch.apache.org . Version 2.x uses HBase
>>>>>>>>> under the hood.
>>>>>>>>>
>>>>>>>>> Otis
>>>>>>>>> --
>>>>>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>>>
>>>>>>>>> On Fri, Jan 3, 2014 at 1:12 AM, Li Li <
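[Editor's note] Combining James's salting advice with the schema from the thread gives something like the following sketch; SALT_BUCKETS=4 is an arbitrary placeholder for the number of machines in the cluster, not a recommendation:

```sql
CREATE TABLE url_db (
    status TINYINT,
    priority INTEGER NOT NULL,
    added_time DATE,
    url VARCHAR NOT NULL
    CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url))
    SALT_BUCKETS = 4;
```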
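[Editor's note] The row value constructor in the WHERE clause above is plain lexicographic tuple comparison, which is what makes the keyset paging restartable from the last row seen. Below is a minimal stand-alone sketch of that comparison and batching logic in plain Java, with no Phoenix involved; `Row`, `nextBatch`, and the sample data are illustrative names, not part of any Phoenix API:

```java
import java.util.*;

public class KeysetPagingSketch {
    // Stand-in for one row of url_db's primary key:
    // (status, priority, added_time, url).
    record Row(int status, int priority, long addedTime, String url) {}

    // Lexicographic tuple comparison -- the semantics of Phoenix's
    // (status, priority, added_time, url) > (?, ?, ?, ?) row value constructor.
    static final Comparator<Row> KEY_ORDER =
            Comparator.comparingInt(Row::status)
                      .thenComparingInt(Row::priority)
                      .thenComparingLong(Row::addedTime)
                      .thenComparing(Row::url);

    // One "page": up to limit rows strictly after lastSeen, in key order.
    static List<Row> nextBatch(NavigableSet<Row> table, Row lastSeen, int limit) {
        List<Row> page = new ArrayList<>();
        for (Row r : table.tailSet(lastSeen, false)) { // false = exclusive bound
            if (page.size() == limit) break;
            page.add(r);
        }
        return page;
    }

    public static void main(String[] args) {
        NavigableSet<Row> table = new TreeSet<>(KEY_ORDER);
        for (int i = 0; i < 5; i++) {
            table.add(new Row(0, i % 2, 1000 + i, "http://example.com/" + i));
        }
        // Start "before" every row, exactly like the zero/empty initial
        // values in the thread's loop.
        Row last = new Row(0, 0, 0L, "");
        int seen = 0;
        List<Row> batch;
        while (!(batch = nextBatch(table, last, 2)).isEmpty()) {
            seen += batch.size();
            last = batch.get(batch.size() - 1); // remember last row processed
        }
        System.out.println(seen); // prints 5: every row visited exactly once
    }
}
```

The same pattern scales to any composite key: as long as the ORDER BY matches the key order and the cursor tuple is the full key of the last processed row, each batch resumes exactly where the previous one stopped.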