So what's the relationship between Phoenix and HBase? Something like Hadoop and Hive?
On Sat, Jan 4, 2014 at 3:43 PM, James Taylor <jtay...@salesforce.com> wrote:
> Hi LiLi,
> Phoenix isn't an experimental project. We're on our 2.2 release, and many
> companies (including the company for which I'm employed, Salesforce.com)
> use it in production today.
> Thanks,
> James
>
> On Fri, Jan 3, 2014 at 11:39 PM, Li Li <fancye...@gmail.com> wrote:
>> hi James,
>> Phoenix seems great, but it's now only an experimental project. I want
>> to use only HBase. Could you tell me the difference between Phoenix and
>> HBase? If I use HBase only, how should I design the schema and the
>> other things needed for my goal? Thank you.
>>
>> On Sat, Jan 4, 2014 at 3:41 AM, James Taylor <jtay...@salesforce.com> wrote:
>>> On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:
>>>
>>>> Couple of notes:
>>>> 1. When updating the status you essentially add a new row key into
>>>> HBase; I would give it up altogether. The essential requirement seems
>>>> to point at retrieving a list of URLs in a certain order.
>>>
>>> Not sure on this, but it seemed to me that setting the status field is
>>> forcing the URLs that have been processed to the end of the sort order.
>>>
>>>> 2. Wouldn't salting ruin the required sort order (priority, date added)?
>>>
>>> No, as Phoenix maintains returning rows in row key order even when
>>> they're salted. We do parallel scans for each bucket and a merge sort on
>>> the client, so the cost is pretty low for this (we also provide a way of
>>> turning this off if your use case doesn't need it).
>>>
>>> Two years, JM? Now you're really going to have to start using Phoenix :-)
>>>
>>>> On Friday, January 3, 2014, James Taylor wrote:
>>>>
>>>>> Sure, no problem. One addition: depending on the cardinality of your
>>>>> priority column, you may want to salt your table to prevent
>>>>> hotspotting, since you'll have a monotonically increasing date in the
>>>>> key. To do that, just add "SALT_BUCKETS=<n>" to your CREATE TABLE
>>>>> statement, where <n> is the number of machines in your cluster. You
>>>>> can read more about salting here:
>>>>> http://phoenix.incubator.apache.org/salted.html
>>>>>
>>>>> On Thu, Jan 2, 2014 at 11:36 PM, Li Li <fancye...@gmail.com> wrote:
>>>>>
>>>>>> thank you. it's great.
>>>>>>
>>>>>> On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <jtay...@salesforce.com> wrote:
>>>>>>> Hi LiLi,
>>>>>>> Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's
>>>>>>> a SQL skin on top of HBase. You can model your schema and issue your
>>>>>>> queries just like you would with MySQL. Something like this:
>>>>>>>
>>>>>>> // Create a table that optimizes for your most common query
>>>>>>> // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
>>>>>>> // your rows ordered)
>>>>>>> CREATE TABLE url_db (
>>>>>>>     status TINYINT,
>>>>>>>     priority INTEGER NOT NULL,
>>>>>>>     added_time DATE,
>>>>>>>     url VARCHAR NOT NULL
>>>>>>>     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>>>>>>>
>>>>>>> int lastStatus = 0;
>>>>>>> int lastPriority = 0;
>>>>>>> Date lastAddedTime = new Date(0);
>>>>>>> String lastUrl = "";
>>>>>>>
>>>>>>> while (true) {
>>>>>>>     // Use a row value constructor to page through results in
>>>>>>>     // batches of 1000
>>>>>>>     String query =
>>>>>>>         "SELECT * FROM url_db " +
>>>>>>>         "WHERE status = 0 " +
>>>>>>>         "AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
>>>>>>>         "ORDER BY status, priority, added_time, url " +
>>>>>>>         "LIMIT 1000";
>>>>>>>     PreparedStatement stmt = connection.prepareStatement(query);
>>>>>>>
>>>>>>>     // Bind parameters
>>>>>>>     stmt.setInt(1, lastStatus);
>>>>>>>     stmt.setInt(2, lastPriority);
>>>>>>>     stmt.setDate(3, lastAddedTime);
>>>>>>>     stmt.setString(4, lastUrl);
>>>>>>>     ResultSet resultSet = stmt.executeQuery();
>>>>>>>
>>>>>>>     while (resultSet.next()) {
>>>>>>>         // Remember the last row processed so that you can start
>>>>>>>         // after it for the next batch
>>>>>>>         lastStatus = resultSet.getInt(1);
>>>>>>>         lastPriority = resultSet.getInt(2);
>>>>>>>         lastAddedTime = resultSet.getDate(3);
>>>>>>>         lastUrl = resultSet.getString(4);
>>>>>>>
>>>>>>>         doSomethingWithUrls();
>>>>>>>
>>>>>>>         // Mark the url as processed
>>>>>>>         PreparedStatement upsert = connection.prepareStatement(
>>>>>>>             "UPSERT INTO url_db(status, priority, added_time, url) " +
>>>>>>>             "VALUES (1, ?, CURRENT_DATE(), ?)");
>>>>>>>         upsert.setInt(1, lastPriority);
>>>>>>>         upsert.setString(2, lastUrl);
>>>>>>>         upsert.executeUpdate();
>>>>>>>     }
>>>>>>>     // Phoenix connections don't auto-commit by default
>>>>>>>     connection.commit();
>>>>>>> }
>>>>>>>
>>>>>>> If you need to efficiently query on url, add a secondary index like
>>>>>>> this:
>>>>>>>
>>>>>>> CREATE INDEX url_index ON url_db (url);
>>>>>>>
>>>>>>> Please let me know if you have questions.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> James
>>>>>>>
>>>>>>> On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fancye...@gmail.com> wrote:
>>>>>>>
>>>>>>>> thank you. But I can't use Nutch. Could you tell me how HBase is
>>>>>>>> used in Nutch? Or is HBase only used to store web pages?
>>>>>>>>
>>>>>>>> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
>>>>>>>> <otis.gospodne...@gmail.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Have a look at http://nutch.apache.org . Version 2.x uses HBase
>>>>>>>>> under the hood.
>>>>>>>>>
>>>>>>>>> Otis
>>>>>>>>> --
>>>>>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>>>
>>>>>>>>> On Fri, Jan 3, 2014 at 1:12 AM, Li Li <
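[Editor's note] Combining James's salting advice with the schema from the thread gives something like the following sketch; SALT_BUCKETS=4 is an arbitrary placeholder for the number of machines in the cluster, not a recommendation:

```sql
CREATE TABLE url_db (
    status TINYINT,
    priority INTEGER NOT NULL,
    added_time DATE,
    url VARCHAR NOT NULL
    CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url))
    SALT_BUCKETS = 4;
```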
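[Editor's note] The row value constructor in the WHERE clause above is plain lexicographic tuple comparison, which is what makes the keyset paging restartable from the last row seen. Below is a minimal stand-alone sketch of that comparison and batching logic in plain Java, with no Phoenix involved; `Row`, `nextBatch`, and the sample data are illustrative names, not part of any Phoenix API:

```java
import java.util.*;

public class KeysetPagingSketch {
    // Stand-in for one row of url_db's primary key:
    // (status, priority, added_time, url).
    record Row(int status, int priority, long addedTime, String url) {}

    // Lexicographic tuple comparison -- the semantics of Phoenix's
    // (status, priority, added_time, url) > (?, ?, ?, ?) row value constructor.
    static final Comparator<Row> KEY_ORDER =
            Comparator.comparingInt(Row::status)
                      .thenComparingInt(Row::priority)
                      .thenComparingLong(Row::addedTime)
                      .thenComparing(Row::url);

    // One "page": up to limit rows strictly after lastSeen, in key order.
    static List<Row> nextBatch(NavigableSet<Row> table, Row lastSeen, int limit) {
        List<Row> page = new ArrayList<>();
        for (Row r : table.tailSet(lastSeen, false)) { // false = exclusive bound
            if (page.size() == limit) break;
            page.add(r);
        }
        return page;
    }

    public static void main(String[] args) {
        NavigableSet<Row> table = new TreeSet<>(KEY_ORDER);
        for (int i = 0; i < 5; i++) {
            table.add(new Row(0, i % 2, 1000 + i, "http://example.com/" + i));
        }
        // Start "before" every row, exactly like the zero/empty initial
        // values in the thread's loop.
        Row last = new Row(0, 0, 0L, "");
        int seen = 0;
        List<Row> batch;
        while (!(batch = nextBatch(table, last, 2)).isEmpty()) {
            seen += batch.size();
            last = batch.get(batch.size() - 1); // remember last row processed
        }
        System.out.println(seen); // prints 5: every row visited exactly once
    }
}
```

The same pattern scales to any composite key: as long as the ORDER BY matches the key order and the cursor tuple is the full key of the last processed row, each batch resumes exactly where the previous one stopped.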