Re: use hbase as distributed crawl's scheduler

Li Li Fri, 03 Jan 2014 23:41:09 -0800

hi James,
    Phoenix seems great, but it's currently only an experimental project. I
want to use only HBase. Could you tell me the difference between Phoenix
and HBase? If I use HBase only, how should I design the schema, and what
else would I need to do for my goal? Thank you.


On Sat, Jan 4, 2014 at 3:41 AM, James Taylor <jtay...@salesforce.com> wrote:
> On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:
>
>> Couple of notes:
>> 1. When updating the status you essentially add a new row key to HBase; I
>> would give it up altogether. The essential requirement seems to point at
>> retrieving a list of urls in a certain order.
>>
> Not sure on this, but seemed to me that setting the status field is forcing
> the urls that have been processed to be at the end of the sort order.
>
>> 2. Wouldn't salting ruin the sort order required? Priority, date added?
>
> No, as Phoenix maintains returning rows in row key order even when they're
> salted. We do parallel scans for each bucket and do a merge sort on the
> client, so the cost is pretty low for this (we also provide a way of
> turning this off if your use case doesn't need it).
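The "parallel scans for each bucket, then a merge sort on the client" step can be pictured with a small k-way merge. This is only a sketch of the general idea, not Phoenix's actual implementation: the class name `BucketMerge`, the use of plain strings as row keys, and the sample rows are all assumptions made for illustration, and each bucket's result is assumed to be already sorted with its salt prefix stripped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper: merges per-bucket scan results into one sorted list.
class BucketMerge {
    // One heap entry per bucket: its current head row plus the rest of its scan.
    private static final class Head {
        final String row;
        final Iterator<String> rest;

        Head(String row, Iterator<String> rest) {
            this.row = row;
            this.rest = rest;
        }
    }

    // Each inner list must already be sorted by row key.
    static List<String> merge(List<List<String>> buckets) {
        PriorityQueue<Head> heap =
            new PriorityQueue<>((a, b) -> a.row.compareTo(b.row));
        for (List<String> bucket : buckets) {
            Iterator<String> it = bucket.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // Take the smallest head, then refill the heap from the same bucket.
            Head smallest = heap.poll();
            merged.add(smallest.row);
            if (smallest.rest.hasNext()) {
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up rows in "status|priority|url" form, one list per salt bucket.
        List<List<String>> buckets = Arrays.asList(
            Arrays.asList("0|5|a.com", "0|9|d.com"),
            Arrays.asList("0|5|b.com"),
            Arrays.asList("0|7|c.com", "1|1|e.com"));
        System.out.println(merge(buckets));
    }
}
```

With n rows spread over k buckets the merge costs roughly O(n log k) comparisons, which is consistent with the cost being described as low for a modest number of buckets.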
>
> Two years, JM? Now you're really going to have to start using Phoenix :-)
>
>
>> On Friday, January 3, 2014, James Taylor wrote:
>>
>> > Sure, no problem. One addition: depending on the cardinality of your
>> > priority column, you may want to salt your table to prevent hotspotting,
>> > since you'll have a monotonically increasing date in the key. To do that,
>> > just add " SALT_BUCKETS=<n>" to the end of your CREATE TABLE statement,
>> > where <n> is the number of machines in your cluster. You can read more
>> > about salting here:
>> > http://phoenix.incubator.apache.org/salted.html
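For anyone building the key by hand in plain HBase, the salting idea can be sketched roughly as below. This is an illustration under stated assumptions, not Phoenix's actual key encoding: the class `SaltedRowKey`, the field layout (salt, status, priority, added time in milliseconds, url), and the hash function are all hypothetical choices, and the bucket count is assumed to be between 1 and 256 so the salt fits in one byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: builds salt | status | priority | addedTime | url.
class SaltedRowKey {
    static byte[] build(int buckets, byte status, int priority,
                        long addedTimeMillis, String url) {
        byte[] urlBytes = url.getBytes(StandardCharsets.UTF_8);
        ByteBuffer body = ByteBuffer.allocate(1 + 4 + 8 + urlBytes.length);
        // Status is assumed to be a small non-negative value such as 0 or 1.
        body.put(status);
        // Flip the sign bit so signed values sort correctly when the
        // bytes are compared as unsigned, big-endian values.
        body.putInt(priority ^ Integer.MIN_VALUE);
        body.putLong(addedTimeMillis ^ Long.MIN_VALUE);
        body.put(urlBytes);
        byte[] unsalted = body.array();

        // The salt is derived from the rest of the key, so the same row
        // always maps to the same bucket and can be looked up again.
        byte salt = (byte) Math.floorMod(Arrays.hashCode(unsalted), buckets);
        return ByteBuffer.allocate(1 + unsalted.length)
            .put(salt)
            .put(unsalted)
            .array();
    }

    public static void main(String[] args) {
        byte[] key = build(4, (byte) 0, 5, 1388822400000L, "http://example.com/");
        System.out.println("bucket=" + key[0] + " length=" + key.length);
    }
}
```

Because the leading byte spreads otherwise adjacent keys across buckets, a reader that wants rows in priority and date order has to scan every bucket and merge the results on the client.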
>> >
>> >
>> > On Thu, Jan 2, 2014 at 11:36 PM, Li Li <fancye...@gmail.com> wrote:
>> >
>> > > thank you. it's great.
>> > >
>> > > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <jtay...@salesforce.com>
>> > > wrote:
>> > > > Hi LiLi,
>> > > > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
>> > > > skin on top of HBase. You can model your schema and issue your queries just
>> > > > like you would with MySQL. Something like this:
>> > > >
>> > > > // Create table that optimizes for your most common query
>> > > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want your
>> > > > // rows ordered)
>> > > > CREATE TABLE url_db (
>> > > >     status TINYINT,
>> > > >     priority INTEGER NOT NULL,
>> > > >     added_time DATE,
>> > > >     url VARCHAR NOT NULL
>> > > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>> > > >
>> > > > int lastStatus = 0;
>> > > > int lastPriority = 0;
>> > > > Date lastAddedTime = new Date(0);
>> > > > String lastUrl = "";
>> > > >
>> > > > while (true) {
>> > > >     // Use row value constructor to page through results in batches of 1000
>> > > >     String query =
>> > > >         "SELECT * FROM url_db " +
>> > > >         "WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
>> > > >         "ORDER BY status, priority, added_time, url " +
>> > > >         "LIMIT 1000";
>> > > >     PreparedStatement stmt = connection.prepareStatement(query);
>> > > >
>> > > >     // Bind parameters
>> > > >     stmt.setInt(1, lastStatus);
>> > > >     stmt.setInt(2, lastPriority);
>> > > >     stmt.setDate(3, lastAddedTime);
>> > > >     stmt.setString(4, lastUrl);
>> > > >     ResultSet resultSet = stmt.executeQuery();
>> > > >
>> > > >     while (resultSet.next()) {
>> > > >         // Remember last row processed so that you can start after that for
>> > > >         // next batch
>> > > >         lastStatus = resultSet.getInt(1);
>> > > >         lastPriority = resultSet.getInt(2);
>> > > >         lastAddedTime = resultSet.getDate(3);
>> > > >         lastUrl = resultSet.getString(4);
>> > > >
>> > > >         doSomethingWithUrls();
>> > > >
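>> > > >         // Mark the URL as processed by writing a row with status 1
>> > > >         PreparedStatement upsert = connection.prepareStatement(
>> > > >             "UPSERT INTO url_db(status, priority, added_time, url) " +
>> > > >             "VALUES (1, ?, CURRENT_DATE(), ?)");
>> > > >         upsert.setInt(1, lastPriority);
>> > > >         upsert.setString(2, lastUrl);
>> > > >         upsert.executeUpdate();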
>> > > >
>> > > >     }
>> > > > }
>> > > >
>> > > > If you need to efficiently query on url, add a secondary index like this:
>> > > >
>> > > > CREATE INDEX url_index ON url_db (url);
>> > > >
>> > > > Please let me know if you have questions.
>> > > >
>> > > > Thanks,
>> > > > James
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fancye...@gmail.com> wrote:
>> > > >
>> > > >> thank you. But I can't use nutch. Could you tell me how hbase is used
>> > > >> in nutch? Or is hbase only used to store webpages?
>> > > >>
>> > > >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
>> > > >> <otis.gospodne...@gmail.com> wrote:
>> > > >> > Hi,
>> > > >> >
>> > > >> > Have a look at http://nutch.apache.org .  Version 2.x uses HBase under
>> > > >> > the hood.
>> > > >> >
>> > > >> > Otis
>> > > >> > --
>> > > >> > Performance Monitoring * Log Analytics * Search Analytics
>> > > >> > Solr & Elasticsearch Support * http://sematext.com/
>> > > >> >
>> > > >> >
>> > > >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <
>>
