bq. One URL ...

I guess you mean one table ...

Cheers

On Jan 3, 2014, at 4:19 AM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:

> Interesting. This is exactly what I'm doing ;)
> 
> I'm using 3 tables to achieve this.
> 
> One table with the URLs already crawled (80 million), one URL with the URLs
> to crawl (2 billion), and one URL with the URLs being processed. I'm not
> running any SQL requests against my dataset, but I have MR jobs doing many
> different things. I have many other tables to help with the work on the
> URLs.
> 
> I'm "salting" the keys using the URL hash so I can find them back very
> quickly. There can be some collisions so I store also the URL itself on the
> key. So very small scans returning 1 or something 2 rows allow me to
> quickly find a row knowing the URL.
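> 
> Roughly, the key construction looks like this (a sketch; NUM_BUCKETS and
> the variable names are illustrative, not the actual code):
> 
> import org.apache.hadoop.hbase.util.Bytes;
> 
> // Salt prefix derived from the URL hash, so it can be recomputed from
> // the URL alone when looking a row back up.
> byte salt = (byte) ((url.hashCode() & 0x7fffffff) % NUM_BUCKETS);
> // Append the full URL after the salt so hash collisions are resolved
> // by a short scan returning 1 or 2 rows.
> byte[] rowKey = Bytes.add(new byte[] { salt }, Bytes.toBytes(url));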
> 
> I also have secondary index tables that store the CRCs of the pages, so I
> can identify duplicate pages based on that value.
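> 
> For the CRCs, something along these lines (a sketch using java.util.zip;
> the index key layout shown is assumed, not the exact one):
> 
> import java.util.zip.CRC32;
> 
> CRC32 crc = new CRC32();
> crc.update(pageBytes);  // raw content of the fetched page
> // CRC first so duplicate pages sort next to each other; the URL is
> // appended to keep the index key unique.
> byte[] dupKey = Bytes.add(Bytes.toBytes(crc.getValue()), Bytes.toBytes(url));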
> 
> And so on ;) I've been working on that for 2 years now. I might have been
> able to use Nutch and others, but my goal was to learn and do that with a
> distributed client on a single dataset...
> 
> Enjoy.
> 
> JM
> 
> 
> 2014/1/3 James Taylor <jtay...@salesforce.com>
> 
>> Sure, no problem. One addition: depending on the cardinality of your
>> priority column, you may want to salt your table to prevent hotspotting,
>> since you'll have a monotonically increasing date in the key. To do that,
>> just add " SALT_BUCKETS=<n>" to the end of your CREATE TABLE statement,
>> where <n> is the number of machines in your cluster. You can read more
>> about salting here:
>> http://phoenix.incubator.apache.org/salted.html
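>> 
>> For example, salting the url_db table would look like this (a bucket
>> count of 4 is just for illustration):
>> 
>> CREATE TABLE url_db (
>>     status TINYINT,
>>     priority INTEGER NOT NULL,
>>     added_time DATE,
>>     url VARCHAR NOT NULL
>>     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url))
>>     SALT_BUCKETS=4;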
>> 
>> 
>> On Thu, Jan 2, 2014 at 11:36 PM, Li Li <fancye...@gmail.com> wrote:
>> 
>>> thank you. it's great.
>>> 
>>> On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <jtay...@salesforce.com>
>>> wrote:
>>>> Hi LiLi,
>>>> Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a
>>>> SQL skin on top of HBase. You can model your schema and issue your
>>>> queries just like you would with MySQL. Something like this:
>>>> 
>>>> // Create a table that optimizes for your most common query
>>>> // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
>>>> // your rows ordered)
>>>> CREATE TABLE url_db (
>>>>     status TINYINT,
>>>>     priority INTEGER NOT NULL,
>>>>     added_time DATE,
>>>>     url VARCHAR NOT NULL
>>>>     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>>>> 
>>>> int lastStatus = 0;
>>>> int lastPriority = 0;
>>>> Date lastAddedTime = new Date(0);  // java.sql.Date
>>>> String lastUrl = "";
>>>> 
>>>> while (true) {
>>>>     // Use a row value constructor to page through results in batches
>>>>     // of 1000
>>>>     String query =
>>>>         "SELECT * FROM url_db " +
>>>>         "WHERE status = 0 " +
>>>>         "AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
>>>>         "ORDER BY status, priority, added_time, url " +
>>>>         "LIMIT 1000";
>>>>     PreparedStatement stmt = connection.prepareStatement(query);
>>>> 
>>>>     // Bind parameters to start after the last row already seen
>>>>     stmt.setInt(1, lastStatus);
>>>>     stmt.setInt(2, lastPriority);
>>>>     stmt.setDate(3, lastAddedTime);
>>>>     stmt.setString(4, lastUrl);
>>>>     ResultSet resultSet = stmt.executeQuery();
>>>> 
>>>>     while (resultSet.next()) {
>>>>         // Remember the last row processed so that the next batch can
>>>>         // start after it
>>>>         lastStatus = resultSet.getInt(1);
>>>>         lastPriority = resultSet.getInt(2);
>>>>         lastAddedTime = resultSet.getDate(3);
>>>>         lastUrl = resultSet.getString(4);
>>>> 
>>>>         doSomethingWithUrls();
>>>> 
>>>>         // Mark the URL as crawled
>>>>         PreparedStatement upsert = connection.prepareStatement(
>>>>             "UPSERT INTO url_db(status, priority, added_time, url) " +
>>>>             "VALUES (1, ?, CURRENT_DATE(), ?)");
>>>>         upsert.setInt(1, lastPriority);
>>>>         upsert.setString(2, lastUrl);
>>>>         upsert.executeUpdate();
>>>>     }
>>>>     // Phoenix connections don't auto-commit by default
>>>>     connection.commit();
>>>> }
>>>> 
>>>> If you need to efficiently query on url, add a secondary index like
>>>> this:
>>>> 
>>>> CREATE INDEX url_index ON url_db (url);
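>>>> 
>>>> For example, a lookup by URL like the following (the URL is
>>>> illustrative) can then be served from the index, since the data
>>>> table's primary key columns are included in it:
>>>> 
>>>> SELECT status, priority, added_time FROM url_db
>>>> WHERE url = 'http://example.com/';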
>>>> 
>>>> Please let me know if you have questions.
>>>> 
>>>> Thanks,
>>>> James
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fancye...@gmail.com> wrote:
>>>> 
>>>>> thank you. But I can't use Nutch. Could you tell me how HBase is used
>>>>> in Nutch? Or is HBase only used to store webpages?
>>>>> 
>>>>> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
>>>>> <otis.gospodne...@gmail.com> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> Have a look at http://nutch.apache.org . Version 2.x uses HBase under
>>>>>> the hood.
>>>>>> 
>>>>>> Otis
>>>>>> --
>>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>> 
>>>>>> 
>>>>>> On Fri, Jan 3, 2014 at 1:12 AM, Li Li <fancye...@gmail.com> wrote:
>>>>>> 
>>>>>>> hi all,
>>>>>>>     I want to use HBase to store all URLs (crawled or not crawled).
>>>>>>> Each URL will have a column named priority, which represents the
>>>>>>> priority of the URL. I want to get the top N URLs ordered by
>>>>>>> priority (if the priority is the same, then the URL whose timestamp
>>>>>>> is earlier is preferred).
>>>>>>>     If I were using something like MySQL, my client application
>>>>>>> might look like:
>>>>>>>     while true:
>>>>>>>         select url from url_db where status='not_crawled'
>>>>>>>             order by priority, addedTime limit 1000;
>>>>>>>         do something with these urls;
>>>>>>>         extract more urls and insert them into url_db;
>>>>>>>     How should I design an HBase schema for this application? Is
>>>>>>> HBase suitable for me?
>>>>>>>     I found in this article
>>>>>>> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
>>>>>>> that they use Redis to store URLs. I think HBase originated from
>>>>>>> Bigtable, and Google uses Bigtable to store webpages, so for a huge
>>>>>>> number of URLs I prefer a distributed system like HBase.
>> 
