[ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730380#action_12730380
 ] 

Doğacan Güney edited comment on NUTCH-650 at 7/13/09 8:41 AM:
--------------------------------------------------------------

Many changes.

First, for simplicity, I changed the master branch to be the main development 
branch. So to take a look at nutchbase, simply do:

git clone git://github.com/dogacan/nutchbase.git

(sorry Andrew for the random change :)

* Upgraded to hbase trunk and hadoop 0.20.

* FetcherHbase now fetches URLs in reduce(). I added a randomization step so 
that reduce does not get URLs from the same host one after another, but in a 
random order. Politeness rules are still followed, and one host will always 
end up in one reducer no matter how many URLs it has (at least, that's what I 
tried to do; testing is welcome :). 
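The idea above could be sketched roughly like this (a toy model, not the actual FetcherHbase code; partitionFor and randomize are made-up names): all URLs of one host land in one reducer partition, and each reducer shuffles its input so one host's URLs are not fetched back to back.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy sketch: partition by host (so per-host politeness can be enforced in
// a single reducer), then shuffle the fetch order inside the reducer.
public class HostShuffleSketch {

    // Every URL from the same host maps to the same reducer partition.
    static int partitionFor(String url, int numReducers) {
        // Crude host extraction, for illustration only.
        String host = url.replaceFirst("^[a-z]+://", "").split("/")[0];
        return (host.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Within a reducer, randomize the order URLs are fetched in.
    static List<String> randomize(List<String> urls, long seed) {
        List<String> shuffled = new ArrayList<>(urls);
        Collections.shuffle(shuffled, new Random(seed));
        return shuffled;
    }
}
```

In the real patch the partitioning is presumably done by a Hadoop Partitioner keyed on host; this only illustrates the invariant being claimed (one host, one reducer, shuffled order).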

* If your fetch is cut short, you lose almost no fetched URLs, as we 
immediately write the fetched content to the table*. For example, if you are 
doing a HUGE one-day fetch, and at the 20th hour your fetch dies, then 20 
hours' worth of fetched URLs will already be in hbase. The next execution of 
FetcherHbase will simply pick up where it left off.

* The same goes for ParseTable. If a parse crashes midstream, the next 
execution will continue from the crash point*.

* Added a "-restart" option for ParseTable and FetcherHbase. If "-restart" is 
present, these classes start from the beginning instead of continuing from 
wherever the last run finished.
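A rough model of the resume/"-restart" semantics described above (toy code with made-up names, not the actual table schema): each processed row carries a mark, a normal run skips marked rows, and "-restart" ignores the marks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of resume semantics: the map stands in for the hbase table,
// mapping row key -> "already processed in this batch?". A normal run only
// selects unprocessed rows; with -restart, every row is selected again.
public class ResumeSketch {
    static List<String> rowsToProcess(Map<String, Boolean> table, boolean restart) {
        List<String> pending = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : table.entrySet()) {
            if (restart || !e.getValue()) {
                pending.add(e.getKey());
            }
        }
        return pending;
    }
}
```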

* Added a "-reindex" option to IndexerHbase to reindex the entire table 
(normally, only URLs successfully parsed in that iteration are processed).

* Added a SolrIndexerHbase so you can use solr with hbase (which is awesome :). 
It also has a "-reindex" option.

*= We do not write content truly immediately, as the hbase client code uses a 
write buffer to batch updates. Still, you will lose very few URLs as opposed 
to all of them (and the write buffer size can be made smaller for more safety).
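To make the footnote concrete, here is a toy model (not the hbase client API) of why a smaller write buffer loses less on a crash: puts accumulate client-side and only reach storage when the buffer fills or is flushed, so a crash loses at most one buffer's worth of rows.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a client-side write buffer: puts are batched and reach
// "storage" only when the buffer fills (or on an explicit flush). A crash
// before a flush loses at most bufferSize - 1 rows.
public class WriteBufferSketch {
    private final int bufferSize;
    private final List<String> buffer = new ArrayList<>();
    private final List<String> stored = new ArrayList<>();

    WriteBufferSketch(int bufferSize) { this.bufferSize = bufferSize; }

    void put(String row) {
        buffer.add(row);
        if (buffer.size() >= bufferSize) {
            flush();
        }
    }

    void flush() {
        stored.addAll(buffer);
        buffer.clear();
    }

    // What survives if the process dies right now, before the next flush.
    List<String> survivors() { return new ArrayList<>(stored); }
}
```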

There is still more work to do (such as updating scoring for hbase), but most 
of it is, IMHO, ready. Can I get some reviews on what people think of the 
general direction, the API, etc.? Because this (and katta integration) are my 
priorities for the next nutch.

> Hbase Integration
> -----------------
>
>                 Key: NUTCH-650
>                 URL: https://issues.apache.org/jira/browse/NUTCH-650
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.1
>
>         Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
> malformedurl.patch, meta.patch, meta2.patch, nofollow-hbase.patch, 
> nutch-habase.patch, searching.diff, slash.patch
>
>
> This issue will track nutch/hbase integration

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
