[jira] Updated: (NUTCH-442) Integrate Solr/Nutch

JIRA Tue, 31 Jul 2007 06:20:16 -0700

     [ 
https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Doğacan Güney updated NUTCH-442:
--------------------------------

    Attachment: RFC_multiple_search_backends.patch

Here is my (very large - sorry) patch for this issue:

The patch consists of two parts:

1) Support for multiple indexing backends: See NUTCH-520 for details. This 
patch includes the latest patch from NUTCH-520.

2) Support for multiple search backends:

* DistributedSearch.Client is removed.

* Search is divided into three main parts:
    - SearchBean: implements Searcher and HitDetailer
    - SegmentBean: implements HitContent and HitSummarizer
    - HitInlinks: same as old

    This division may seem arbitrary (and it actually is), but these 
abstractions are enough for both Solr and nutch's own search server to work. If 
new search backends need further abstractions later, they can be 
added.

     This division also has a nice side effect: Currently, a search server 
searches lucene indexes _and_ generates summaries for results. After this patch, 
it is possible to start a search server that searches an index and a 
'segment server' (that returns cached content of pages, generates summaries, 
etc.) separately. DistributedSearch$IndexServer (uses LuceneSearchBean) and 
DistributedSearch$SegmentServer (uses FetchedSegments) classes are added for 
this.

*  SearchBean hierarchy is like this:

   SearchBean (extends Searcher, HitDetailer)

       RPCSearchBean (extends SearchBean, VersionedProtocol)

           LuceneSearchBean (implements RPCSearchBean, searches lucene indexes 
(may be local or on dfs), can also respond to RPC requests)

       SolrSearchBean (implements SearchBean, processes responses from a SOLR 
server)

       DistributedSearchBean (implements SearchBean, is also a container of 
SearchBeans. This class implements the searching part of 
DistributedSearch$Client. It sends requests to multiple beans in parallel and 
merges their results. It does not use the RPC.call API (since not all beans 
support hadoop's RPC); instead it uses a modern thread pool for parallel 
requests.)
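
As a rough illustration of the thread-pool fan-out described for DistributedSearchBean, something along these lines could work. This is not code from the patch: the Bean interface, the searchAll method, the use of plain scores as hits and the 5 second timeout are all hypothetical choices made for the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class FanOutSketch {

    /** Hypothetical stand-in for a search backend; a hit is reduced to its score. */
    interface Bean {
        List<Double> search(String query) throws Exception;
    }

    /** Queries every bean in parallel and returns the topN scores, best first. */
    static List<Double> searchAll(List<Bean> beans, String query, int topN) {
        List<Double> merged = new ArrayList<>();
        if (beans.isEmpty()) {
            return merged;
        }
        // One worker per bean, so a slow backend does not delay the others.
        ExecutorService pool = Executors.newFixedThreadPool(beans.size());
        try {
            List<Callable<List<Double>>> tasks = new ArrayList<>();
            for (Bean bean : beans) {
                tasks.add(() -> bean.search(query));
            }
            // invokeAll waits until all tasks finish or the (assumed) timeout expires.
            for (Future<List<Double>> f : pool.invokeAll(tasks, 5, TimeUnit.SECONDS)) {
                try {
                    merged.addAll(f.get());
                } catch (Exception e) {
                    // A failed or timed-out bean contributes no hits.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        merged.sort(Collections.reverseOrder());
        return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
    }
}
```

A real implementation would keep the pool alive between requests and merge full hit objects rather than bare scores; the sketch only shows the parallel request and merge steps.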

* Locations of remote nutch/lucene servers are still read from 
crawl/search-servers.txt. Locations of solr servers are read from 
crawl/solr-servers.txt (yes, it supports searching more than one solr 
server).
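
To make the server list idea concrete, here is a minimal sketch of parsing such a file. It assumes one "host port" pair per line separated by whitespace, which may not match what the patch expects (solr-servers.txt in particular may well hold URLs instead). ServerListSketch and parse are hypothetical names.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

final class ServerListSketch {

    /** Parses lines of the assumed form "host port"; blank lines are skipped. */
    static List<InetSocketAddress> parse(List<String> lines) {
        List<InetSocketAddress> servers = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected 'host port': " + raw);
            }
            // Unresolved on purpose: no DNS lookup happens while reading the config.
            servers.add(InetSocketAddress.createUnresolved(parts[0], Integer.parseInt(parts[1])));
        }
        return servers;
    }
}
```

The lines could come from java.nio.file.Files.readAllLines on the config file; the sketch takes a list so that it stays independent of where the file lives.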

* DistributedSearchBean routinely sends pings to its beans. If a bean fails to 
respond, it is removed from the active list of search servers (so that it 
doesn't block searching). For example, if a solr server dies, 
DistributedSearchBean notices this and stops sending search requests to it. 
Later, when solr comes back up, DistributedSearchBean re-adds it to the active 
search server list.
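
The ping behaviour described above can be sketched roughly as follows. This is not the code in the patch: Pingable, LivenessSketch, pingOnce and the scheduling period are hypothetical, and the sketch assumes that a ping either returns true or throws.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class LivenessSketch {

    /** Hypothetical stand-in for a search backend that can be pinged. */
    interface Pingable {
        boolean ping() throws Exception;
    }

    private final List<Pingable> all;
    private final Set<Pingable> active = ConcurrentHashMap.newKeySet();

    LivenessSketch(List<Pingable> beans) {
        this.all = new ArrayList<>(beans);
        this.active.addAll(beans);
    }

    /** Pings every known bean; dead ones leave the active set, recovered ones rejoin it. */
    void pingOnce() {
        for (Pingable bean : all) {
            boolean alive;
            try {
                alive = bean.ping();
            } catch (Exception e) {
                alive = false;
            }
            if (alive) {
                active.add(bean);
            } else {
                active.remove(bean);
            }
        }
    }

    /** Beans that answered the most recent ping; searches should go only to these. */
    Set<Pingable> activeBeans() {
        return active;
    }

    /** Runs pingOnce repeatedly in the background; the caller shuts the timer down. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(this::pingOnce, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return timer;
    }
}
```

Because every known bean is pinged on each round, not only the active ones, a server that comes back up is picked up again on the next round without any extra bookkeeping.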

* SegmentBean is similar:
    
    SegmentBean (extends HitContent, HitSummarizer)

        RPCSegmentBean (extends SegmentBean, VersionedProtocol)

            FetchedSegments (is similar to older version)

* DistributedSearch$SegmentServer (which uses FetchedSegments internally) reads 
its config from crawl/segment-servers.txt.

* I also added a couple of utility classes for sending requests to solr and 
processing responses (under o.a.n.util.solr)

Sorry if the description is a bit complex (the code itself should be easy 
to understand, though). Comments, suggestions, reviews and all other sorts of 
feedback are welcome.


> Integrate Solr/Nutch
> --------------------
>
>                 Key: NUTCH-442
>                 URL: https://issues.apache.org/jira/browse/NUTCH-442
>             Project: Nutch
>          Issue Type: New Feature
>         Environment: Ubuntu linux
>            Reporter: rubdabadub
>         Attachments: RFC_multiple_search_backends.patch
>
>
> Hi:
> I tried out Sami's patch for Solr/Nutch, which can be found here 
> (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html),
> and I can confirm it worked :-) That led me to request the following:
> I would be very grateful if this could be included in nutch 0.9, as I 
> am trying to eliminate my python-based crawler, which posts documents to solr. 
> As I am in a corporate environment, I can't install the trunk version in the 
> production environment, so I am asking for this to be included in the 0.9 
> release. I hope my wish will be granted.
> I look forward to getting some feedback.
> Thank you.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
