Re: [Nutch-general] Integrate nutch crawler with Solr index server

Sami Siren Tue, 26 Jun 2007 07:15:44 -0700

Doğacan Güney wrote:
> Hi,
> 
> On 6/26/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> Is this actually planned (addition of SolrIndexer to Nutch)?
>> A search for SolrIndexer in JIRA got no hits.
> 
> There is NUTCH-442 (one of the most popular issues). But, after Sami's
> work, there have been no further developments.
> 
> I think Sami Siren's original patch no longer works with Solr, and I
> am not sure if it still applies to Nutch. So, if anyone wants to tackle
> this, here are a couple of items off the top of my head:


It still applies to Nutch (it only adds two classes) and works with the
original Solr client (I don't know whether that client is still
available).

I am currently working on something around Solr-Nutch integration and
hope to release something within the next few weeks.

> 
> 1) Bring Sami's patch up-to-date (both with Solr and with Nutch). I
> think a separate Indexer job is unnecessary; we should just change
> Indexer.OutputFormat to check for a parameter, and if it is true,
> OutputFormat should also send documents to Solr (besides writing them
> to the Lucene index in DFS).

I actually think that endlessly adding configuration options does not
do anyone any good. We should instead start writing reusable pieces of
code and/or bring the number of different options down
(<imo>The massive number of configuration/runtime options already
available, and the fact that most of Nutch is not designed to be
extended by coding, is harmful for advanced users. On the other hand, I
think that things are already too complicated for novice users</imo>).

> 2) Make it work in distributed setups (i.e. with more than 1 index
> server). Sami Siren also makes a note of this, but I don't believe
> that a simple hash-the-url approach is appropriate for nutch. It would
> be nice to guarantee that a url always goes to the same indexing
> server, even if we add or remove index servers (if we just take the
> hash of url, then adding a new machine would cause pretty much all
> urls to be distributed to different servers).

I think that the distributed online index part should be done outside of
Nutch (or, if done here, done with extreme caution :) so that it does
not get tied to Nutch.
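
To illustrate the remapping concern quoted in item 2, here is a minimal
sketch in Python. It is not code from Nutch, Solr, or the NUTCH-442
patch. It compares a plain hash-modulo assignment with a consistent-hash
ring, which is one common way to keep most URLs on the same server when
servers are added or removed. All names (stable_hash, modulo_server,
HashRing, moved_fraction, the solrN server labels) are hypothetical and
chosen only for this example.

```python
import bisect
import hashlib


def stable_hash(key):
    """Deterministic 64-bit hash (built-in hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def modulo_server(url, servers):
    """Naive scheme: hash the URL and take it modulo the server count."""
    return servers[stable_hash(url) % len(servers)]


class HashRing:
    """Consistent-hash ring with several virtual points per server."""

    def __init__(self, servers, replicas=100):
        # Each server owns `replicas` points spread around the ring.
        self._ring = sorted(
            (stable_hash("%s#%d" % (server, i)), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, url):
        # Pick the first ring point after the URL's hash, wrapping around.
        idx = bisect.bisect_right(self._points, stable_hash(url))
        return self._ring[idx % len(self._ring)][1]


def moved_fraction(assign_before, assign_after, urls):
    """Fraction of URLs whose server differs between two assignments."""
    moved = sum(1 for u in urls if assign_before(u) != assign_after(u))
    return moved / len(urls)


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(10000)]
    old = ["solr1", "solr2", "solr3", "solr4"]
    new = old + ["solr5"]

    by_modulo = moved_fraction(
        lambda u: modulo_server(u, old),
        lambda u: modulo_server(u, new),
        urls,
    )
    ring_old, ring_new = HashRing(old), HashRing(new)
    by_ring = moved_fraction(ring_old.server_for, ring_new.server_for, urls)

    print("moved with modulo hashing: %.2f" % by_modulo)
    print("moved with hash ring:      %.2f" % by_ring)
```

Going from four to five servers, the modulo scheme is expected to move
about four fifths of the URLs, while the ring is expected to move only
about one fifth, and every URL that moves lands on the new server. The
number of virtual points per server (100 here) is an arbitrary choice
that trades memory for a more even spread.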

-- 
 Sami Siren

