Re: [Nutch-general] Integrate nutch crawler with Solr index server

Doğacan Güney Tue, 26 Jun 2007 07:47:09 -0700

On 6/26/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> Doğacan Güney wrote:
> > Hi,
> >
> > On 6/26/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> Is this actually planned (addition of SolrIndexer to Nutch)?
> >> A search for SolrIndexer in JIRA got no hits.
> >
> > There is NUTCH-442 (one of the most popular issues). But, after Sami's
> > work, there have been no further developments.
> >
> > I think Sami Siren's original patch no longer works with Solr, and I
> > am not sure if it still applies to nutch. So, if anyone wants to tackle
> > this, here are a couple of items off the top of my head:
>
> It still applies to nutch (actually there were just two additional
> classes) and works with the original client (don't know if it's still
> available).
>
> I am currently working on something around solr-nutch integration and
> hoping that I can give out something within the next few weeks.


Excellent, nice to see you working on this :)

>
> >
> > 1) Bring Sami's patch up-to-date (both with solr and with nutch). I
> > think a separate Indexer job is unnecessary; we should just change
> > Indexer.OutputFormat to check for a parameter, and if it's true,
> > OutputFormat should also send documents to Solr (besides writing them
> > to the lucene index in DFS).
>
> I actually think that the endless adding of configuration options does
> not do any good to anyone, we should instead start to write reusable
> pieces of code and/or bring the number of different options down
> (<imo>The massive number of already available configuration/runtime
> options and the fact that most of nutch is not designed to be extended
> by coding is harmful for advanced users. On the other hand, I think that
> things are already too complicated for novice users</imo>)

OK, adding new configuration options all the time is probably not a
great idea. But I strongly believe that indexing to different targets
should be done in Indexer.OutputFormat (OutputFormat outputs to
different targets, makes sense to me :). For example, I would love the
ability to index to Solr, but I would also need to store the original
Lucene index in DFS (so that if the Solr machine dies, I don't lose my
index). I shouldn't have to run Indexer twice to achieve this.
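
To make the idea concrete, here is a minimal, language-neutral sketch of
a writer that fans each document out to several targets in one pass. It
is not Nutch or Hadoop code: FanOutWriter and ListTarget are hypothetical
names, and ListTarget only stands in for a real target such as the Lucene
index in DFS or a Solr server.

```python
class FanOutWriter:
    """Forwards every document to each configured target in a single pass."""

    def __init__(self, targets):
        self.targets = list(targets)

    def write(self, doc):
        for target in self.targets:
            target.write(doc)

    def close(self):
        for target in self.targets:
            target.close()


class ListTarget:
    """Hypothetical stand-in for a real target (Lucene index in DFS, Solr)."""

    def __init__(self, name):
        self.name = name
        self.docs = []
        self.closed = False

    def write(self, doc):
        self.docs.append(doc)

    def close(self):
        self.closed = True


# One indexing run, two destinations.
lucene = ListTarget("lucene-in-dfs")
solr = ListTarget("solr")
writer = FanOutWriter([lucene, solr])
for doc in ({"url": "http://example.com/a"}, {"url": "http://example.com/b"}):
    writer.write(doc)
writer.close()
```

With this shape, the set of targets is the only thing that varies; the
indexing loop itself runs once regardless of how many targets are listed.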

>
> > 2) Make it work in distributed setups (i.e. with more than 1 index
> > server). Sami Siren also makes a note of this, but I don't believe
> > that a simple hash-the-url approach is appropriate for nutch. It would
> > be nice to guarantee that a url always goes to the same indexing
> > server, even if we add or remove index servers (if we just take the
> > hash of url, then adding a new machine would cause pretty much all
> > urls to be distributed to different servers).
>
> I think that the distributed online Index part should be done outside of
> Nutch (or if done here do it with extreme caution:) so it does not get
> tied to Nutch.

I am not sure I understand you here. If I have 10 machines I am using
for serving indexes (I am assuming I have a Solr instance running on
each one), IndexerSolr should be able to partition my index across
those 10 machines.
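
As an illustration of the partitioning concern in item 2, here is a
minimal sketch of consistent hashing, one known technique for keeping
most URLs on the same server when servers are added or removed. It is
not something the existing patch does; HashRing, the server names and
the replica count are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key):
    """Stable hash of a string (MD5 is used only for spreading keys)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self._points = []  # sorted list of (hash, server) pairs
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each server owns several points on the ring to even out the load.
        for i in range(self.replicas):
            bisect.insort(self._points, (_hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._points = [p for p in self._points if p[1] != server]

    def server_for(self, url):
        # First point at or after the URL's hash, wrapping around the ring.
        i = bisect.bisect(self._points, (_hash(url), ""))
        return self._points[i % len(self._points)][1]


urls = [f"http://example.com/page{i}" for i in range(1000)]
ring = HashRing([f"solr{i}" for i in range(10)])
before = {u: ring.server_for(u) for u in urls}

ring.add("solr10")  # grow from 10 to 11 index servers
after = {u: ring.server_for(u) for u in urls}
moved = [u for u in urls if before[u] != after[u]]
```

Only URLs that now belong to the new server change owner; everything
else stays put. With a plain hash(url) % N scheme, going from 10 to 11
servers would instead reassign most URLs.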

>
> --
>  Sami Siren
>


-- 
Doğacan Güney
