Doğacan Güney wrote:
> Hi,
>
> On 6/26/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> Is this actually planned (addition of SolrIndexer to Nutch)?
>> A search for SolrIndexer in JIRA got no hits.
>
> There is NUTCH-442 (one of the most popular issues). But, after Sami's
> work, there have been no further developments.
>
> I think Sami Siren's original patch no longer works with Solr, I am
> not sure if it still applies to nutch. So, if anyone wants to tackle
> this, here are a couple of items off the top of my mind:

It still applies to Nutch (actually there were just two additional
classes) and works with the original client (I don't know if it's still
available). I am currently working on Solr-Nutch integration and hope
to release something within the next few weeks.

> 1) Bring Sami's patch up-to-date (both with solr and with nutch). I
> think a separate Indexer job is unnecessary, we should just change
> Indexer.OutputFormat to check for a parameter, and if it's true,
> OutputFormat should also send documents to Solr (besides writing it to
> lucene index in DFS).

I actually think that endlessly adding configuration options does no
one any good. We should instead start writing reusable pieces of code
and/or bring the number of options down (<imo>The massive number of
configuration/runtime options already available, and the fact that most
of Nutch is not designed to be extended by coding, are harmful for
advanced users. On the other hand, I think that things are already too
complicated for novice users</imo>).

> 2) Make it work in distributed setups (i.e. with more than 1 index
> server). Sami Siren also makes a note of this, but I don't believe
> that a simple hash-the-url approach is appropriate for nutch. It would
> be nice to guarantee that a url always goes to the same indexing
> server, even if we add or remove index servers (if we just take the
> hash of url, then adding a new machine would cause pretty much all
> urls to be distributed to different servers).

I think that the distributed online index part should be done outside
of Nutch (or, if done here, with extreme caution :) so that it does not
get tied to Nutch.

--
Sami Siren
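
To make the concern in item 2 concrete: with a plain hash(url) % N scheme, changing N reassigns nearly every URL. One way to avoid that is rendezvous (highest-random-weight) hashing, where each URL goes to the server with the highest hash score for that URL. This is only a minimal sketch and not something taken from NUTCH-442 or from Solr; the class name UrlRouter, the method names, and the server names are hypothetical, and MD5 is used purely as a convenient mixing function.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical helper for illustration; not part of Nutch or Solr.
class UrlRouter {

    // Score of a (server, url) pair: the first 8 bytes of an MD5 digest read as a long.
    static long score(String server, String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest((server + "\n" + url).getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Rendezvous hashing: pick the server whose score for this url is highest.
    // Ties are broken by server name so the result does not depend on list order.
    static String serverFor(String url, List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("no index servers configured");
        }
        String best = null;
        long bestScore = 0;
        for (String server : servers) {
            long s = score(server, url);
            if (best == null || s > bestScore || (s == bestScore && server.compareTo(best) < 0)) {
                best = server;
                bestScore = s;
            }
        }
        return best;
    }
}
```

With this scheme, adding a server only moves the URLs for which the new server now has the highest score, and removing a server only moves the URLs that were assigned to it. The cost is one hash per server per URL, which is fine for a handful of index servers but would need a different structure (for example a hash ring) for very large clusters.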
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
