On 6/26/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> Doğacan Güney wrote:
> > Hi,
> >
> > On 6/26/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >> Is this actually planned (addition of SolrIndexer to Nutch)?
> >> A search for SolrIndexer in JIRA got no hits.
> >
> > There is NUTCH-442 (one of the most popular issues). But, after Sami's
> > work, there have been no further developments.
> >
> > I think Sami Siren's original patch no longer works with Solr, I am
> > not sure if it still applies to nutch. So, if anyone wants to tackle
> > this, here are a couple of items off the top of my mind:
>
> It still applies to nutch (actually there were just two additional
> classes) and works with the original client (don't know if it's still
> available).
>
> I am currently working on something around solr-nutch integration and
> hoping that I can give out something within the next few weeks.
Excellent, nice to see you working on this :)

> >
> > 1) Bring Sami's patch up-to-date (both with solr and with nutch). I
> > think a separate Indexer job is unnecessary, we should just change
> > Indexer.OutputFormat to check for a parameter, and if it's true,
> > OutputFormat should also send documents to Solr (besides writing them
> > to the lucene index in DFS).
>
> I actually think that the endless adding of configuration options does
> not do any good to anyone, we should instead start to write reusable
> pieces of code and/or bring the number of different options down
> (<imo>The massive number of already available configuration/runtime
> options and the fact that most of nutch is not designed to be extended
> by coding is harmful for advanced users. On the other hand I think that
> things are already too complicated for novice users</imo>)

OK, adding new configuration options all the time is probably not a
great idea. But I strongly believe that indexing to different targets
should be done in Indexer.OutputFormat (OutputFormat outputs to
different targets, makes sense to me :). For example, I would love the
ability to index to solr but I would also need to store the original
lucene index in DFS (so that if the solr machine dies, I don't lose my
index). I shouldn't have to run Indexer twice to achieve this.

> > 2) Make it work in distributed setups (i.e. with more than 1 index
> > server). Sami Siren also makes a note of this, but I don't believe
> > that a simple hash-the-url approach is appropriate for nutch. It would
> > be nice to guarantee that a url always goes to the same indexing
> > server, even if we add or remove index servers (if we just take the
> > hash of url, then adding a new machine would cause pretty much all
> > urls to be distributed to different servers).
>
> I think that the distributed online Index part should be done outside of
> Nutch (or if done here do it with extreme caution:) so it does not get
> tied to Nutch.

I am not sure I understand you here. If I have 10 machines I am using
for serving indexes (I am assuming I have a Solr instance running on
each one), IndexerSolr should be able to partition my index across
those 10 machines.

> --
> Sami Siren

--
Doğacan Güney

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
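
On the partitioning concern in item 2 (a url should keep going to the
same index server when servers are added or removed), one common
technique is consistent hashing. The following is only a minimal sketch
of that idea, not code from the NUTCH-442 patch, Nutch, or Solr. The
class name ConsistentHashRing, its method names, the "solrN" server
labels, and the choice of MD5 with 100 virtual nodes per server are all
assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Nutch or Solr.
class ConsistentHashRing {
    // Ring position -> server name. Each server owns several positions.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int replicas; // virtual nodes per server (assumed tunable)

    ConsistentHashRing(int replicas) {
        this.replicas = replicas;
    }

    void addServer(String server) {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    void removeServer(String server) {
        for (int i = 0; i < replicas; i++) {
            // Two-argument remove: only drop the point if this server owns it.
            ring.remove(hash(server + "#" + i), server);
        }
    }

    // A url belongs to the first ring point at or after its own hash,
    // wrapping around to the first point if none follows it.
    String serverFor(String url) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no index servers registered");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(url));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    // First 8 bytes of the MD5 digest, packed into a long.
    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        for (int i = 0; i < 10; i++) {
            ring.addServer("solr" + i);
        }
        int n = 10000;
        String[] before = new String[n];
        for (int i = 0; i < n; i++) {
            before[i] = ring.serverFor("http://example.com/page" + i);
        }
        ring.addServer("solr10"); // grow from 10 to 11 index servers
        int moved = 0;
        for (int i = 0; i < n; i++) {
            if (!ring.serverFor("http://example.com/page" + i).equals(before[i])) {
                moved++;
            }
        }
        System.out.println(moved + " of " + n + " urls moved after adding one server");
    }
}
```

With a plain hash(url) % numServers scheme, going from 10 to 11 servers
would reassign most urls. With the ring above, the only urls that move
are the ones the new server takes over, which should be roughly 1/11 of
them, and removing that server again restores the previous assignment.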
