Can you provide details about the parts of the examples that weren't
clear? Perhaps I can clean up the docs or help you figure it out.
-Grant
On Dec 27, 2008, at 3:42 PM, Veselin Kantsev wrote:
Hello,
I am now using Solr 1.3 with Tomcat 6 on a Debian Lenny box.
Could you please point me to any other instructions/HowTos online on
integrating Tika or perhaps the RichDocumentHandler with Solr?
Apart from the Solr Wiki, that is, as following those examples did not
help in my case.
Thank you.
Veselin K.
On Wed, Dec 17, 2008 at 10:43:57AM +0000, Veselin K wrote:
Thank you Erik, Hoss.
- If using either Solr's "stream.file" or Nutch's crawler,
what is the procedure for adding new files?
That is to say, if I did not know which files in a specific folder were
new and thus passed all files to Solr/Nutch, would it skip the ones
that have already been indexed?
- Also, what if a file gets modified? Would Solr/Nutch detect the
change and re-index just that modified file?
Or should some kind of cache be cleared and everything re-indexed?
- In order to give the user the option to search the indexes of two
separate Solr/Nutch servers, do I need to link both servers somehow
and join their indexes into one, or is it just a question of designing
the web front-end so that it offers the choice of sending the search
query to one or more different servers?
Thank you,
Veselin K
On Sun, Dec 14, 2008 at 11:22:00AM -0800, Chris Hostetter wrote:
: the easiest way to get rolling. A simple script that recurses your
: folders and issues a simple request posting each file in turn to Solr
: will give you a full text searchable index in no time (well, ok, it'll
: take a little time, but it'll be as fast as anything else out there).
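The "simple script" quoted above can be little more than a find loop; here is a minimal sketch in sh, assuming the extracting handler (Solr Cell) is mapped at /update/extract and using each file's path as its unique id, so that re-posting a modified file overwrites the old document instead of duplicating it (the URL, handler path, and id scheme are assumptions to adapt to your setup):

```shell
#!/bin/sh
# Sketch: print one curl command per file under a directory.
# SOLR_URL is an assumed handler mapping -- check your solrconfig.xml.
SOLR_URL="http://localhost:8983/solr/update/extract"

post_all() {
  dir="$1"
  # One request per regular file; the path doubles as the unique id.
  find "$dir" -type f | sort | while read -r f; do
    echo curl -s "$SOLR_URL?literal.id=$f" -F "file=@$f"
  done
}

# Drop the leading "echo" in post_all to actually send the requests,
# then commit once so the new documents become searchable:
#   curl -s "http://localhost:8983/solr/update" -d '<commit/>'
```

Note the single commit at the end: committing after every file would slow indexing down considerably.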
if all the files are "local" on the machine that Solr is running on,
you don't even need to POST them; Solr can be configured to read the
files by local filename using the "stream.file" param...
http://wiki.apache.org/solr/ContentStream
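For a concrete sketch of the stream.file route: remote streaming must first be enabled via enableRemoteStreaming="true" on the requestParsers element in solrconfig.xml, and the handler path below (/update/extract) is an assumption to match to your own config:

```shell
# Build the stream.file URL for a local path; Solr then reads the file
# itself, so nothing is uploaded over the wire.
# Host, port, and handler path are assumptions for illustration.
stream_file_url() {
  echo "http://localhost:8983/solr/update/extract?stream.file=$1&literal.id=$1"
}

# Usage (requires enableRemoteStreaming="true" in solrconfig.xml):
#   curl "$(stream_file_url /var/docs/report.pdf)"
#   curl "http://localhost:8983/solr/update" -d '<commit/>'
```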
that said: if your fileserver implementation already exposes all of
the files over HTTP, then using Nutch and its crawler might be an
easier way to get started on indexing all of them ... hard to say
without being in your shoes. You may want to experiment with both.
-Hoss
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ