Thank you Erik, Hoss.

- If using either Solr's "stream.file" or Nutch's crawler, what is the procedure for adding new files? That is to say, if I did not know which files in a given folder are new and simply passed all of them to Solr/Nutch, would it skip the ones that have already been indexed?
- Also, if a file gets modified, would Solr/Nutch detect the change and re-index just that modified file? Or does some kind of cache need to be cleared and everything re-indexed?

- In order to give the user the option of searching the indexes of two separate Solr/Nutch servers, do I need to link both servers somehow and join their indexes into one, or is it just a question of designing the web front-end so that it offers the choice of sending the search query to one or more servers?

Thank you,
Veselin K

On Sun, Dec 14, 2008 at 11:22:00AM -0800, Chris Hostetter wrote:
>
> : the easiest way to get rolling. A simple script that recurses your folders
> : and issues a simple request posting each file in turn to Solr will give you a
> : full text searchable index in no time (well, ok, it'll take a little time,
> : but it'll be as fast as anything else out there).
>
> if all the files are "local" on the machine that Solr is running on you
> don't even need to POST them, Solr can be configured to read the files by
> local filename using the "stream.file" param...
>
> http://wiki.apache.org/solr/ContentStream
>
> that said: if your fileserver implementation already exposes all of the
> files over HTTP, then using Nutch and its crawler might be an easier way
> to get started on indexing all of them ... hard to say without being in
> your shoes. you may want to experiment with both.
>
>
> -Hoss
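As a sketch of the stream.file approach Hoss describes: instead of POSTing a document's bytes, you hand Solr a local path and it reads the file itself. The host, port, and file path below are hypothetical placeholders, and remote streaming has to be enabled (enableRemoteStreaming="true" on requestParsers in solrconfig.xml) for this to work.

```shell
# Build the stream.file request for a local Solr XML document
# (an <add><doc>...</doc></add> payload). Host and path are hypothetical.
SOLR=http://localhost:8983/solr
FILE=/data/docs/doc1.xml
URL="$SOLR/update?stream.file=$FILE&commit=true"
echo "$URL"
# Then issue it with:  curl "$URL"
```

Note that Solr only reads the file; the path must be visible from the machine Solr runs on, not from the client issuing the request.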

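Erik's "simple script that recurses your folders" could look something like the sketch below, combined with the stream.file trick: walk a directory tree, issue one update request per file, then commit once at the end. It prints the commands rather than running them (swap echo for curl, or pipe to sh, to actually index); the Solr URL and the *.xml filter are assumptions to adjust for your setup.

```shell
# index_dir: recurse a folder and print one stream.file update request per
# XML file found, followed by a single commit request. Hypothetical host;
# pipe the output to sh (or replace echo with curl) to actually index.
index_dir() {
  find "$1" -type f -name '*.xml' | while read -r f; do
    echo curl -s "http://localhost:8983/solr/update?stream.file=$f"
  done
  echo curl -s "http://localhost:8983/solr/update?commit=true"
}
# Example: index_dir /data/docs | sh
```

Committing once after the loop, rather than per file, keeps the run fast; a commit per document would make Solr reopen its searcher every time.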