Hello,

I think the latest tarball worked for me out of the box. I'm now designing my schema. My goal is to index PDF/DOC/XLS files with the following fields:

0. ID number
1. Filename
2. File path
3. Modification date
4. File contents
5. Number of pages
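To make this concrete, here is a rough sketch of the field definitions I have in mind for schema.xml. The type names (string, date, text, sint) are only my guesses based on the example schema.xml that ships with Solr 1.3, so please correct me if any of them are wrong:

  <!-- sketch only; type names assumed from the Solr 1.3 example schema -->
  <fields>
    <!-- 0. unique key for each indexed document -->
    <field name="id"       type="string" indexed="true" stored="true" required="true"/>
    <!-- 1./2. exact-match fields, not analyzed -->
    <field name="filename" type="string" indexed="true" stored="true"/>
    <field name="filepath" type="string" indexed="true" stored="true"/>
    <!-- 3. DateField expects ISO 8601, e.g. 2008-12-27T21:29:05Z -->
    <field name="moddate"  type="date"   indexed="true" stored="true"/>
    <!-- 4. analyzed full-text field for searching -->
    <field name="contents" type="text"   indexed="true" stored="true"/>
    <!-- 5. sortable int, usable in range queries -->
    <field name="numpages" type="sint"   indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>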
A few questions:

- Any tips on what types of fields I should use to get this data indexed? Are the types in my sketch above reasonable?

- Is there a way to have Solr increment the ID number automatically each time a document is added to the index?

- Would I be able to extract the information above using just the Solr/Tika features, or would I have to source all values myself, except "file contents", and pass them to Solr when indexing?

Also, to check my understanding of the "stream.file" approach Hoss describes below, I imagine the request would look roughly like the sketch that follows.
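(This is guesswork on my part: the /update/extract handler path and the literal.id parameter are taken from the Tika/ExtractingRequestHandler notes on the wiki and may not match the patch I end up with, the host/port will differ on my Tomcat install, and the file path and ID value are just placeholders.)

  # remote streaming must be enabled in solrconfig.xml first, e.g.:
  #   <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />

  # hypothetical request: handler path, literal.id, file path and id value
  # are all placeholders/guesses on my part
  curl "http://localhost:8983/solr/update/extract?stream.file=/data/docs/report.pdf&literal.id=1&commit=true"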
Thank you much.

Regards,
Veselin K

On Sat, Dec 27, 2008 at 09:29:05PM -0500, Grant Ingersoll wrote:
> Can you provide details about the part of the examples that weren't
> clear? Perhaps I can clean up the docs or help you figure it out.
>
> -Grant
>
> On Dec 27, 2008, at 3:42 PM, Veselin Kantsev wrote:
>> Hello,
>> I am now using Solr 1.3 with Tomcat 6 on a Debian Lenny box.
>>
>> Could you please advise of any other instructions/HowTos on
>> integrating Tika or maybe RichDocumentHandler with Solr that I can
>> find online? Apart from the Solr Wiki, as following those examples
>> did not help in my case.
>>
>> Thank you.
>>
>> Veselin K.
>>
>> On Wed, Dec 17, 2008 at 10:43:57AM +0000, Veselin K wrote:
>>> Thank you Erik, Hoss.
>>>
>>> - If using either Solr's "stream.file" or Nutch's crawler, what is
>>> the procedure for adding new files? That is to say, if I did not
>>> know which files in a specific folder were new and thus passed all
>>> files to Solr/Nutch, would it skip the ones that have already been
>>> indexed?
>>>
>>> - Also, what if a file gets modified? Would Solr/Nutch detect the
>>> change and re-index just that modified file, or should some kind of
>>> cache be cleared and everything re-indexed?
>>>
>>> - In order to give the user the option to search the indexes of two
>>> separate Solr/Nutch servers, do I need to link both servers somehow
>>> and join their indexes into one, or is it just a question of
>>> designing the web front-end so that it offers the choice of sending
>>> a search query to one or more servers?
>>>
>>> Thank you,
>>> Veselin K
>>>
>>> On Sun, Dec 14, 2008 at 11:22:00AM -0800, Chris Hostetter wrote:
>>>>
>>>> : the easiest way to get rolling. A simple script that recurses
>>>> : your folders and issues a simple request posting each file in
>>>> : turn to Solr will give you a full text searchable index in no
>>>> : time (well, ok, it'll take a little time, but it'll be as fast
>>>> : as anything else out there).
>>>>
>>>> if all the files are "local" on the machine that Solr is running
>>>> on, you don't even need to POST them; Solr can be configured to
>>>> read the files by local filename using the "stream.file" param...
>>>>
>>>> http://wiki.apache.org/solr/ContentStream
>>>>
>>>> that said: if your fileserver implementation already exposes all
>>>> of the files over HTTP, then using Nutch and its crawler might be
>>>> an easier way to get started on indexing all of them ... hard to
>>>> say without being in your shoes. you may want to experiment with
>>>> both.
>>>>
>>>> -Hoss
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ