Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Christian Klinger
Hi Solr Users, I have set up a Solr server with a custom schema. Now I have updated the index with some content from XML files. Now I am trying to update the contents of a folder. The folder consists of various document types (pdf, doc, xls, ...). Is there a howto anywhere on how I can parse the

Re: Embedded solr - reload searcher

2007-08-21 Thread Erik Hatcher
For other Solr instances (whether embedded or not) to refresh their index searchers, send a <commit/> message to them. Erik On Aug 21, 2007, at 7:33 AM, sinking wrote: Hello, I have tried to use the EmbeddedSolr (http://wiki.apache.org/solr/EmbeddedSolr) because I want to work
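A minimal sketch of the commit message mentioned above, assuming the instance accepts XML update messages over HTTP POST. The URL `http://localhost:8983/solr/update` is a hypothetical placeholder, not taken from the thread, and the request is only built here, not sent.

```python
import urllib.request


def build_commit_request(update_url):
    """Build (but do not send) an HTTP POST carrying a <commit/> message."""
    return urllib.request.Request(
        update_url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Hypothetical update URL (assumption); adjust host, port, and path as needed.
request = build_commit_request("http://localhost:8983/solr/update")

# To actually send it against a running instance:
# urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to loop over several instances and commit on each in turn.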

RE: How to read values of a field efficiently

2007-08-21 Thread Martin Grotzke
On Tue, 2007-08-21 at 11:52 +0200, Ard Schrijvers wrote: you're missing the key piece that Ard alluded to ... there is one ordered list of all terms stored in the index ... a TermEnum lets you iterate over this ordered list, and the IndexReader.terms(Term) method lets you

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Peter Manis
Installing the patch requires downloading the latest solr via subversion and applying the patch to the source. Eric has updated his patch with various revisions of subversion. To make sure it will compile I suggest getting the revision he lists. As for using the features of this patch. This is

Using MMapDirectory instead of FSDirectory

2007-08-21 Thread Jérôme Etévé
Hi! Is there a way to use an MMapDirectory instead of FSDirectory within Solr? Our index is quite big and it takes a long time to be loaded into the OS cache. I'm wondering if an MMapDirectory could help to have our data in memory quicker (our index on disk is bigger than our memory

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Peter Manis
I am a little confused about how you have things set up, so these meta data files contain certain information and there may or may not be a pdf, xls, doc that it is associated with? If that is the case, if it were me I would write something to parse the meta data files, and if there is a binary file

Index HotSwap

2007-08-21 Thread Jérôme Etévé
Hi all, I'm wondering what's the best way to completely change a big index without losing any requests. This is how I do it at the moment: the solr index is a soft link to a directory dir. When I want to install a new index (in dir.new), I do a mv dir dir.old ; mv dir.new dir Then I ask for a
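The two `mv` steps described above can be sketched in Python as follows. This is only an illustration of the renames as stated in the message; the function name `swap_index` is hypothetical, and whatever step follows the swap in the original message is not shown because the snippet is truncated.

```python
import os


def swap_index(live_dir):
    """Rename live_dir to live_dir + '.old', then live_dir + '.new' to live_dir.

    Mirrors `mv dir dir.old ; mv dir.new dir`. Between the two renames there
    is a short window in which live_dir does not exist. The first rename is
    expected to fail if a non-empty live_dir + '.old' already exists.
    """
    old_dir = live_dir + ".old"
    new_dir = live_dir + ".new"
    os.rename(live_dir, old_dir)   # mv dir dir.old
    os.rename(new_dir, live_dir)   # mv dir.new dir
    return old_dir
```

Returning the path of the retired directory lets the caller delete or archive it once no searcher is using it any more.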

RE: Index HotSwap

2007-08-21 Thread Gunther, Andrew
I guess the first question is why you have to swap in a big index, instead of rsync'ing or another method. I've entertained the idea of putting a load balancer in front of two solr instances. In this scenario, take one off-line, swap in the index, bring it back on, and then bring down the other.

RE: Commit performance

2007-08-21 Thread Gunther, Andrew
I've seen even longer commit times with our 2GB index and have not had a chance to look into it deeper. What I have noticed is when there are Searchers registered commits take a lot longer time. Perhaps looking at the optional attributes for commit (waitSearcher, waitFlush) would help. Since we

Re: solved: quering UTF-8 encoded CSV files

2007-08-21 Thread Chris Hostetter
: The conclusion is that setting URIEncoding=UTF-8 in the Connector : section in server.xml is not enough : : I also needed to add -Dfile.encoding=UTF-8 to the tomcat’s java : startup options (in catalina.bat) seeing how you resolved this problem, has got me thinking ... how did you index the

Re: Commit performance

2007-08-21 Thread Chris Hostetter
: How long should a commit take? I've got about 9.8G of data for 9M of : records. (Yes, I'm indexing too much data.) My commits are taking 20-30 the low levels of updating aren't my forte, but as i recall the dominant factor in how long it takes to execute a commit is the number of deleted

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Vish D.
On 8/21/07, Peter Manis wrote: I am a little confused how you have things setup, so these meta data files contain certain information and there may or may not be a pdf, xls, doc that it is associated with? Yes, you have it right. If that is the case, if it were me I would

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Vish D.
On 8/21/07, Vish D. wrote: On 8/21/07, Peter Manis wrote: I am a little confused how you have things setup, so these meta data files contain certain information and there may or may not be a pdf, xls, doc that it is associated with? Yes, you have it

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Peter Manis
I can't find the documentation, but I believe apache's max url is 8192, so I would assume a lot of other apps like tomcat and jetty would be similar. I haven't run into any problems yet. Maybe shoot Eric an email and see if he would be interested in adapting the code to take XML as well so that

RE: clear index

2007-08-21 Thread Sundling, Paul
Trying the query approach with a 3GB index takes over a minute to clear the index. The reason not to stop the servlet container and delete the files manually is that in a particular environment the person testing may not have access to the filesystem directly. Usually you want to do

RE: clear index

2007-08-21 Thread Lance Norskog
It might be worthwhile to have a hibernate mode for solr, where suspend waits until all requests are finished, then closes all files and rejects all new requests. Later a wakeup command would bring it back online. During this time, a remotely controlled job could remove the data directory. This

Re: Structured Lucene documents

2007-08-21 Thread Pieter Berkel
On 21/08/07, Pierre-Yves LANDRON wrote: It seems the highlight fields must be specified, and that I can't use the * completion to do so. Am I right? Is there a way to get around this requirement? As far as I know, dynamic fields are used mainly during indexing and

Replacing existing documents

2007-08-21 Thread Lance Norskog
Recently someone mentioned that it would be possible to have a 'replace existing document' feature rather than just dropping and adding documents with the same unique id. We have a few use cases in this area and I'm researching whether it is effective to check for a document via Solr queries, or

RE: Commit performance

2007-08-21 Thread Chris Hostetter
: chance to look into it deeper. What I have noticed is when there are : Searchers registered commits take a lot longer time. Perhaps looking at that's probably the warming time taken to reopen the new searcher ... waitSearcher=false should cause those commits to return much faster (the down
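A small sketch of how a commit message carrying the optional attributes discussed in this thread (waitSearcher, waitFlush) could be assembled. The helper name `commit_message` is hypothetical, and the string it returns would still have to be posted to the update handler of the Solr instance.

```python
import xml.etree.ElementTree as ET


def commit_message(wait_searcher=None, wait_flush=None):
    """Return a <commit/> XML message, adding only the attributes requested."""
    commit = ET.Element("commit")
    if wait_searcher is not None:
        commit.set("waitSearcher", "true" if wait_searcher else "false")
    if wait_flush is not None:
        commit.set("waitFlush", "true" if wait_flush else "false")
    return ET.tostring(commit, encoding="unicode")


# Commit without waiting for the new searcher to finish warming.
message = commit_message(wait_searcher=False)
```

Leaving an argument as `None` omits the attribute entirely, so the server's default behaviour applies for it.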

RE: clear index

2007-08-21 Thread Chris Hostetter
: I'm just seeing if there's an easy/performant way of doing it with Solr. : For a solution with raw Lucene, creating a new index with the same : directory cleared out an old index (even on Windows with its file : locking) quickly. there has been talk of optimizing delete by query in the case