Lucene 2.1: java.io.IOException: Lock obtain timed out: SimpleFSLock@path of index file
All, we encounter issues while updating the Lucene index. Here is the stack trace:

Caused by: java.io.IOException: Lock obtain timed out: SimpleFSLock@/data/www/orcanta/lucene/store1/write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:69)
        at org.apache.lucene.index.IndexReader.aquireWriteLock(IndexReader.java:526)
        at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:551)
        at org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:578)
        at com.bi.commerce.service.catalog.spring.lucene.LuceneIndex.deleteFromIndex(LuceneIndex.java:692)
        ... 25 more

Here is the source of the Lucene API invocation where the error occurs, in class com.bi.commerce.service.catalog.spring.lucene.LuceneIndex:

import org.apache.lucene.index.IndexReader;
...
public synchronized void deleteFromIndex(ICatalogEntity entity) {
    if (!indexExists()) return;
    try {
        IndexReader reader = IndexReader.open(store);
        String uid = getUID(entity);
        try {
            reader.deleteDocuments(new Term(uid, uid));  // line 692
        } catch (ArrayIndexOutOfBoundsException e) {
            // CHECK: ignore this. Can happen if the index has not been built yet (??)
        }
        reader.close();
    } catch (IOException e) {
        throw new SearchEngineException(e);
    } catch (RuntimeException e) {
        throw new SearchEngineException(e);
    }
}

Am I doing something wrong? If somebody has already encountered this error, or knows a fix, I'm really interested! Thanks in advance.

Best regards,
-Jerome Chauvin-

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: [ANN] ParallelSearcher in multi-node environment
E.g. I've changed the original ParallelSearcher to use a thread pool (java.util.concurrent.ThreadPoolExecutor from JDK 1.5). But implementing a multi-host installation still requires a lot of changes, since ParallelSearcher calls the underlying Searchables too many times (e.g. a separate network call for every document).

Dmitri

--
View this message in context: http://www.nabble.com/ParallelSearcher-in-multi-node-environment-tf3301080.html#a9245525
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
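The batched fan-out Dmitri has in mind could be sketched in plain Java like this. This is not ParallelSearcher code; the `Node` interface is a hypothetical stand-in for a remote Searchable, and the point is only to show one thread-pool submission and one result batch per node, rather than a remote call per document:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the pattern described above: send each underlying node its query
// once, in parallel, via a thread pool, and merge whole result batches --
// instead of a separate remote call per document.
public class FanOutSearch {
    public interface Node {
        List<String> search(String query); // one batched call per node
    }

    public static List<String> searchAll(final String query, List<Node> nodes) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, nodes.size()));
        try {
            List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
            for (final Node n : nodes) {
                futures.add(pool.submit(new Callable<List<String>>() {
                    public List<String> call() { return n.search(query); }
                }));
            }
            List<String> merged = new ArrayList<String>();
            for (Future<List<String>> f : futures) {
                try {
                    merged.addAll(f.get()); // one batch of hits per node
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

A real multi-host version would still need per-node scoring to be merged sensibly, which is exactly the part that takes "a lot of changes".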
Re: indexing performance
On Tue, Feb 27, 2007, Saravana wrote about "indexing performance":
> Is it possible to scale lucene indexing like 2000/3000 documents per second?

I don't know about the actual numbers, but one trick I've used in the past to get really fast indexing was to create several independent indexes in parallel. Simply, if you have, say, 4 CPUs and perhaps even several physical disks, run 4 indexing processes, each indexing 1/4 of the files and creating a separate index (on separate disks on separate IO channels, if possible). At the end, you have 4 indexes which you can actually search together without any real need to merge them, unless query performance is very important to you as well.

> I need to index 10 fields, each 20 bytes long. I should be able to search
> by just giving any of the field values as criteria. I need to get the
> count that has the same field values.

You need just the counts? And you want to do just whole-field matching, not word matching? In that case, Lucene might be overkill for you. Or, if you do use Lucene, make sure to use keyword (untokenized) fields, not tokenized fields.

--
Nadav Har'El | Thursday, Mar 1 2007, 11 Adar 5767
IBM Haifa Research Lab | Open your arms to change, but don't let
http://nadav.harel.org.il | go of your values.
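The split Nadav describes could be sketched roughly as below. Only the round-robin partitioning of the input is shown; the per-partition work (each partition feeding its own IndexWriter on its own disk) is omitted, so nothing here is Lucene-specific:

```java
import java.util.*;

// A minimal sketch of the parallel-indexing split suggested above: divide
// the files round-robin among N independent indexing jobs (one per CPU,
// ideally one per disk), each of which would feed its own IndexWriter and
// produce its own index.
public class IndexPartitioner {
    public static List<List<String>> partition(List<String> files, int workers) {
        List<List<String>> parts = new ArrayList<List<String>>();
        for (int i = 0; i < workers; i++) {
            parts.add(new ArrayList<String>());
        }
        for (int i = 0; i < files.size(); i++) {
            parts.get(i % workers).add(files.get(i)); // round-robin assignment
        }
        return parts;
    }
}
```

At search time the resulting indexes can be queried together (e.g. with a MultiSearcher) without merging them first, as the post notes.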
Update - IOException
Hi,

While updating my index I have the following error:

[3/1/07 9:44:19:214 CET] 76414c82 SystemErr R java.io.IOException: Lock obtain timed out: [EMAIL PROTECTED]:\TEMP\lucene-b56f455aea0a705baecaa4411d590aa2-write.lock
[3/1/07 9:44:19:214 CET] 76414c82 SystemErr R   at org.apache.lucene.store.Lock.obtain(Lock.java:56)
[3/1/07 9:44:19:214 CET] 76414c82 SystemErr R   at org.apache.lucene.index.IndexReader.aquireWriteLock(IndexReader.java:489)
[3/1/07 9:44:19:214 CET] 76414c82 SystemErr R   at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:514)
[3/1/07 9:44:19:214 CET] 76414c82 SystemErr R   at org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:541)

I am using Lucene 2.0. When I execute the code below, I find an entry with the specified Term (it displays "One Entry Found"). Then when I try to delete the document, I get the error I pasted above. What I do is: open a second index reader, delete the document, close the second index reader, close the main index reader, open a new index reader. Can anyone help? Thank you very much.

// Open second indexReader
IndexReader mIndexReaderClone = null;
try {
    mIndexReaderClone = IndexReader.open(mWorkingIndexDir);
} catch (IOException exc) {
    exc.printStackTrace();
    throw new RegainException("Creating index reader failed", exc);
}

Term urlTerm = new Term("url", url1);
Query query2 = new TermQuery(urlTerm);
Hits hits2 = search(query2);
if (hits2.length() > 0) {
    if (hits2.length() > 1) {
        System.out.println("Duplicate Entries");
    }
    System.out.println("One Entry Found");
} else {
    System.out.println("No Entries");
}

try {
    mIndexReaderClone.deleteDocuments(urlTerm);
} catch (IOException e) {
    e.printStackTrace();
    throw new RegainException("Deleting old entry failed", e);
}

// Close the clone IndexReader
try {
    mIndexReaderClone.close();
} catch (IOException exc) {
    throw new RegainException("Closing index reader failed", exc);
}

__ Matt

Internet communications are not secure and therefore Fortis Banque Luxembourg S.A. does not accept legal responsibility for the contents of this message. The information contained in this e-mail is confidential and may be legally privileged. It is intended solely for the addressee. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. Nothing in the message is capable or intended to create any legally binding obligations on either party and it is not intended to provide legal advice.
RE: Update - IOException
I deleted the lock file, and now it seems to work ... When can such an error happen?

__ Matt

From: DECAFFMEYER MATHIEU [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 01, 2007 9:56 AM
To: java-user@lucene.apache.org
Subject: Update - IOException

[quote of the original "Update - IOException" message snipped]
Re: Best way to returning hits after search?
If you decide to cache stored field values in memory, FieldCache may be useful for this - so you don't have to implement your own cache - you can access the field values with something like:

FieldCache fieldCache = FieldCache.DEFAULT;
String[] dbIdField = fieldCache.getStrings(indexReader, DB_ID_FIELD_NAME);

Those values are valid for the lifetime of the index reader. Once a new index reader is opened, when GC collects the unused old index reader object, it would also be able to collect (from the cache) the unused field values.

Thanks for the pointers Doron. I'll take a look at that.

Antony
Re: Lucene 2.1: java.io.IOException: Lock obtain timed out: SimpleFSLock@path of index file
Jerome Chauvin [EMAIL PROTECTED] wrote:
> We encounter issues while updating the lucene index, here is the stack trace:
> Caused by: java.io.IOException: Lock obtain timed out:
> SimpleFSLock@/data/www/orcanta/lucene/store1/write.lock
> [rest of stack trace snipped]

First off, you have to ensure only one writer (either an IndexWriter or, as in this case, an IndexReader doing deletes) is trying to update the index at a time. Lucene allows only one writer on an index, and if a second writer tries to open it, it will receive exactly this exception. (Note that as of 2.1 you can now do deletes with IndexWriter, which simplifies things because you can use a single IndexWriter for adds/updates/deletes.)

If you are already doing that (single writer) correctly, the other common cause is that this is a leftover lock file (for example if the JVM crashed or was killed, or even if you didn't close a previous writer before the JVM exited). There is a better locking implementation (NativeFSLockFactory) that correctly frees the lock when the JVM crashes, so you may want to use that one instead if you hit this often (but first explain the root cause of your crashes!).

Mike
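The crash-safety mentioned for NativeFSLockFactory comes from its use of OS-native file locks (java.nio FileLock), which the operating system releases automatically when the holding process dies, so no stale lock survives a crash the way a SimpleFSLock lock file can. A small stand-alone illustration of that primitive (plain java.nio, not Lucene source):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Illustration of the OS-level lock underlying NativeFSLockFactory: the
// lock is tied to the process by the operating system, so it disappears
// when the process exits, cleanly or not.
public class NativeLockDemo {
    public static boolean tryLock(File f) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        try {
            FileLock lock = raf.getChannel().tryLock(); // null if another process holds it
            if (lock == null) {
                return false;
            }
            lock.release();
            return true;
        } finally {
            raf.close();
        }
    }
}
```

With SimpleFSLock, by contrast, the lock is represented by the mere existence of a file on disk, which nobody removes if the JVM dies.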
RE: Update - IOException
DECAFFMEYER MATHIEU [EMAIL PROTECTED] wrote:
> I deleted the lock file, now it seems to work ...
> When can such an error happen ?

See my response I just sent to java-user on this same error. Even though you are running Lucene 2.0, the same causes can lead to that "Lock obtain timed out" exception.

Mike
Spanned indexes
Hi all, Is it possible in Lucene for an index to span multiple files? If so what is the recommendation in this case? Is it better to span after the index reaches a particular size? Furthermore, does Lucene ever span a single record between two or more index files in this case or does it ensure that a single record will only appear in one spanned file? Many thanks for your advice Sachin This email and any attached files are confidential and copyright protected. If you are not the addressee, any dissemination of this communication is strictly prohibited. Unless otherwise expressly agreed in writing, nothing stated in this communication shall be legally binding. The ultimate parent company of the Atkins Group is WS Atkins plc. Registered in England No. 1885586. Registered Office Woodcote Grove, Ashley Road, Epsom, Surrey KT18 5BW. Consider the environment. Please don't print this e-mail unless you really need to.
Re: Sorting by Score
Peter:

About a custom ScoreComparator: the problem I couldn't get past was that I needed to know the max score of all the docs in order to divide the raw scores into quintiles, since I was dealing with raw scores. I didn't see how to make that work with ScoreComparator, but I confess that I didn't look very hard after someone on the list turned me on to FieldSortedHitQueue.

Erick

On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote:
> It may well be, but as I said this is efficient enough for my needs so I didn't pursue it. One of my pet peeves is spending time making things more efficient when there's no need, and my index isn't going to grow enough larger to worry about that now G...
> Erick

On 2/28/07, Peter Keegan [EMAIL PROTECTED] wrote:
> Erick,
> Yes, this seems to be the simplest way to implement score 'bucketization', but wouldn't it be more efficient to do this with a custom ScoreComparator? That way, you'd do the bucketizing and sorting in one 'step' (compare()). Maybe the savings isn't measurable, though. A comparator might also allow one to do more sophisticated rounding or bucketizing, since you'd be getting 2 scores at a time.
> Peter

On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote:
> Empirically, when I insert the elements in the FieldSortedHitQueue they get sorted according to the Sort object. The original query that gives me a TopDocs applied no secondary sorting, only relevancy. I normalized all the scores into one of only 5 discrete values, so secondary sorting was applied to all docs with the same score when I inserted them in the FieldSortedHitQueue. Now popping things off the FieldSortedHitQueue is ordered the way I want. You could just operate on the FieldSortedHitQueue at this point, but I decided the rest of my code would be simpler if I stuffed them back into the TopDocs, so there's some explanation below that you can just skip if I've cleared things up already.
The step I left out is moving the documents from the FieldSortedHitQueue back to topDocs.scoreDocs. So the steps are as follows:

1) Bucketize the scores. That is, go through the TopDocs.scoreDocs and adjust each raw score into one of my buckets. This is made easy by the existence of topDocs.getMaxScore. TopDocs has had no sorting other than relevancy applied so far.

2) Assemble the FieldSortedHitQueue by inserting each element from scoreDocs into it, with a suitable Sort object; relevance is the first field (SortField.FIELD_SCORE).

3) Pop the entries off the FieldSortedHitQueue, overwriting the elements in topDocs.scoreDocs.

I left out step 3, although I suppose you could operate directly on the FieldSortedHitQueue.

NOTE: in my case, I just put everything back in the scoreDocs without attempting any efficiencies. If I needed more performance, I'd only put as many items back as I needed to display. But as I wrote yesterday, performance isn't an issue so there's no point. Although I know one place to look if we need to squeeze more QPS.

How efficient this is is an open question. But it's fast enough and relatively simple, so I stopped looking for more efficiencies.

Erick

On 2/28/07, Chris Hostetter [EMAIL PROTECTED] wrote:
> : The first part was just to iterate through the TopDocs that's available
> : to me and normalize the scores right in the ScoreDocs. Like this...
>
> Won't that be done after Lucene does the hit collecting/sorting? ... he wants the bucketing to happen as part of the scoring so that the secondary sort will determine the ordering within the bucket. (or am I missing something about your description?)
>
> -Hoss
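The bucketizing step described above might be sketched like this. The linear quintile cutoffs are my assumption, not Erick's actual code; the only Lucene-derived input is the max score from TopDocs:

```java
// A compact sketch of step 1 above: collapse each raw score into one of
// five discrete buckets (quintiles) using the maximum score, so that the
// secondary sort field decides the order of documents that land in the
// same bucket.
public class ScoreBuckets {
    // Returns a bucket in 1..5, where 5 is the top quintile.
    public static int bucketize(float score, float maxScore) {
        if (maxScore <= 0f) {
            return 1; // degenerate case: no meaningful spread
        }
        int bucket = (int) Math.ceil(5.0 * score / maxScore);
        return Math.max(1, Math.min(5, bucket));
    }
}
```

Overwriting each ScoreDoc's score with its bucket, then sorting with relevance as the first SortField, gives the within-bucket secondary ordering the thread is after.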
Re: [ANN] ParallelSearcher in multi-node environment
Yeah, I too am looking forward to this feature: using a thread pool and minimizing the remote calls in ParallelSearcher.

[EMAIL PROTECTED] wrote:
> E.g. I've changed the original ParallelSearcher to use a thread pool (java.util.concurrent.ThreadPoolExecutor from JDK 1.5). But implementing a multi-host installation still requires a lot of changes, since ParallelSearcher calls the underlying Searchables too many times (e.g. a separate network call for every document).
> Dmitri
Re: Sorting by Score
Erick,

I think you're right, because you wouldn't know the max score before the comparisons. I'm just thinking about a rounding algorithm that involves comparing the raw scores to the theoretical maximum score, which I think could be computed from the Similarity class and knowing the max boost value used during indexing.

Peter

On 3/1/07, Erick Erickson [EMAIL PROTECTED] wrote:
[quote of the earlier "Sorting by Score" messages snipped]
Re: Spanned indexes
Sachin,

A lot of the questions you are asking are covered either in the FAQ, on the Lucene site somewhere, in various Lucene articles, or in LIA. You should check those places first (the traffic on java-user is already high!); you'll save yourself a lot of time. For this particular question, have a look at the File Formats page on Lucene's site.

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message ----
From: Kainth, Sachin [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, March 1, 2007 7:21:52 AM
Subject: Spanned indexes

[quote of the original "Spanned indexes" message snipped]
retrieve term positions in query
Hi!

My problem is to retrieve the term positions in a general query with more than one term. It seems that with a phrase query it's possible (with SpanQuery), but with AND and OR queries I can't get the positions for each document I search. I'm looking for a high-level implementation, because I don't want to use the low-level Lucene API (I'm a Lucene newbie...).

Thanx in advance,
Mat
--
View this message in context: http://www.nabble.com/retrieve-term-positions-in-query-tf3327146.html#a9250330
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
RE: Performance in having Multiple Index files
Yes, it will affect the search performance, because you need to merge the results from the different indexes. The best performance is from a single index. The more indexes you have, the more time it takes to search.

Aviran
http://www.aviransplace.com

-----Original Message-----
From: Raaj [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 01, 2007 2:50 AM
To: java-user@lucene.apache.org
Subject: Performance in having Multiple Index files

Hi all, I have a requirement wherein I create an index file for each XML file. I have over 100-150 XML files, which are all related. If I create 100-150 index files and query using these indices, will this affect the performance of the search operation?

Bye,
Raaj
question about ScoreDocComparator
Hello,

One of the fields in my index is an ID, which maps to a full-text description behind the scenes. Now I want to sort the search results alphabetically according to the description, not the ID. This can be done via SortComparatorSource and a ScoreDocComparator without problems. But the code needed to do this is quite complicated - it involves retrieving the document ID from the ScoreDoc, then looking up the Document through an IndexReader, and then retrieving the ID field from the document. It seems that there should be an easier way to get at the ID field, since that is the one being used for the sort. There is a related class FieldDoc, through which it seems possible to get at the field values, but that doesn't seem applicable here. I went through the custom sorting example of Lucene In Action, but that doesn't deal with this case. Am I missing something obvious?

Thanks in advance,
Ulf
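Stripped of the Lucene plumbing, the sort being asked about boils down to comparing looked-up values rather than the stored field itself. A plain-Java sketch of that core idea (the Map here is a hypothetical stand-in for whatever store resolves IDs to descriptions; in a real SortComparatorSource the resolution could be done once per IndexReader rather than once per comparison):

```java
import java.util.*;

// Sketch of sorting hits by the description each ID maps to, instead of by
// the raw ID value. Only the comparator logic is shown; the Lucene document
// lookup it replaces is described in the message above.
public class DescriptionSort {
    public static void sortByDescription(List<String> ids, final Map<String, String> descriptions) {
        Collections.sort(ids, new Comparator<String>() {
            public int compare(String a, String b) {
                // Compare by the looked-up description, not the raw ID.
                return descriptions.get(a).compareTo(descriptions.get(b));
            }
        });
    }
}
```

Pre-resolving all ID-to-description values up front (e.g. into an array indexed by doc number) would avoid the per-comparison lookup cost the poster is worried about.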
Re: [Fwd: Re: indexing performance]
Hi,

> You need just the counts? And you want to do just whole-field matching,
> not word matching? In that case, Lucene might be overkill for you. Or, if
> you do use Lucene, make sure to use keyword (untokenized) fields, not
> tokenized fields.

Sorry for not elaborating my requirement more. Actually I have some fields that need word matching, and for some fields I do not need word matching. I have used NO_NORMS for the whole-field matches and TOKENIZED for the fields that need word matching. I need the count, and I also need to show the fields that are indexed. For example, the following criteria can be given by the user:

USER:john AND MSG:ftp

Here USER is a NO_NORMS field and MSG is a tokenized field. The original log message is as follows:

2007 Jan 27 10:10:01 User John accessed ftp url images.html

So I cannot compute the count in memory, as the criteria are selected by the user and are not predefined. Moreover, I have read the following thread dated 2002:

> My experiences are that the writing to the index takes the most time,
> except any parsing done by the user. I have been working on xml indexes,
> and here the collection of data takes just as much time as the writing.
> To increase speed I have done three things that reduced my index time
> from 11 hours to 2.5 hours for the same dataset (1.3gb of xml documents):
>
> 1: I index 50 documents into a ramdir, then when the limit is reached I
> merge this ramdir into a fsdir and flush the ramdir. This speeds things
> up, as I then don't have to use the fsdir as much, and ramdir is much
> faster.
>
> 2: Merging a large index into a large index takes nearly as much time as
> merging a small index into a large index, so I have 4 (any number will
> do) fsdirs that I write ramdirs to, and then I merge these fsdirs into
> one large fsdir at the end of a large index run.
>
> 3: I multithreaded my application, creating worker threads that each
> index into their own separate ramdir, then flush these ramdirs into
> separate fsdirs (hence I have a fsdir for each worker thread), because a
> dir can only be written to by one thread.
>
> In the end this improved my indexing time a lot... hope some of this can
> help you!
>
> mvh karl øie

Does this still hold good now? Thanks for your reply.

Regards,
MSK

[forwarded copy of Nadav Har'El's "Re: indexing performance" message snipped]
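Trick 1 in karl øie's post (buffer in RAM, merge to disk only in batches) can be sketched as follows. The buffer and the flush are simplified stand-ins, not Lucene code; in Lucene the flush would be an addIndexes() of the RAMDirectory into the FSDirectory followed by a fresh RAMDirectory:

```java
import java.util.*;

// A simplified sketch of RAM-buffered indexing: adds go to a fast in-memory
// buffer (a RAMDirectory in Lucene), and the slow on-disk merge happens only
// once per batchSize documents, amortizing the disk cost.
public class BatchedIndexer {
    private final int batchSize;
    private final List<String> ramBuffer = new ArrayList<String>();
    private int flushes = 0;

    public BatchedIndexer(int batchSize) {
        this.batchSize = batchSize;
    }

    public void addDocument(String doc) {
        ramBuffer.add(doc); // cheap in-memory add
        if (ramBuffer.size() >= batchSize) {
            flush(); // expensive on-disk merge, amortized over batchSize docs
        }
    }

    public void flush() {
        if (!ramBuffer.isEmpty()) {
            ramBuffer.clear(); // stands in for merging the RAM index to disk
            flushes++;
        }
    }

    public int flushCount() {
        return flushes;
    }
}
```

Trick 3 then runs one such batched indexer per worker thread, each writing to its own on-disk directory, since only one thread may write to a directory at a time.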
Re: TextMining.org Word extractor
On Feb 23, 2007, at 2:00 PM, [EMAIL PROTECTED] wrote:
> Re: TextMining.org Word extractor
> Someone noted that textmining.org gets hacked. There is test-mining.org which appears to be a commercial site.

Can someone tell me where to get the download of the original GPL textmining.org software?

Thanks.
RE: TextMining.org Word extractor
I can't speak to where you can get a copy of the original code, but the modified code I have is not GPL licensed - the license header in at least one file is as follows:

/*
 * Copyright 2004 Ryan Ackley
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

Regards,
Bruce Ritchie

-----Original Message-----
From: Bill Taylor [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 01, 2007 11:00 AM
To: java-user@lucene.apache.org
Subject: Re: TextMining.org Word extractor

[quote of the original message snipped]
Re: document field updates
On Feb 28, 2007, at 8:59 AM, Steven Parkes wrote:

Are unindexed fields stored separately from the main inverted index? If so, then one could implement the field value change as a delete and re-add of just that value?

The short answer is that won't work. Field values are stored in a different data structure than the postings lists, but docids are consistent across all contents of a segment. Deleting something and re-adding it is going to put it into a different segment, which is going to keep this from working. (Not to mention that you want the postings lists updated if you want it to be searchable ...)

Are you aware of some implementation of Lucene that solves this need well, with a second index for 'tags', complete with multi-index boolean queries?

I'm pretty sure this has been done, I'm just not 100% sure where. Does Nutch index link text? Nutch does do this sort of thing, but I'm not quite sure how. It isn't doing any operations to the Lucene index beyond what plain ol' Lucene does. I don't know if Solr has anything like this, but if I remember correctly, Collex has tags, though as far as I can tell it's not been open sourced (yet?)

Collex is quite open source, it's just ugly source :) We're the 'patacriticism' project at SourceForge, under the collex directory in Subversion. Collex implements tagging by implementing JOIN cross-references between user/tag documents and regular object documents. Its scalability is not going to be good at bigger numbers in its current architecture, but it works quite well for our 60k or so objects at the moment.

Erik
Re: document field updates
Erik Hatcher wrote: I'm pretty sure this has been done, I'm just not 100% sure where. Does Nutch index link text? Nutch does do this sort of thing, but I'm not quite sure how. It isn't doing any operations to the Lucene index beyond what plain ol' Lucene does. Nutch maintains a set of separate DBs (using Hadoop MapFile/SequenceFile), where inlinks are stored (together with their anchor text). During indexing this data is pulled in from the DBs piece by piece, using the URLs as primary keys. Nutch doesn't update _any_ data structures in-place - all update operations involve creating new data files and optionally deleting old data files. This also includes indexes - new indexes are created from newly updated pages, and then individual Lucene documents are deleted from older indexes to get rid of duplicates. After a while, really old indexes are removed completely, because their content is likely to be present in one of the newer indexes. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: document field updates
Collex is quite open source, it's just ugly source :) We're the 'patacriticism' project at SourceForge, under the collex directory in Subversion. Collex implements tagging by implementing JOIN cross-references between user/tag documents and regular object documents. Its scalability is not going to be good at bigger numbers in its current architecture, but it works quite well for our 60k or so objects at the moment. Have you implemented any code that enforces a Boolean query across these two indexes? Has anyone implemented a BooleanQuery class that operates across a set of Fields that may live in different Indexes?
Re: [Fwd: Re: indexing performance]
On 3/1/07, Saravana [EMAIL PROTECTED] wrote: Does this still hold good now? Thanks for your reply. Probably most of that still applies to some extent. However, it is unclear whether it will speed up your application. The first thing is to find out what your bottleneck is. Looking at the stats on your machine during indexing, is it io-bound? cpu-bound? mixed? There are various possible strategies, but they will come from fine-tuning your process to meet the bottlenecks you are experiencing. If you are cpu-bound, then perhaps you can use less intensive analyzers, or purchase a multi-cpu machine and index with multiple threads. If you are i/o bound, you could 1) buy faster disks, 2) use a faster i/o backend (e.g. RAID-0), 3) create indexes on multiple independent disks and merge later. regards, -Mike
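The third i/o-bound strategy above (independent indexes merged later) can be sketched like this with the Lucene 2.x API - the directory paths and analyzer are purely illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch: merge several independently built indexes into one.
// Each part index would live on its own physical disk.
public class MergeIndexes {
    public static void main(String[] args) throws Exception {
        Directory[] parts = {
            FSDirectory.getDirectory("/disk1/index", false),
            FSDirectory.getDirectory("/disk2/index", false),
            FSDirectory.getDirectory("/disk3/index", false),
        };
        IndexWriter writer =
            new IndexWriter("/merged/index", new StandardAnalyzer(), true);
        try {
            writer.addIndexes(parts); // copies and merges all segments
            writer.optimize();        // collapse to a single segment
        } finally {
            writer.close();
        }
    }
}
```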
Re: document field updates
On Mar 1, 2007, at 1:35 PM, Neal Richter wrote: Collex is quite open source, it's just ugly source :) We're the 'patacriticism' project at SourceForge, under the collex directory in Subversion. Collex implements tagging by implementing JOIN cross-references between user/tag documents and regular object documents. Its scalability is not going to be good at bigger numbers in its current architecture, but it works quite well for our 60k or so objects at the moment. Have you implemented any code that enforces a Boolean query across these two indexes? Actually it's a single index, with a type field that separates the two different types of documents (archive objects, or collectable objects). A pointer to this code is here: http://patacriticism.svn.sourceforge.net/viewvc/patacriticism/collex/trunk/src/solr/org/nines/CollectableCache.java?view=markup It's a hack that leverages some of Solr's facilities (but not near enough!). Erik
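The single-index-with-a-type-field approach described above amounts to stamping every document with a discriminator and cross-referencing by key. A sketch (the field names are hypothetical, not Collex's actual schema):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: one index, two "types" of documents told apart by a field.
public class TypedDocs {
    public static Document archiveObject(String id, String text) {
        Document doc = new Document();
        doc.add(new Field("type", "object",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("text", text, Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }

    public static Document tag(String objectId, String user, String label) {
        Document doc = new Document();
        doc.add(new Field("type", "tag",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Cross-reference back to the tagged object, JOIN-style.
        doc.add(new Field("object_id", objectId,
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("user", user, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("label", label, Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }
}
```

The "join" itself then takes two queries: one to find matching tag documents, and one to fetch the objects whose ids they reference - which is the scalability limit Erik mentions.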
More long running queries
I'm still having issues with long-running queries. I'm using a custom HitCollector to bring back ALL docs that match a search, as suggested in a previous post/reply (e.g. Nutch LuceneQueryOptimizer). This solution works most of the time; however, when testing a very complex query using several range queries and term queries, we're seeing times in the 40 sec range with NO HITS returned. The index contains approx. one million docs and the number of Boolean expressions created is well over 100,000. Tim
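For reference, the collect-everything approach mentioned above is usually a small HitCollector subclass along these lines (a sketch against the Lucene 2.1 API; collecting ids into a List is just one possible design):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Sketch: collect every matching doc id, avoiding the score-sorted,
// batched retrieval that the Hits class performs.
public class AllDocsCollector extends HitCollector {
    private final List ids = new ArrayList();

    public void collect(int doc, float score) {
        ids.add(new Integer(doc)); // called once per matching document
    }

    public List getIds() {
        return ids;
    }

    public static List collectAll(IndexSearcher searcher, Query query)
            throws java.io.IOException {
        AllDocsCollector collector = new AllDocsCollector();
        searcher.search(query, collector);
        return collector.getIds();
    }
}
```

Note the collector doesn't help with Tim's cost: with many range queries, query rewriting itself expands into huge BooleanQuery trees before any collecting starts.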
RE: Soliciting Design Thoughts on Date Searching
If all you want to do is find docs containing dates within a range, it probably doesn't make much difference whether you give dates their own field or put them into your content field. It'll probably be easier to just add them into the token stream, since that's the way the analyzer architecture wants to work (analyzers generally don't know anything about fields.) You can make the position increment work if you want, and it'll make phrase/span queries work better, if you need those to work. What is going to matter in either case is how you format dates. Everything in Lucene is text, so if you want to do date ranges (which you mentioned in your first e-mail), you need to be careful how you format the dates and what kinds of queries you use. See, for example, http://lucene.apache.org/java/docs/api/org/apache/lucene/document/DateTools.html (tinyurl: http://tinyurl.com/ejlvx) and http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing (tinyurl: http://tinyurl.com/2pubaq) There are also date filters (as opposed to date queries) that have different tradeoffs. Dates are kinda tricky in Lucene. -Original Message- From: Walt Stoneburner [mailto:[EMAIL PROTECTED] Sent: Thursday, March 01, 2007 7:54 AM To: java-user@lucene.apache.org Subject: Re: Soliciting Design Thoughts on Date Searching Thank you all for the suggestions steering me down the right path. As an aside, the easy part, at least for me, is extracting the dates -- Peter was dead on about how to do that: heuristics, multiple regular expressions, and data structures. As Steve pointed out, this isn't as trivial as it sounds - there are a lot of formats, some ambiguous. I love writing parsers (guess I'm sick in the head, eh?), so getting the data isn't the problem, it's knowing what format to convert it into and how to hand it to Lucene in a way that it'll find meaningful for searching. 
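To make the formatting point above concrete: range queries compare terms as plain strings, so the stored form must sort lexicographically in the same order as the dates themselves. DateTools produces strings with this property; the property itself can be shown with nothing but the JDK (a sketch - yyyyMMdd is just one resolution, and the UTC choice is illustrative):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch: a yyyyMMdd rendering makes string order match date order,
// which is the property Lucene range queries rely on.
public class DateFieldDemo {
    static String toIndexForm(Date d) {
        SimpleDateFormat f = new SimpleDateFormat("yyyyMMdd");
        f.setTimeZone(TimeZone.getTimeZone("UTC")); // keep terms unambiguous
        return f.format(d);
    }

    public static void main(String[] args) {
        String epoch = toIndexForm(new Date(0L));             // 1970-01-01
        String later = toIndexForm(new Date(1172707200000L)); // 2007-03-01
        // String comparison agrees with chronological order:
        System.out.println(epoch + " < " + later + " : "
                           + (epoch.compareTo(later) < 0));
    }
}
```

A format like 27-Feb-1968, by contrast, sorts by day-of-month first and breaks range queries.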
I had pondered making a single field with a value like: document.add( Field.Text( "dates", "27-Feb-1968,04-Jul-1776,01-Mar-2007" )); ...but I wasn't convinced that the Lucene date Range was going to work on anything other than a Date type, rather than a string of text that just coincidentally happened to contain dates. Drawing back on my title example, I was under the incorrect impression that if I had a field and provided another value, it replaced the prior value. Hoss is indicating this is not so, and that I'm safe adding additional values. document.add( Field.Text( "title", "Thanks Thomas" )); document.add( Field.Text( "title", "Thanks Hoss" ) ); // Does not stomp on Thomas. Yay! If I can use this technique to pile in a ton of dates, then I'm totally happy; you guys have pointed me in the right direction; celebrations all around. The brain scratcher, for me, was Peter's treating the dates like a synonym -- a clever way of looking at the problem. Unfortunately, that'd be giving me too much credit, as I haven't played with that feature set of Lucene. So, without trying to, Peter's sent me scrambling back to the API for something I wasn't aware was there. Steve adds to the mystery by suggesting a delimited field list, much like the example at the top of this message, and likewise doing some trickery with the token stream and a position increment of zero -- again, a clever solution, and likewise beyond my limited Lucene experience. While I know, intellectually, that Lucene is digesting positioned tokens, it is so well designed that fools like me can legitimately use Lucene for long periods of time without actually being exposed to what's happening under the hood. The ponderance I now contemplate as a newbie (I've downgraded my self assessment after this discussion) is knowing whether the token-stream solution or the multiple-add solution is the pedantic one. Are there performance advantages to one way over the other? 
I'll be totally stunned if someone offers up that they're logically the same thing. I swear, conversing with you guys is giving me a very deep sense of appreciation for your skills and Lucene's capabilities. -wls
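The multiple-add technique discussed above looks like this in the Lucene 2.x API (the field name and date values are illustrative; Field.Text from the earlier example is the deprecated 1.x-style factory):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: a field added several times holds several values;
// later adds do not replace earlier ones.
public class MultiValueDates {
    public static Document build() {
        Document doc = new Document();
        // Each date becomes a separate indexed value of the same field.
        doc.add(new Field("date", "19680227",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("date", "17760704",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("date", "20070301",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }
}
```

Internally this ends up very close to the token-stream approach: the values all land in the same field's postings, differing mainly in how positions are assigned.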
Field Selector in Searcher interface
What are the odds (or reasons against) of bubbling up doc(int, FieldSelector) to Searcher? I would love to take advantage of the selective field loading, but I am working with MultiSearchers and Searchers, so I cannot count on getReader (in IndexSearcher) for access. - Mark
Re: Field Selector in Searcher interface
The odds increase significantly in correlation to patches submitted! :-) The odds increase slightly by at least filing an enhancement issue in JIRA. They increase a tiny bit by bringing it up here! I may have some time in the not too distant future for this, but we always appreciate the help. Looking briefly at the Searchable interface, it does seem to make sense, but that is just my quick-glance take on it. -Grant On Mar 1, 2007, at 8:15 PM, Mark Miller wrote: What are the odds (or reasons against) of bubbling up doc(int, FieldSelector) to Searcher? I would love to take advantage of the selective field loading, but I am working with MultiSearchers and Searchers, so I cannot count on getReader (in IndexSearcher) for access. - Mark
Re: updating index
Doron Cohen wrote: Once indexing the database_id field this way, the newly added API IndexWriter.updateDocument() may also be useful. Whoa, nice convenience method. I don't suppose the new document happens to be given the same ID as the old one. That would make many people's lives much easier. :-) Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699 Web: http://nuix.com/ Fax: +61 2 9212 6902 This message is intended only for the named recipient. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this message or attachment is strictly prohibited.
Re: updating index
Daniel Noll [EMAIL PROTECTED] wrote on 01/03/2007 22:10:15: API IndexWriter.updateDocument() may be useful. Whoa, nice convenience method. I don't suppose the new document happens to be given the same ID as the old one. That would make many people's lives much easier. :-) Oh no, that aspect is as it was - the document(s) are deleted and re-added. However, due to the buffering of deletes in IndexWriter, the application no longer needs to take care of batching the deletes for performance reasons - this is taken care of by IndexWriter.
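The convenience method under discussion can be sketched like this (the "database_id" field name follows the thread; the "body" field is illustrative). updateDocument(Term, Document) deletes every document matching the term and adds the replacement in one call:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Sketch: replace a document by its unique key in one call.
public class UpdateById {
    public static void update(IndexWriter writer, String id, String body)
            throws java.io.IOException {
        Document doc = new Document();
        doc.add(new Field("database_id", id,
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("body", body,
                          Field.Store.NO, Field.Index.TOKENIZED));
        // Deletes all docs whose database_id matches, then adds doc.
        // The delete is buffered and batched internally by IndexWriter.
        writer.updateDocument(new Term("database_id", id), doc);
    }
}
```

As Doron notes, the replacement still receives a new internal docid; only the delete-batching bookkeeping moves into IndexWriter.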