Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread 叶双明
No documents can be added to an index while it is being optimized, and optimization can't run while documents are being added to the index. So, barring other errors, I think we can believe the two indexes are indeed the same. :) 2008/9/4 Noble Paul നോബിള്‍ नोब्ळ् [EMAIL PROTECTED] The use case is as follows

Re: delete/reset the index

2008-09-04 Thread Michael McCandless
If you're on Windows, the safest way to do this in general, if there is any possibility that readers are still using the index, is to create a new IndexWriter with create=true. Windows does not let you remove open files. IndexWriter will gracefully handle failed deletes by retrying

Re: getTimestamp method in IndexCommit

2008-09-04 Thread Noble Paul നോബിള്‍ नोब्ळ्
YOU ARE FAST thanks. --Noble On Thu, Sep 4, 2008 at 2:54 PM, Michael McCandless [EMAIL PROTECTED] wrote: Noble Paul നോബിള്‍ नोब्ळ् wrote: On Wed, Sep 3, 2008 at 2:06 PM, Michael McCandless [EMAIL PROTECTED] wrote: Noble Paul നോബിള്‍ नोब्ळ् wrote: On Tue, Sep 2, 2008 at 1:56 PM, Michael

Re: getTimestamp method in IndexCommit

2008-09-04 Thread Michael McCandless
Thanks for raising it! It's through requests like this that Lucene's API improves. Mike Noble Paul നോബിള്‍ नोब्ळ् wrote: YOU ARE FAST thanks. --Noble On Thu, Sep 4, 2008 at 2:54 PM, Michael McCandless [EMAIL PROTECTED] wrote: Noble Paul നോബിള്‍ नोब्ळ् wrote: On Wed, Sep 3, 2008 at

Re: delete/reset the index

2008-09-04 Thread 叶双明
I agree with Michael McCandless! That way, it is handled gracefully. 2008/9/4 Michael McCandless [EMAIL PROTECTED] If you're on Windows, the safest way to do this in general, if there is any possibility that readers are still using the index, is to create a new IndexWriter with

string similarity measures

2008-09-04 Thread Cam Bazz
Hello, This came up before, but if we were to make a swear word filter, string edit distances are no good. For example, a word like `shot` is confused with `shit`. There is also a problem with words like hitchcock. Apparently I need something like Soundex or Double Metaphone. The thing is - these
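
To make the edit-distance concern in this post concrete, here is a minimal sketch of plain Levenshtein distance using only the Java standard library. The class and method names are hypothetical and are not part of Lucene; the point is only that `shot` and `shit` are a single substitution apart, so a distance threshold cannot tell them apart.

```java
final class EditDistance {
    // Classic dynamic-programming Levenshtein distance, keeping two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("shot", "shit")); // prints 1
    }
}
```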

Re: Realtime Search for Social Networks Collaboration

2008-09-04 Thread Cam Bazz
Hello Jason, I have been trying to do this for a long time on my own. Keep up the good work. What I tried was a document cache using Apache Collections, and before an index write/delete I would sync the cache with the index. I am waiting for Lucene 2.4 to proceed (delete by query). Best. On Wed, Sep

Re: string similarity measures

2008-09-04 Thread Karl Wettin
On 4 Sep 2008, at 14:38, Cam Bazz wrote: Hello, This came up before, but if we were to make a swear word filter, string edit distances are no good. For example, a word like `shot` is confused with `shit`. There is also a problem with words like hitchcock. Apparently I need something like

Re: string similarity measures

2008-09-04 Thread Karl Wettin
On 4 Sep 2008, at 15:54, Cam Bazz wrote: Yes, I already have a system for users reporting words. They land on an operator screen, and if the operator approves, or if 3 other people marked it as a curse, then it is filtered. In the other thread you wrote: I would create 1-5 ngram sized shingles and
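
As a rough illustration of the "1-5 ngram sized shingles" mentioned in this snippet, the sketch below generates character n-grams of every size from 1 to 5 for a single word. It assumes character-level n-grams (the quoted message is cut off, so the intended unit and the next step are not known), and the class and method names are hypothetical rather than Lucene's own.

```java
import java.util.ArrayList;
import java.util.List;

final class CharNGrams {
    // Returns all character n-grams of the word for sizes minSize..maxSize,
    // smallest sizes first, left to right within each size.
    static List<String> ngrams(String word, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                out.add(word.substring(i, i + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A 4-letter word has no 5-grams, so only sizes 1..4 contribute.
        System.out.println(ngrams("shot", 1, 5));
    }
}
```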

Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread Michael McCandless
Sorry, I should have said: you must always use the same writer, i.e., as of 2.3, while IndexWriter.optimize (or normal segment merging) is running under one thread, another thread can use that *same* writer to add/delete/update documents, and both are free to make changes to the index.

Re: string similarity measures

2008-09-04 Thread Cam Bazz
Let me rephrase the problem. I already have a set of bad words. I want to avoid people inputting typos of the bad words. For example, 'shit' is banned, but someone may enter sh1t. How can I flag words that are phonetically similar to the marked bad words? Best. On Thu, Sep 4, 2008 at 5:02 PM,
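
One simple way to catch the `sh1t` case from this message is to normalize common character substitutions before checking the banned list. The sketch below is only that: the substitution table, class name, and method names are assumptions made for illustration, it uses plain Java (9 or later, for `Map.of`) rather than any Lucene API, and it does not handle phonetic similarity in general.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class BadWordNormalizer {
    // Hypothetical table of look-alike characters; extend as needed.
    private static final Map<Character, Character> SUBSTITUTIONS = Map.of(
            '1', 'i', '!', 'i', '3', 'e', '4', 'a', '@', 'a',
            '0', 'o', '5', 's', '$', 's', '7', 't');

    // Lower-cases the input and maps look-alike characters back to letters.
    static String normalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (char c : input.toLowerCase(Locale.ROOT).toCharArray()) {
            char mapped = SUBSTITUTIONS.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    // True if the normalized input is exactly one of the banned words.
    static boolean isBanned(String input, Set<String> bannedWords) {
        return bannedWords.contains(normalize(input));
    }

    public static void main(String[] args) {
        Set<String> banned = Set.of("shit");
        System.out.println(isBanned("sh1t", banned)); // prints true
        System.out.println(isBanned("shot", banned)); // prints false
    }
}
```

Because the check is an exact match after normalization, `shot` is not flagged, which avoids the false positive that a raw edit-distance threshold would produce.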

lucene ram buffering

2008-09-04 Thread Cam Bazz
Hello, I was reading the performance optimization guides and found: writer.setRAMBufferSizeMB() combined with: writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH); This can be used to flush automatically, so if the RAM buffer size goes over a certain limit it will flush. Now the

Re: How can we know if 2 lucene indexes are same?

2008-09-04 Thread 叶双明
I see now. Thanks, Michael McCandless, good explanation! 2008/9/4, Michael McCandless [EMAIL PROTECTED]: Sorry, I should have said: you must always use the same writer, i.e., as of 2.3, while IndexWriter.optimize (or normal segment merging) is running under one thread, another thread can use that

Re: string similarity measures

2008-09-04 Thread mathieu
I submitted a patch to handle Aspell phonetic rules. You can find it in JIRA. On Thu, 4 Sep 2008 17:07:09 +0300, Cam Bazz [EMAIL PROTECTED] wrote: Let me rephrase the problem. I already have a set of bad words. I want to avoid people inputting typos of the bad words. For example, 'shit' is

ramdisks

2008-09-04 Thread Cam Bazz
Hello, is anyone using ramdisks for storage? There is RamSan and there is also Fusion-io, but they are kind of expensive. Are there any other alternatives, I wonder? Best.

Re: Newbie question: using Lucene to index hierarchical information.

2008-09-04 Thread Leonid Maslov
Hi all, Thanks a lot for such a quick reply. Both scenarios sound very good to me. I would like to do my best and try to implement either of them (as a proof of concept) and then incrementally improve, retest, investigate and rewrite them :) So, from the soap opera to the question part then:

Lucene debug logging?

2008-09-04 Thread Justin Grunau
Is there a way to turn on debug logging / trace logging for Lucene?

Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Justin Grunau
We have some code that uses Lucene which has been working perfectly well for several months. Recently, a QA team in our organization has set up a server with a much larger data set than we have ever tested with in the past: the resulting Lucene index is about 3 GB in size. On this particular

Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Leonid M.
* And what about the visibility filter? * Are you sure no one else accesses the IndexReader and modifies the index? Check reader.maxDoc() to be sure. On Fri, Sep 5, 2008 at 12:19 AM, Justin Grunau [EMAIL PROTECTED] wrote: We have some code that uses Lucene which has been working perfectly well for

Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Justin Grunau
Sorry, I forgot to include the visibility filters: final BooleanQuery visibilityFilter = new BooleanQuery(); visibilityFilter.add(new TermQuery(new Term("isPublic", "true")), Occur.SHOULD); visibilityFilter.add(new TermQuery(new

Re: Lucene debug logging?

2008-09-04 Thread Daniel Naber
On Thursday, 4 September 2008, Justin Grunau wrote: Is there a way to turn on debug logging / trace logging for Lucene? You can use IndexWriter's setInfoStream(). Besides that, Lucene doesn't do any logging AFAIK. Are you experiencing any problems that you want to diagnose with debugging?

Re: Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds

2008-09-04 Thread Leonid M.
Anyway, it is worth trying (to ensure docs aren't removed between searches). What if you run MatchAllDocsQuery or something similar? Are you still getting different hit counts when the query is rerun? PS. I'm kind of a newbie with Lucene and the Lucene API, so don't take my notes too seriously :) On Fri, Sep 5, 2008 at 12:46

Re: Lucene debug logging?

2008-09-04 Thread Justin Grunau
Daniel, yes, please see my "Problem with lucene search starting to return 0 hits when a few seconds earlier it was returning hundreds" thread. - Original Message From: Daniel Naber [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, September 4, 2008 6:10:56 PM Subject:

Re: PhraseQuery issues - differences with SpanNearQuery

2008-09-04 Thread Paul Elschot
On Thursday 04 September 2008 20:39:13, Mark Miller wrote: Sounds like it's more in line with what you are looking for. If I remember correctly, the phrase query factors in the edit distance in scoring, but the SpanNearQuery will just use the combined idf for each of the terms in it, so

Re: QueryParser vs. BooleanQuery

2008-09-04 Thread 叶双明
Indeed, StandardAnalyzer removes the pluses, so it analyzes 'c++' to 'c'. QueryParser produces Terms that have been analyzed, while a BooleanQuery built by hand contains Terms that have not been analyzed. I think this is the difference between them. 2008/9/4 Ian Lea [EMAIL PROTECTED] Have a look at the index with Luke to

Re: Beginner: Specific indexing

2008-09-04 Thread Chris Hostetter
Honestly: your problem doesn't sound like a Lucene problem to me at all ... I would write custom code to check your files for the pattern you are looking for. If you find it, *then* construct a Document object and add your 3 fields. I probably wouldn't even use an analyzer. -Hoss

Javadoc wording in IndexWriter.addIndexesNoOptimize()

2008-09-04 Thread Antony Bowesman
The Javadoc for this method has the following comment: This requires this index not be among those to be added, and the upper bound* of those segment doc counts not exceed maxMergeDocs. What does the second part of that mean, which is especially confusing given that MAX_MERGE_DOCS is

Re: Hits document offset information

2008-09-04 Thread Chris Hostetter
: Now, I would like to access the best fragments' offsets from each : document (hits.doc(i)). I seem to recall that the recommended method for doing this is to subclass your favorite Formatter and record the information from each TokenGroup before delegating to the super class. But there

Merging indexes - which is best option?

2008-09-04 Thread Antony Bowesman
I am creating several temporary batches of indexes in separate indices and will periodically merge those batches into a set of master indices. I'm using IndexWriter#addIndexesNoOptimize(), but the problem that gives me is that the master may already contain the index for that document and I get a