Wild carded phrases
Hi all,

First of all, well done to the implementers of Lucene. The performance is incredible! We get search results within 20-40 ms on an index of about 1.5 GB. I could not find a Lucene mailing-list search engine, something I am a bit surprised about!

My question is how I can implement wildcarded phrase searches like: boiler replac*

This should pick up the text "boiler replacement" and "boiler replacing", but not "boiling replacement" or "boiler user replacement".

I am using the QueryParser through the Spring Lucene module. I simply tried textToSearch = boiler replac*, but this did not work as anticipated. I have not analyzed it properly, but it seemed to interpret this as: boiler OR replac*

Is there a way to implement this?

Many thanks,
Jon

BiP Solutions Limited is a company registered in Scotland with Company Number SC086146 and VAT number 38303966, and having its registered office at Park House, 300 Glasgow Road, Shawfield, Glasgow, G73 1SQ. This e-mail (and any attachment) is intended only for the attention of the addressee(s). Its unauthorised use, disclosure, storage or copying is not permitted. If you are not the intended recipient, please destroy all copies and inform the sender by return e-mail. This e-mail (whether you are the sender or the recipient) may be monitored, recorded and retained by BiP Solutions Ltd. E-mail monitoring/blocking software may be used, and e-mail content may be read at any time. You have a responsibility to ensure laws are not broken when composing or forwarding e-mails and their contents.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Wild carded phrases
Hi,

Here's a searchable mailing list archive: http://www.gossamer-threads.com/lists/lucene/java-user/

As regards the wildcard phrase queries, here's one way I think you could do it, but it's a bit of extra work. If you're using QueryParser, you'd have to override the getFieldQuery method to use span queries instead of phrase queries. An exact phrase query can be implemented as a span query with a slop of 0 and in-order matching. So, once you have the PhraseQuery object, you would:

1. Extract the terms.
2. For each one, check if it contains a * or a ?.
3. If it does, create a WildcardQuery using that term, and rewrite it using the Query.rewrite(IndexReader) method. This expands the wildcard query into all its matches.
4. Create an array of SpanTermQuery objects (one SpanTermQuery for each term that matched your wildcard); then add that array to a SpanOrQuery.
5. Repeat 2 to 4 for each wildcard term in the phrase.
6. Finally (!), create a SpanNearQuery, adding all the original terms in order, but substituting your SpanOrQuerys for the wildcard terms. Use a slop of 0, and set the inOrder flag to true.

So, essentially, you'd end up with (you'll have to excuse me if I haven't rendered the span queries correctly as strings here, but this should give the general idea):

spanNear[boiler (spanOr[replacement replacing])]

So it will accept *either* replacement or replacing adjacent to boiler, which is what you want. As you can see, it's a bit of work, but if you add this functionality to the QueryParser, you can re-use it a lot!

Hope that helps!

-JB
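A rough sketch of steps 1-6 above, assuming the Lucene 2.x span-query and rewrite APIs (the class and method names here are illustrative, not part of any shipped QueryParser):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class WildcardPhraseBuilder {

    /**
     * Builds a span query for a phrase whose terms may contain * or ?.
     * Error handling (and BooleanQuery.TooManyClauses on very broad
     * wildcards) is omitted for brevity.
     */
    public static SpanQuery build(String field, String[] terms, IndexReader reader)
            throws IOException {
        List<SpanQuery> clauses = new ArrayList<SpanQuery>();
        for (String t : terms) {
            if (t.indexOf('*') >= 0 || t.indexOf('?') >= 0) {
                // Step 3: expand the wildcard against the index; in
                // Lucene 2.x the rewritten form is a BooleanQuery of TermQuerys.
                Query expanded = new WildcardQuery(new Term(field, t)).rewrite(reader);
                // Step 4: one SpanTermQuery per matched term, OR'ed together.
                List<SpanQuery> spans = new ArrayList<SpanQuery>();
                for (BooleanClause bc : ((BooleanQuery) expanded).getClauses()) {
                    Term matched = ((TermQuery) bc.getQuery()).getTerm();
                    spans.add(new SpanTermQuery(matched));
                }
                clauses.add(new SpanOrQuery(spans.toArray(new SpanQuery[spans.size()])));
            } else {
                clauses.add(new SpanTermQuery(new Term(field, t)));
            }
        }
        // Step 6: terms must appear adjacent (slop 0) and in order.
        return new SpanNearQuery(clauses.toArray(new SpanQuery[clauses.size()]), 0, true);
    }
}
```

For the example in the question, build("text", new String[] {"boiler", "replac*"}, reader) would produce the spanNear[boiler (spanOr[replacement replacing ...])] structure described above.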
Updating Lucene Index Dynamically
Hi,

I am using Lucene 2.1.0 at the moment, and I have huge data which is being indexed. I am re-indexing my data on a daily basis. Now I would like to index my data dynamically at any point in time. I cannot afford to re-index the whole data set due to its huge size and the time it requires. How can I update my index dynamically? Any suggestions?

Aamir Yaseen
Senior Java Developer
Global DataPoint Ltd
Middlesex House, 34-42 Cleveland Street
London W1T 4LB, UK
T +44 (0)20 7323 0323 Ext: 4829
M +44 (0)7951 895299
www.globaldatapoint.com

This e-mail is confidential and should not be used by anyone who is not the original intended recipient. Global DataPoint Limited does not accept liability for any statements made which are clearly the sender's own and not expressly made on behalf of Global DataPoint Limited. No contracts may be concluded on behalf of Global DataPoint Limited by means of e-mail communication. Global DataPoint Limited, registered in England and Wales with registered number 3739752. Registered Office: Middlesex House, 34-42 Cleveland Street, London W1T 4LB
RE: lucene farsi problem
Hi Esra,

On 05/07/2008 at 11:49 AM, Steven A Rowe wrote:

At Chris Hostetter's suggestion, I am rewriting the patch attached to LUCENE-1279, including the following changes:

- Merged the contents of the CollatingRangeQuery class into RangeQuery and RangeFilter
- Switched the Locale parameter to instead take an instance of Collator
- Modified QueryParser.jj to construct a QueryParser class that can accept a range collator and pass it either to RangeQuery or through ConstantScoreRangeQuery to RangeFilter

I have attached the above-described revised patch to LUCENE-1279. Esra, if you get a chance, could you try it out? The implementation hasn't changed (except for the cosmetic changes noted above) -- you'll just be using RangeQuery instead of CollatingRangeQuery.

Thanks,
Steve
theoretical maximum score
Is it possible to compute a theoretical maximum score for a given query if constraints are placed on 'tf' and 'lengthNorm'? If so, scores could be compared to a 'perfect score' (a feature request from our customers).

Here are some related threads on this.

In this thread: http://www.nabble.com/Newbie-questions-re%3A-scoring-td4228776.html#a4228776 Hoss writes:

the only way I can think of to fairly compare scores from queries for foo:bar with queries for yak:baz is to normalize them relative to a maximum possible score across the entire term query space -- but finding that maximum is a pretty complicated problem just for simple term queries ... when you start talking about more complicated query structures you really get messy -- and even then it's only fair as long as the query structures are identical, you can never compare the scores from apples and oranges

And in this thread: http://www.nabble.com/non-relative-scoring-td8956299.html#a8956299 Walt writes:

A tf.idf engine, like Lucene, might not have a maximum score. What if a document contains the word a thousand times? A million times?

It seems that if 'tf' is limited to a max value and 'lengthNorm' is a constant, it might be possible, at least for 'simple' term queries. But Hoss says that things get messy with complicated queries. Could someone elaborate a bit? Does the index contain enough info to do this efficiently? I realize that score values must be interpreted 'carefully', but I'm seeing a push to get more leverage from the absolute values, not just the relative values.

Peter
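For context, Lucene's default Similarity (per the 2.x javadocs) scores roughly as sketched below; the bound in the second line is my own derivation under the constraints the question proposes, not something the index computes for you:

```latex
% Lucene's default Similarity, per the 2.x javadocs:
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
  \sum_{t \in q} \mathrm{tf}(t,d)\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{boost}(t)\cdot \mathrm{norm}(t,d)

% If tf(t,d) \le T for all documents and norm(t,d) = N is a constant, then
\mathrm{score}(q,d) \le \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
  \sum_{t \in q} T\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{boost}(t)\cdot N
```

So for simple term queries the bound is computable from index statistics alone. Note, though, that idf(t) and queryNorm(q) still vary with the index and the query, so the "maximum" is per-index and per-query-structure, which is part of why Hoss says cross-query comparison gets messy.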
Re: Updating Lucene Index Dynamically
See the IndexModifier class. This assumes that by dynamically modify you mean changing existing documents.

If all you're doing is adding new documents, you can freely add new docs to an existing index. There's a parameter on IndexWriter that determines whether your index is opened for appending or overwritten.

If these don't work for you, perhaps you could explain more about how your data changes so better suggestions can be offered.

Best
Erick
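For the change-existing-documents case, a minimal sketch against Lucene 2.1's IndexModifier; it assumes each document carries a unique key field (the "id" and "body" field names and the method name are illustrative):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexModifier;
import org.apache.lucene.index.Term;

public class IncrementalIndexer {

    /** Replaces (or adds) a single document, keyed by a unique "id" field. */
    public static void upsert(String indexDir, String id, String body) throws IOException {
        // create=false: open the existing index for appending, don't overwrite it.
        IndexModifier modifier = new IndexModifier(indexDir, new StandardAnalyzer(), false);
        try {
            // Delete any previous version of this document ...
            modifier.deleteDocuments(new Term("id", id));
            // ... then add the new version.
            Document doc = new Document();
            doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", body, Field.Store.YES, Field.Index.TOKENIZED));
            modifier.addDocument(doc);
        } finally {
            modifier.close();
        }
    }
}
```

Only the changed documents are touched, so the nightly full rebuild can be replaced by many small updates like this.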
Using stored fields for scoring
Hi all,

I am looking for a way to include a stored (non-indexed) field in the computation of scores for a query. I have tried using a ValueSourceQuery with a ValueSource subclass that simply retrieves the document and gets the field, like:

public float floatVal(int doc) {
    reader.document(doc, selector).getBinaryValue("myfield");
}

but that's too slow, because it ends up doing a lookup for each matching document.

Is it possible to use a stored field in a FunctionQuery or ValueSourceQuery in an efficient way (i.e. not dependent on the number of retrieved documents)? If the answer is yes, is it possible to update such a value in place without removing and reindexing the document?

Thanks in advance.

Paolo Capriotti
Re: Using stored fields for scoring
Well, all things are possible <g>. But I don't think there's a way to get the field from each document at scoring time efficiently. It looks like you're already lazy-loading the field, which was going to be my suggestion. You could get it much faster if you *did* index it (UN_TOKENIZED?) and went after it with TermDocs/TermEnum.

So what is the nature of the field you're using? Is it possible to build up the list of doc -> binary-field pairs at, say, startup time and just use a map or some such?

You could even think about putting all the binary data in your index in a special document that had a field (or fields) orthogonal to all other documents. Essentially, take the map I suggested earlier and stuff it in a doc with one field (say, MySpecialMapField). Then read *that* document in at startup (or even search time) to get your binary field for scoring. All this pre-supposes that your binary field/doc_id map will fit in memory.

What about index-time boosting? This only does you good if your binary data above is some sort of importance ranking. Index-time boosting says something like "this document's title is more important than normal", so this would *automatically* affect your scoring. You'd have to apply the index-time boosts selectively to the fields you want.

And if none of this is relevant, could you expand a bit more on what you're trying to do? What is the nature and purpose of the field you want to use to influence scoring?

Best
Erick
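The startup-time map Erick suggests could look roughly like this, assuming the value is indexed as an UN_TOKENIZED string field named "rank" (the field name and the string encoding of the value are assumptions):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class RankLoader {

    /**
     * Builds a docId -> value map once at startup by walking the term
     * dictionary of the "rank" field with TermEnum/TermDocs, rather
     * than loading each stored document at scoring time.
     */
    public static Map<Integer, Float> loadRanks(IndexReader reader) throws IOException {
        Map<Integer, Float> ranks = new HashMap<Integer, Float>();
        // Position the enum at the first term of the "rank" field.
        TermEnum terms = reader.terms(new Term("rank", ""));
        TermDocs docs = reader.termDocs();
        try {
            do {
                Term t = terms.term();
                // Stop once we run past the "rank" field's terms.
                if (t == null || !"rank".equals(t.field())) {
                    break;
                }
                Float value = Float.valueOf(t.text());
                docs.seek(t);
                while (docs.next()) {
                    ranks.put(Integer.valueOf(docs.doc()), value);
                }
            } while (terms.next());
        } finally {
            terms.close();
            docs.close();
        }
        return ranks;
    }
}
```

A ValueSource's floatVal(int doc) can then be a plain map lookup, independent of how many documents the query retrieves. The map must be rebuilt when the reader is reopened, and updating a value still means reindexing that document.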
Is this the right way to use Lucene in multithread env?
Hi,

Here is how I am using Lucene. I build the index (from different data sources) during midnight. I build an FSDirectory, then I load it into a RAMDirectory for the best performance. When I built it, I called IndexWriter.optimize() once. Once the index is built, I will never update it.

I have a static variable defined as IndexSearcher. Once I load the RAMDirectory, I do:

newIndexDirectory = new RAMDirectory(fsDirectory);
IndexWriter newWriter = new IndexWriter(newIndexDirectory, new StandardAnalyzer(), true);
newWriter.optimize();
newWriter.close();
searcher = new IndexSearcher(newIndexDirectory);

For every new search, I do:

QueryParser parser = new QueryParser(field1, new StandardAnalyzer());
Query query = parser.parse(queryString);
Hits hits = searcher.search(query);

Is this the right way? Do I need to close the parser, query or hits? As I have only one IndexSearcher, will it cause any problem? I found that using the same query does not always give me the same response time.

Thanks much.

--
View this message in context: http://www.nabble.com/Is-this-the-right-way-to-use-Lucene-in-multithread-env--tp17150728p17150728.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Is this the right way to use Lucene in multithread env?
Hi,

No need to close the parser, and it is good to use the same searcher. I don't understand why you have that IndexWriter there if you are searching... Also, you may not benefit from explicitly loading the index into RAM. Try without it first.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
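A minimal sketch of Otis's point, assuming Lucene 2.x: the RAMDirectory copy needs no IndexWriter at all (and, note, opening one with create=true would actually overwrite the fresh copy), and a single IndexSearcher is safe to share across search threads:

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class SearcherHolder {

    // One IndexSearcher shared by all threads; IndexSearcher is
    // thread-safe for concurrent searches against a static index.
    private static IndexSearcher searcher;

    public static synchronized void open(String indexPath) throws IOException {
        // Copying into a RAMDirectory is optional (try plain FSDirectory
        // first); no IndexWriter, no extra optimize() on the copy.
        RAMDirectory ram = new RAMDirectory(FSDirectory.getDirectory(indexPath));
        searcher = new IndexSearcher(ram);
    }

    public static IndexSearcher get() {
        return searcher;
    }
}
```

Query and Hits objects need no closing; only the searcher (and its directory) should be closed when the index is swapped out.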
Re: Is this the right way to use Lucene in multithread env?
Hi,

I am creating a new IndexWriter to optimize the directory. I will try using the FSDirectory directly later.

--
View this message in context: http://www.nabble.com/Is-this-the-right-way-to-use-Lucene-in-multithread-env--tp17150728p17152621.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Fwd: Snowball not finding purple
For some reason it seems that either Lucene or Snowball has a problem with the color purple. According to the Snowball experts, the problem is with Lucene. Can anyone shed any light?

Thanks,
Steve

---------- Forwarded message ----------
From: Stephen Cresswell [EMAIL PROTECTED]
Date: 2008/4/22
Subject: Snowball not finding purple
To: [EMAIL PROTECTED]

Hi,

I'm using Compass/Lucene + Snowball/English to search the following text, which appears in several documents:

"The road was a ribbon of moonlight looping the purple moor"

Searching for the word "ribbon" returns the document, but not the word "purple":

[945354] compass.DefaultSearchableMethodFactory search defaults: {max=10, offset=0, reload=false, escape=false}
[945397] search.DefaultSearchMethod query: [+(+(name:ribbon^8.0 firstMessageText:ribbon^0.0 text:ribbon)) +(alias:ALIASConversationALIAS)], [4] hits, took [2] millis
[956176] compass.DefaultSearchableMethodFactory search defaults: {max=10, offset=0, reload=false, escape=false}
[956184] search.DefaultSearchMethod query: [+(+(name:purple^8.0 firstMessageText:purple^0.0 text:purple)) +(alias:ALIASConversationALIAS)], [0] hits, took [1] millis

If the only change I make is to switch to Lucene's StandardAnalyzer, results for both "ribbon" and "purple" are returned.

Is this a bug, or is there some strange intended behavior I'm not aware of?

Thanks
Steve
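One way to narrow this down is to print the tokens the analyzer emits at index time versus query time: the English Snowball stemmer typically reduces "purple" to the stem "purpl", so if the logged query term text:purple was not passed through the same analyzer, it will never match the indexed stem. A sketch against the Lucene 2.x contrib SnowballAnalyzer (the field name is illustrative):

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class AnalyzerDebugger {

    /** Prints each token an analyzer produces for the given text. */
    public static void printTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream stream = analyzer.tokenStream("text", new StringReader(text));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println(token.termText());
        }
        stream.close();
    }

    public static void main(String[] args) throws IOException {
        // Compare this output with the terms appearing in the query log:
        // if the index holds stems but the query holds surface forms
        // (or vice versa), matches like "purple" will silently fail.
        printTokens(new SnowballAnalyzer("English"), "the purple moor");
    }
}
```

If the index-time and query-time outputs differ, the fix is to make Compass use the same Snowball analyzer on both sides rather than switching to StandardAnalyzer.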