Wild carded phrases

2008-05-09 Thread Jon Loken
Hi all, 

First of all, well done to the implementers of Lucene. The performance
is incredible! We get search results within 20-40 ms on an index about
1.5GB. 

I could not find a Lucene mailing-list search engine, something I am a bit
surprised about!

My question is how I can implement wildcarded phrase searches like:
boiler replac*
This should match text such as:
"boiler replacement" and "boiler replacing"
but not:
"boiling replacement" or "boiler user replacement"

I am using the queryParser through Spring-lucene-module. 



I did simply try textToSearch = boiler replac*, but this did not work
as anticipated. I have not analysed it properly, but it seemed to be
interpreted as:
boiler OR replac*


Is there a way to implement this?

Many thanks, 
Jon


BiP Solutions Limited is a company registered in Scotland with Company Number 
SC086146 and VAT number 38303966 and having its registered office at Park 
House, 300 Glasgow Road, Shawfield, Glasgow, G73 1SQ 

This e-mail (and any attachment) is intended only for the attention of the 
addressee(s). Its unauthorised use, disclosure, storage or copying is not 
permitted. If you are not the intended recipient, please destroyall copies and 
inform the sender by return e-mail.
This e-mail (whether you are the sender or the recipient) may be monitored, 
recorded and retained by BiP Solutions Ltd.
E-mail monitoring/ blocking software may be used, and e-mail content may be 
read at any time. You have a responsibility to ensure laws are not broken when 
composing or forwarding e-mails and their contents.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Wild carded phrases

2008-05-09 Thread John Byrne

Hi,

Here's a searchable mailing list archive: 
http://www.gossamer-threads.com/lists/lucene/java-user/


As regards the wildcard phrase queries, here's one way I think you could 
do it, but it's a bit of extra work. If you're using QueryParser, you'd 
have to override the getFieldQuery method to use span queries instead 
of phrase queries.


A phrase query can be implemented as a span query with a slop
factor of 0 (terms must be adjacent). So, once you have the PhraseQuery object, you would:


1. Extract the terms
2. For each one, check if it contains a * or a ?
3. If it does, create a WildcardQuery using that term, and rewrite it 
against the IndexReader (Query.rewrite). This expands the wildcard query into 
all its matching terms.
4. Create an array of SpanTermQuery objects (one SpanTermQuery for each 
term that matched your wildcard); then add that array to a SpanOrQuery.

5. Repeat 2 to 4 for each wildcard term in the phrase.
6. Finally (!), create a SpanNearQuery, adding all the original terms in 
order, but substituting your SpanOrQuerys for the wildcard terms. Use a 
slop of 0, and set the inOrder flag to true.


So, essentially, you'd end up with: (you'll have to excuse me if I 
haven't rendered the span queries correctly as strings here - but this 
should give the general idea...)


spanNear[boiler (spanOr[replacement replacing])]

So it will accept *either* "replacement" or "replacing" adjacent to 
"boiler", which is what you want.


As you can see, it's a bit of work - but if you add this functionality 
to the QueryParser, you can re-use it a lot!
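The steps above might be sketched roughly like this (Lucene 2.x-era APIs; the
helper names and the simplistic BooleanQuery unwrapping are my own assumptions,
and error handling, e.g. for TooManyClauses, is omitted):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class WildcardPhraseHelper {

    /** Step 3-4: expand one wildcard term into a SpanOrQuery over its matches. */
    static SpanQuery expandWildcard(IndexReader reader, Term term) throws IOException {
        // Rewriting a WildcardQuery expands it into its matching terms.
        Query rewritten = new WildcardQuery(term).rewrite(reader);
        List<SpanQuery> spans = new ArrayList<SpanQuery>();
        if (rewritten instanceof BooleanQuery) {
            BooleanClause[] clauses = ((BooleanQuery) rewritten).getClauses();
            for (int i = 0; i < clauses.length; i++) {
                Term t = ((TermQuery) clauses[i].getQuery()).getTerm();
                spans.add(new SpanTermQuery(t));
            }
        } else if (rewritten instanceof TermQuery) { // only a single term matched
            spans.add(new SpanTermQuery(((TermQuery) rewritten).getTerm()));
        }
        return new SpanOrQuery(spans.toArray(new SpanQuery[spans.size()]));
    }

    /** Step 6: build e.g. spanNear([boiler, spanOr([replacement, replacing])], 0, true). */
    static SpanQuery wildcardPhrase(IndexReader reader, String field, String[] words)
            throws IOException {
        SpanQuery[] clauses = new SpanQuery[words.length];
        for (int i = 0; i < words.length; i++) {
            Term t = new Term(field, words[i]);
            clauses[i] = (words[i].indexOf('*') >= 0 || words[i].indexOf('?') >= 0)
                    ? expandWildcard(reader, t)
                    : new SpanTermQuery(t);
        }
        // Slop 0 and inOrder=true gives exact-phrase semantics.
        return new SpanNearQuery(clauses, 0, true);
    }
}
```

With that in place, something like wildcardPhrase(reader, "text", new String[]
{"boiler", "replac*"}) should match "boiler replacement" and "boiler replacing"
but not "boiler user replacement".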


Hope that helps!

-JB











Updating Lucene Index Dynamically

2008-05-09 Thread Aamir.Yaseen
Hi,

I am using Lucene 2.1.0 at the moment and I have a huge amount of data
being indexed.

I am re-indexing my data on a daily basis. Now I would like to index my
data dynamically at any point in time.

I cannot afford to re-index the whole data set due to its size and the
time it requires.

 

How can I update my index dynamically? Any suggestions?

 

Aamir Yaseen 
Senior Java Developer

 

Global DataPoint Ltd
Middlesex House, 34- 42 Cleveland Street 
London W1T 4LB, UK

T +44 (0)20 7323 0323 Ext: 4829

M +44 (0)7951 895299

www.globaldatapoint.com

 


 
This e-mail is confidential and should not be used by anyone who is not the 
original intended recipient. Global DataPoint Limited does not accept liability 
for any statements made which are clearly the sender's own and not expressly 
made on behalf of Global DataPoint Limited. No contracts may be concluded on 
behalf of Global DataPoint Limited by means of e-mail communication. Global 
DataPoint Limited Registered in England and Wales with registered number 
3739752 Registered Office Middlesex House, 34-42 Cleveland Street, London W1T 
4LB



RE: lucene farsi problem

2008-05-09 Thread Steven A Rowe
Hi Esra,

On 05/07/2008 at 11:49 AM, Steven A Rowe wrote:
 At Chris Hostetter's suggestion, I am rewriting the patch
 attached to LUCENE-1279, including the following changes:
 
 - Merged the contents of the CollatingRangeQuery class into
 RangeQuery and RangeFilter
 - Switched the Locale parameter to instead take an instance
 of Collator
 - Modified QueryParser.jj to construct a QueryParser class
 that can accept a range collator and pass it either to
 RangeQuery or through ConstantScoreRangeQuery to RangeFilter.

I have attached the above-described revised patch to LUCENE-1279 - Esra, if you 
get a chance, could you try it out?  The implementation hasn't changed (except 
for the cosmetic changes noted above) -- you'll just be using RangeQuery 
instead of CollatingRangeQuery.

Thanks,
Steve


theoretical maximum score

2008-05-09 Thread Peter Keegan
Is it possible to compute a theoretical maximum score for a given query if
constraints are placed on 'tf' and 'lengthNorm'? If so, scores could be
compared to a 'perfect score' (a feature request from our customers).

Here are some related threads on this:

In this thread:

http://www.nabble.com/Newbie-questions-re%3A-scoring-td4228776.html#a4228776

Hoss writes:

 the only way I can think of to fairly compare scores from queries for
 foo:bar with queries for yak:baz is to normalize them relative a maximum
 possible score across the entire term query space -- but finding that
 maximum is a pretty complicated problem just for simple term queries ...
 when you start talking about more complicated query structures you really
 get messy -- and even then it's only fair as long as the query structures
 are identical, you can never compare the scores from apples and oranges

And in this thread:

http://www.nabble.com/non-relative-scoring-td8956299.html#a8956299

Walt writes:

 A tf.idf engine, like Lucene, might not have a maximum score.
 What if a document contains the word a thousand times?
 A million times?

It seems that if 'tf' is limited to a max value and 'lengthNorm' is a
constant, it might be possible, at least for 'simple' term queries. But Hoss
says that things get messy with complicated queries.

Could someone elaborate a bit? Does the index contain enough info to do this
efficiently?
I realize that score values must be interpreted 'carefully', but I'm seeing
a push to get more leverage from the absolute values, not just the relative
values.

Peter


Re: Updating Lucene Index Dynamically

2008-05-09 Thread Erick Erickson
See the IndexModifier class. This assumes that by "dynamically modify" you
mean changing existing documents.

If all you're doing is adding new documents, you can freely add new docs to
an
existing index. There's a parameter on IndexWriter that determines whether
your index is opened for appending or overwritten.
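For the appending case, that looks roughly like this (Lucene 2.x-era API; the
index path and field name here are illustrative assumptions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        // The third constructor argument is the create flag:
        // false opens the existing index for appending
        // instead of overwriting it.
        IndexWriter writer = new IndexWriter("/path/to/index",
                new StandardAnalyzer(), false);

        Document doc = new Document();
        doc.add(new Field("contents", "newly arrived document text",
                Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.close();
    }
}
```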

If these don't work for you, perhaps you could explain more about how your
data changes so better suggestions can be offered.

Best
Erick

On Fri, May 9, 2008 at 8:05 AM, [EMAIL PROTECTED] wrote:

 [snip]



Using stored fields for scoring

2008-05-09 Thread Paolo Capriotti

Hi all,
I am looking for a way to include a stored (non-indexed) field in the 
computation of scores for a query.
I have tried using a ValueSourceQuery with a ValueSource subclass that 
simply retrieves the document and gets the field, like:

public float floatVal(int doc) {
  byte[] value = reader.document(doc, selector).getBinaryValue("myfield");
  // ... decode the bytes and return the score component
}

but that's too slow, because it ends up doing a lookup for each matching 
document.
Is it possible to use a stored field in a FunctionQuery or 
ValueSourceQuery in an efficient way (i.e. not dependent on the number 
of retrieved documents)?
If the answer is yes, is it possible to update such a value in place 
without removing and reindexing the document?


Thanks in advance.

Paolo Capriotti




Re: Using stored fields for scoring

2008-05-09 Thread Erick Erickson
Well, all things are possible <g>. But I don't think there's a way to get
the field from each document at scoring time efficiently. It looks like
you're already lazy-loading the field, which was going to be my suggestion.

You could get it much faster if you *did* index it (UN_TOKENIZED?) and
went after it with TermDocs/TermEnum.

So what is the nature of the field you're using? Is it possible to build
up the list of doc -> binary-field pairs at, say, startup time and just use a
map or some such?

You could even think about putting all the binary data in your
index in a special document that has a field orthogonal to
all other documents. Essentially, take the map I suggested
earlier and stuff it in a doc with one field (say,
MySpecialMapField). Then read *that* document in
at startup (or even search time) to get your binary field for
scoring.

All this presupposes that your binary field/doc_id map will fit
in memory...
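The startup-time map idea might look something like this, assuming the value is
indexed as an UN_TOKENIZED field ("rank" is a hypothetical field name; note that
Lucene's FieldCache does something very similar for indexed fields):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class RankCache {
    /** Walk every term of the "rank" field once, recording doc -> value. */
    public static Map<Integer, Float> load(IndexReader reader) throws IOException {
        Map<Integer, Float> docToValue = new HashMap<Integer, Float>();
        // Position the enum at the first term of the "rank" field.
        TermEnum terms = reader.terms(new Term("rank", ""));
        TermDocs docs = reader.termDocs();
        try {
            do {
                Term t = terms.term();
                if (t == null || !"rank".equals(t.field())) {
                    break; // ran past the end of the "rank" field
                }
                docs.seek(t);
                while (docs.next()) {
                    docToValue.put(new Integer(docs.doc()), Float.valueOf(t.text()));
                }
            } while (terms.next());
        } finally {
            terms.close();
            docs.close();
        }
        return docToValue;
    }
}
```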

What about index-time boosting? This only does you good
if your binary data above is some sort of importance ranking.
Index-time boosting says something like "this document title
is more important than normal", so this would *automatically*
affect your scoring. You'd have to apply the index-time boosts
selectively to the fields you want.

And if none of this is relevant, could you expand a bit more on what
you're trying to do? What is the nature and purpose of the
field you want to use to influence scoring?

Best
Erick

On Fri, May 9, 2008 at 10:07 AM, Paolo Capriotti [EMAIL PROTECTED]
wrote:

 [snip]




Is this the right way to use Lucene in multithread env?

2008-05-09 Thread wolvernie88

Hi,

Here is how I am using Lucene.

I build the index (from different data sources) during midnight. I build an
FSDirectory. Then I load it into a RAMDirectory for the best performance. When
I built it, I called IndexWriter.optimize() once. 

Once the index is built, I will never update it.

I have a static variable defined as an IndexSearcher. Once I load the
RAMDirectory, I do

newIndexDirectory = new RAMDirectory(fsDirectory);
IndexWriter newWriter = new IndexWriter(newIndexDirectory,
        new StandardAnalyzer(), true);
newWriter.optimize();
newWriter.close();
searcher = new IndexSearcher(newIndexDirectory);

For every new search, I do

QueryParser parser = new QueryParser("field1", new StandardAnalyzer());
Query query = parser.parse(queryString);
Hits hits = searcher.search(query);

Is this the right way? Do I need to close the parser, query or hits?

As I have only one IndexSearcher, will it cause any problem?

I found that using the same query does not always give me the same response
time.

Thanks much.
-- 
View this message in context: 
http://www.nabble.com/Is-this-the-right-way-to-use-Lucene-in-multithread-env--tp17150728p17150728.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Is this the right way to use Lucene in multithread env?

2008-05-09 Thread Otis Gospodnetic
Hi,

No need to close the parser, and it is good to use the same searcher.
I don't understand why you have that IndexWriter there if you are searching...

Also, you may not benefit from explicit loading of the index into RAM.  Try 
without it first.
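A trimmed version of that setup, without the IndexWriter, might look like this
(a sketch; the index path and field name are the poster's/illustrative, and
IndexSearcher can safely be shared across threads for searching):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class SharedSearcher {
    // One searcher shared by all search threads.
    private static IndexSearcher searcher;

    public static void open(String indexPath) throws Exception {
        // Optionally copy the on-disk index into RAM; no IndexWriter
        // is needed just to search.
        RAMDirectory dir = new RAMDirectory(FSDirectory.getDirectory(indexPath));
        searcher = new IndexSearcher(dir);
    }

    public static Hits search(String queryString) throws Exception {
        QueryParser parser = new QueryParser("field1", new StandardAnalyzer());
        Query query = parser.parse(queryString);
        return searcher.search(query);
    }
}
```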


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
 From: wolvernie88 [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Sent: Friday, May 9, 2008 11:38:43 AM
 Subject: Is this the right way to use Lucene in multithread env?
 [snip]





Re: Is this the right way to use Lucene in multithread env?

2008-05-09 Thread wolvernie88

Hi,

I am creating a new IndexWriter to optimize the directory.

I will try using FSDirectory later.



Otis Gospodnetic wrote:
 [snip]





Fwd: Snowball not finding purple

2008-05-09 Thread Stephen Cresswell
For some reason it seems that either Lucene or Snowball has a problem with
the color purple. According to the Snowball experts, the problem is with Lucene.
Can anyone shed any light? Thanks,

Steve

-- Forwarded message --
From: Stephen Cresswell [EMAIL PROTECTED]
Date: 2008/4/22
Subject: Snowball not finding purple
To: [EMAIL PROTECTED]


Hi,

I'm using Compass/Lucene + Snowball/English to search the following text,
which appears in several documents:

"The road was a ribbon of moonlight looping the purple moor"

Searching for the word "ribbon" returns the document, but not the word
"purple":

[945354] compass.DefaultSearchableMethodFactory search defaults: {max=10,
offset=0, reload=false, escape=false}
[945397] search.DefaultSearchMethod query: [+(+(name:ribbon^8.0
firstMessageText:ribbon^0.0 text:ribbon)) +(alias:ALIASConversationALIAS)],
[4] hits, took [2] millis
[956176] compass.DefaultSearchableMethodFactory search defaults: {max=10,
offset=0, reload=false, escape=false}
[956184] search.DefaultSearchMethod query: [+(+(name:purple^8.0
firstMessageText:purple^0.0 text:purple)) +(alias:ALIASConversationALIAS)],
[0] hits, took [1] millis

If the only change I make is to switch to Lucene's StandardAnalyzer, results
for both "ribbon" and "purple" are returned.

Is this a bug, or is there some strange intended behavior I'm not aware of?

Thanks

Steve