Finding minimum and maximum value of a field?

2005-05-31 Thread Kevin Burton
I have an index with a date field.  I want to quickly find the minimum 
and maximum values in the index.


Is there a quick way to do this?  I looked at using TermInfos and 
finding the first one, but how do I find the last?


I also tried the new sort API and the performance was horrible :-/

Any ideas?

Kevin

--


Use Rojo (RSS/Atom aggregator)! - visit http://rojo.com. 
See irc.freenode.net #rojo if you want to chat.


Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html

  Kevin A. Burton, Location - San Francisco, CA
 AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Stemming at Query time

2005-05-31 Thread Paul Libbrecht
You'd only need position increments if you're using a PhraseQuery... 
otherwise positions are largely ignored and you can expand the 
query with an OR.

E.g., I'd expand the query for breath to:

Term(breath)^2 OR (Term(breathes) OR Term(breathe) OR Term(breathing))

I am not sure you can make a phrase query with possible synonyms for the 
phrase constituents; you'd need to OR together queries for each set of 
possible variations (that grows quickly! But do you know many people who 
use long phrase queries?)
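The OR-expansion Paul describes can be sketched with plain string handling, no Lucene dependency. This is a minimal sketch: the class name, method, and the variants map are all illustrative; a real reverse stemmer would generate the variants.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Sketch of query-time stem expansion: boost the literal term and
// OR in its inflected variants. VARIANTS is illustrative stand-in
// data for what a reverse stemmer would produce.
public class StemExpander {
    static final Map<String, List<String>> VARIANTS =
        Map.of("breath", List.of("breathes", "breathe", "breathing"));

    static String expand(String term) {
        List<String> vars = VARIANTS.get(term);
        if (vars == null) return term;          // nothing to expand
        StringJoiner or = new StringJoiner(" OR ");
        or.add(term + "^2");                    // exact form weighted higher
        vars.forEach(or::add);
        return "(" + or + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("breath"));
        // (breath^2 OR breathes OR breathe OR breathing)
    }
}
```

The expanded string could then be handed to the query parser, or each variant turned directly into a TermQuery inside a BooleanQuery.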


paul

Le 30 mai 05, à 18:54, Andrew Boyd a écrit :


Hi All,
  Now that the QueryParser knows about position increments, has anyone 
used this to do stemming at query time
and not at indexing time?  I suppose one would need a reverse stemmer. 
 Given the query breath it would need to inject breathe, breathes, 
breathing, etc.


One benefit is that if you ever wanted to change your stemming 
algorithm you would not have to re-index.

Also your index would be closer to the actual documents.

Comments?






Re: Indexing multiple keywords in one field?

2005-05-31 Thread Paul Libbrecht


Le 30 mai 05, à 22:13, Doug Hughes a écrit :
Ok, so more than one keyword can be stored in a keyword field.  
Interesting!


Yes, yes, yes!! You can do:

doc.add("link", "xx");
doc.add("link", "yy");

and matches will match any of them!
I found this in the book and not in the javadoc, and I'd recommend adding 
it to the javadoc of the add method; it's a non-obvious goodness which 
suits all forms of scalability!


paul







Re: Indexing multiple keywords in one field?

2005-05-31 Thread Erik Hatcher


On May 31, 2005, at 4:06 AM, Paul Libbrecht wrote:



Le 30 mai 05, à 22:13, Doug Hughes a écrit :

Ok, so more than one keyword can be stored in a keyword field.   
Interesting!




Yes, yes, yes!! You can do:

doc.add("link", "xx");
doc.add("link", "yy");


Well, that's not quite correct API, but your point is accurate :)


and matches will match any of them!
I found this in the book and not in the javadoc, and I'd recommend  
adding it to the javadoc of the add method; it's a non-obvious  
goodness which suits all forms of scalability!


Fields are made up of terms.  A Field.Keyword makes a single term for  
a field.  Adding multiple fields with the same name works even if it's  
not Field.Keyword - it simply adds the term(s) to the field.


Good point about documentation - it would be worth noting this  
explicitly.


Erik





Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Tony Schwartz
The only way I see to do this is to get a TermEnum for that field and grab 
the first term, then iterate until you find the last one.  This is similar 
in behavior to the TermEnum.skipTo method.  A better solution would be to 
record the minimum and maximum dates as you index them: each time you 
insert a new date, update the min/max if needed.  This data would reside 
outside the index, of course.
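Tony's index-time suggestion amounts to keeping a tiny bit of state next to the index. A minimal sketch, with illustrative names; the dates are the sortable strings a Lucene date field would hold:

```java
// Sketch of tracking min/max alongside the index as documents are
// added, instead of scanning terms at query time. Dates are sortable
// yyyyMMdd strings, so String.compareTo orders them chronologically.
public class DateRangeTracker {
    private String min, max;

    void onInsert(String date) {
        if (min == null || date.compareTo(min) < 0) min = date;
        if (max == null || date.compareTo(max) > 0) max = date;
    }

    String min() { return min; }
    String max() { return max; }

    public static void main(String[] args) {
        DateRangeTracker t = new DateRangeTracker();
        t.onInsert("20050531");
        t.onInsert("20040101");
        t.onInsert("20051225");
        System.out.println(t.min() + " .. " + t.max());
        // 20040101 .. 20051225
    }
}
```

The tracked values would need to be persisted (a side file, a database row) and rebuilt if documents carrying the extreme dates are deleted.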


Tony Schwartz
[EMAIL PROTECTED]
What we need is more cowbell.




Re: managing docids for ParallelReader (was Augmenting an existing index)

2005-05-31 Thread Doug Cutting

Matt Quail wrote:

I have a similar problem, for which ParallelReader looks like a good
solution -- except for the problem of creating a set of indices with
matching document numbers.



I have wondered about this as well. Are there any *sure fire* ways of  
creating (and updating) two indices so that doc numbers in one index  
deliberately correspond to doc numbers in the other index?


If you add the documents in the same order to both indexes and perform 
the same deletions on both indexes then they'll have the same numbers.


If this is not convenient, then you could add an id field to all 
documents in the primary index.  Then create (or re-create) the 
secondary index by iterating through the values in a FieldCache of this 
id field.


ParallelReader was not really designed to support incremental updates of 
fields, but rather to accelerate batch updates.  For incremental 
updates you're probably better served by updating a single index.


One could define an ACL IndexReader subclass that generates TermDoc 
lists on the fly by looking in an external database.  This would require 
a mapping between Lucene document ids and external document IDs.  A 
FieldCache, as described above, could serve that purpose.
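The mapping Doug mentions can be sketched in plain Java: a FieldCache on an "id" field is essentially an array of external ids ordered by Lucene document number, which inverts into a lookup table. Class and method names here are illustrative, and the ids array is stand-in data:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: invert a FieldCache-like view (external id per Lucene doc
// number) into a map from external document ID back to doc number.
public class IdMapper {
    static Map<String, Integer> build(String[] idByDocNumber) {
        Map<String, Integer> extToDoc = new HashMap<>();
        for (int doc = 0; doc < idByDocNumber.length; doc++) {
            extToDoc.put(idByDocNumber[doc], doc);  // doc is the Lucene doc id
        }
        return extToDoc;
    }

    public static void main(String[] args) {
        String[] cache = {"doc-a", "doc-b", "doc-c"};  // FieldCache-like view
        Map<String, Integer> map = build(cache);
        System.out.println(map.get("doc-b")); // 1
    }
}
```

Note the map must be rebuilt whenever the reader is reopened, since Lucene doc numbers change after merges and deletions.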


Doug




Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Chris Lamprecht
Lucene rewrites RangeQueries into a BooleanQuery containing a bunch of
OR'd terms.  If you have too many terms (dates in your case), you will
run into a TooManyClauses exception.  The default limit is 1024; you can
raise it with BooleanQuery.setMaxClauseCount().

On 5/31/05, Kevin Burton [EMAIL PROTECTED] wrote:
 Andrew Boyd wrote:
 
 How about using range query?
 
 private Term begin, end;
 
 begin = new Term("dateField",
 DateTools.dateToString(Date.valueOf(backInTimeStringDate)));
 end = new Term("dateField",
 DateTools.dateToString(Date.valueOf(farFutureStringDate)));
 
 RangeQuery query = new RangeQuery(begin, end, true);
 
 IndexSearcher searcher = new IndexSearcher(directory);
 
 Hits hits = searcher.search(query);
 
 Document minDoc = hits.doc(0);
 Document maxDoc = hits.doc(hits.length()-1);
 
 String minDateString = minDoc.get("dateField");
 String maxDateString = maxDoc.get("dateField");
 
 
 
 This certainly is an interesting solution.  How would Lucene score this
 result set?  The first and last hits will depend on the score...
 
 I  guess I can build up a quick test
 
 Kevin
 





Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Kevin Burton

Andrew Boyd wrote:


How about using range query?

private Term begin, end;

begin = new Term("dateField",
DateTools.dateToString(Date.valueOf(backInTimeStringDate)));
end = new Term("dateField",
DateTools.dateToString(Date.valueOf(farFutureStringDate)));

 

Ha.. crap.  That won't work either.  We have too many values and I get 
the dreaded:


Exception in thread "main" 
org.apache.lucene.search.BooleanQuery$TooManyClauses


Fun.




Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Chris Hostetter

Have you tried the suggestion I made regarding FieldCache in the first
thread where you asked this question?

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/[EMAIL 
PROTECTED]







-Hoss





Re: Adding to the termFreqVector

2005-05-31 Thread Grant Ingersoll
Is your intent to persist the changed vector somehow, or just to use it in
your application for the immediate search?

TermFreqVector is an interface, so if you aren't persisting, I would
write a wrapper class around the one returned by Lucene that has
add/set methods for manipulating the underlying vector, and pass
that around in your application.  The other option is to get the source
and modify the TermFreqVector for your needs.

Persistence is a bit harder, but would probably involve manipulating
the document and then re-indexing it, adding some dummy terms so that
its new vector has the updated frequencies.

Is that what you are looking for?
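The wrapper idea can be sketched without Lucene: copy the terms and frequencies a TermFreqVector exposes into a mutable map and add on top of that. Class and method names below are illustrative, not Lucene API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an application-side wrapper: hold the term frequencies
// read from a TermFreqVector (its getTerms()/getTermFrequencies()
// arrays) in a mutable map and expose an add method, without touching
// the index itself.
public class MutableTermFreqs {
    private final Map<String, Integer> freqs = new LinkedHashMap<>();

    MutableTermFreqs(String[] terms, int[] termFrequencies) {
        for (int i = 0; i < terms.length; i++) {
            freqs.put(terms[i], termFrequencies[i]);
        }
    }

    void add(String term, int count) {
        freqs.merge(term, count, Integer::sum);  // bump or create
    }

    int freq(String term) { return freqs.getOrDefault(term, 0); }

    public static void main(String[] args) {
        MutableTermFreqs v = new MutableTermFreqs(
            new String[] {"red", "green", "blue"}, new int[] {1, 1, 1});
        v.add("red", 1);
        v.add("yellow", 1);
        System.out.println(v.freq("red") + " " + v.freq("yellow")); // 2 1
    }
}
```

As Grant says, this only changes what your application sees; persisting the new counts still means re-indexing the document.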

 [EMAIL PROTECTED] 5/30/2005 12:37:54 PM 

How would one go about adding additional terms to a field which is not
stored literally, but instead has a termFreqVector?  For example:

If DocumentA was indexed originally with:
myTermField: red green blue

the termFreqVector would look like:
   freq {myTermField: red/1, green/1, blue/1}

Now, I'd like to add some more terms (red, yellow) and I'd like the 
termFreqVector to look like this:
   freq {myTermField: red/2, green/1, blue/1, yellow/1}

It would seem like there would be a convenient way of accomplishing this,
but I must be missing something.

Any advice would be greatly appreciated!





Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi,

Interesting topic. I thought about this as well. I wanted to index
Chinese text mixed with English, i.e., I want to treat the English text
inside Chinese text as English tokens rather than as Chinese tokens.

Right now I think I may have to write a special analyzer that takes
the text input and checks whether each character is an ASCII char; if it
is, assemble the run of ASCII characters into one token, and if not,
make the character a Chinese word token.

So, bottom line: just one analyzer for all the text, with the
if/else logic inside the analyzer.
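The branching Jian describes can be sketched as plain tokenization logic; a real Analyzer would wrap this in a TokenStream. The class name is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the mixed-script analyzer logic: runs of ASCII letters or
// digits become one lowercased English token; every other non-space
// character (e.g. a Chinese character) becomes its own token.
public class MixedTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder ascii = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 128 && Character.isLetterOrDigit(c)) {
                ascii.append(Character.toLowerCase(c));   // accumulate English run
            } else {
                if (ascii.length() > 0) {                  // flush pending English token
                    tokens.add(ascii.toString());
                    ascii.setLength(0);
                }
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c));         // one CJK char = one token
                }
            }
        }
        if (ascii.length() > 0) tokens.add(ascii.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Lucene搜索")); // [lucene, 搜, 索]
    }
}
```

This single-character treatment of Chinese is also roughly what Lucene's StandardAnalyzer does for CJK input, as Erik notes later in the thread.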

I would like to learn more thoughts about this!

Thanks,

Jian

On 5/31/05, Tansley, Robert [EMAIL PROTECTED] wrote:
 Hi all,
 
 The DSpace (www.dspace.org) system currently uses Lucene to index metadata
 (Dublin Core standard) and extracted full-text content of documents
 stored in it.  Now that the system is being used globally, it needs to
 support multi-language indexing.
 
 I've looked through the mailing list archives etc. and it seems it's
 easy to plug in analyzers for different languages.
 
 What if we're trying to index multiple languages in the same site?  Is
 it best to have:
 
 1/ one index for all languages
 2/ one index for all languages, with an extra language field so searches
 can be constrained to a particular language
 3/ separate indices for each language?
 
 I don't fully understand the consequences of 1/ in terms of performance,
 but I can see that false hits could turn up where one word appears
 in different languages (stemming could increase the chances of this).
 Also, some languages' analyzers are quite dramatically different (e.g.
 the Chinese one, which just treats every character as a separate
 token/word).
 
 On the other hand, if people are searching for proper nouns in metadata
 (e.g. DSpace) it may be advantageous to search all languages at once.
 
 
 I'm also not sure of the storage and performance consequences of 2/.
 
 Approach 3/ seems like it might be the most complex from an
 implementation/code point of view.
 
 Does anyone have any thoughts or recommendations on this?
 
 Many thanks,
 
  Robert Tansley / Digital Media Systems Programme / HP Labs
   http://www.hpl.hp.com/personal/Robert_Tansley/
 



Re: Adding to the termFreqVector

2005-05-31 Thread Ryan Skow

Adding new terms and re-indexing the document is the desired behavior.

One (non-scalable) solution would be to parse the toString of the
termFreqVector (freq {myTermField: red/2, green/1, blue/1}) and create a
new string representation of the expanded terms: (red red green blue).

This obviously isn't a good solution.  Finding a way to simply do a
document.addTerm("red") and then re-index would be ideal.






Re: Indexing multiple languages

2005-05-31 Thread Erik Hatcher
Jian - have you tried Lucene's StandardAnalyzer with Chinese?  It  
will keep English as-is (removing stop words, lowercasing, and such)  
and will also separate CJK characters into individual tokens.


Erik





Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi, Erik,

Thanks for the info. 

No, I haven't tried it yet. I will give it a try and maybe put
a Chinese/English text search demo online.

Currently I use Lucene as the indexing engine for a Velocity mailing
list search. I have a demo at www.jhsystems.net.

It is yet another mailing list search for Velocity, but I combined
date search with full-text search.

I used Lucene only for indexing the textual content, and combined
database search with Lucene search when returning the results.

The other interesting thought I have is that it might be possible to use
Lucene's merge-segments mechanism to write a simple Java-based file
system, which of course would not require a constant compaction operation.
The file system could be based on one file only, where segments are
just parts of the big file. It might be really efficient for
adding/deleting objects all the time.

Lastly, any comments on the www.jhsystems.net Velocity search are welcome.

Thanks,

Jian
www.jhsystems.net
