SegmentReader using too much memory?

2006-12-11 Thread Eric Jain
I've noticed that after stress-testing my application (uses Lucene 2.0) for 
a while, I have almost 200 MB of byte[]s hanging around, the top two 
culprits being:


24 x SegmentReader.Norm.bytes = 112 MB
 2 x SegmentReader.ones       =  16 MB

The second one isn't a big deal, but I wonder what's the explanation for 
the first one?





Re: SegmentReader using too much memory?

2006-12-11 Thread Eric Jain

Yonik Seeley wrote:

On 12/11/06, Eric Jain [EMAIL PROTECTED] wrote:
I've noticed that after stress-testing my application (uses Lucene 2.0) for
a while, I have almost 200 MB of byte[]s hanging around, the top two
culprits being:

24 x SegmentReader.Norm.bytes = 112 MB
 2 x SegmentReader.ones       =  16 MB


Each indexed field has a norm array that is the product of its
index-time boost and the length normalization factor.  If you don't
need either, you can omit the norms (as it looks like you already have
on some fields, given that "ones" is the fake norms used in place of
the real norms).


Thanks for the explanation.

Not sure where the fields without norms come from: I use neither 
Field.setOmitNorms nor Index.NO_NORMS anywhere!


I do want to use document boosting... Is that independent of field 
boosting? The length normalization, on the other hand, may not be necessary.
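
For reference, a minimal sketch of how norms can be omitted at indexing time
(assuming the Lucene 2.0 Field API; note that the document boost is folded
into the same norm byte as the field boost and length normalization, so
omitting norms for a field also discards document boosts for that field):

  Document doc = new Document();
  // NO_NORMS indexes the value untokenized and without norms, saving
  // one byte per document per field in every reader that is opened.
  doc.add(new Field("category", "mosquito",
      Field.Store.NO, Field.Index.NO_NORMS));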





Re: SegmentReader using too much memory?

2006-12-11 Thread Eric Jain

Yonik Seeley wrote:

It's read on demand, per indexed field.
So assuming your index is optimized (a single segment), then it
increases by one byte[] each time you search on a new field.


OK, makes sense then. Thanks!
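
(Arithmetic check, for the archive: norms take one byte per document per
indexed field, so 24 arrays totaling 112 MB works out to roughly 4.7 MB --
i.e. roughly 4.7 million documents -- per searched field.)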




Re: Avoiding ParseExceptions

2006-06-06 Thread Eric Jain

Chris Nokleberg wrote:

I am using the QueryParser with a StandardAnalyzer. I would like to avoid
or auto-correct anything that would lead to a ParseException. For example,
I don't think you can get a parse exception from Google--even if you omit
a closing quote it looks like it just closes it for you (please correct me
if you know otherwise).


It would definitely be nice if the QueryParser had both a strict and a 
lenient mode. If you used the latter, it would of course be wise to reflect 
the actually executed query back to the user, so it's clear what's going on.
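
In the meantime, a simple fallback sketch (assuming QueryParser.escape from
Lucene 1.9+, and that treating malformed input as plain terms is acceptable):

  public static Query parseLenient(QueryParser parser, String input)
      throws ParseException {
    try {
      return parser.parse(input);
    } catch (ParseException e) {
      // Retry with all query syntax characters escaped; the input is
      // then interpreted as plain terms rather than operators.
      return parser.parse(QueryParser.escape(input));
    }
  }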





Re: IndexUpdateListener

2006-05-15 Thread Eric Jain

Chris Hostetter wrote:

The only useful callback/listener abstractions I can think of are when you
want to know if someone has finished with a set of changes -- whether that
change is adding one document, deleting one document, or adding/deleting a
whole bunch of documents isn't really relevant, you still want to know
that a complete set has been modified, so you aren't constantly flushing
caches or reopening IndexReaders every time a single document is added.


Speaking of listeners: It would be great if there were a way to know when 
optimize() changes a document ID. Storing document IDs externally is the 
only way to merge Lucene queries with queries in a relational database 
efficiently (as far as I know), but the inability to track document ID 
changes complicates things a bit...
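
The usual workaround, for reference, is to store a stable application key in
each document and map back to it at search time (a sketch; the "pk" field
name and variables are illustrative):

  // At indexing time: store the relational primary key with the document.
  Document doc = new Document();
  doc.add(new Field("pk", externalId, Field.Store.YES, Field.Index.UN_TOKENIZED));

  // At search time: resolve Lucene's transient doc IDs to stable keys.
  Hits hits = searcher.search(query);
  for (int i = 0; i < hits.length(); i++) {
    String pk = hits.doc(i).get("pk");
    // join with the relational database on pk...
  }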





Re: Preventing phrase queries from matching across lines

2006-04-29 Thread Eric Jain

Erik Hatcher wrote:

On Apr 28, 2006, at 5:35 AM, Eric Jain wrote:
What is the best way to prevent a phrase query such as "eggs white" 
from matching "fried eggs\nwhite snow"?


Two possibilities I have thought about:

1. Replace all line breaks with a special string, e.g. "newline".
2. Have an analyzer somehow increment the position of a term for each 
line break it encounters.


The latter seems a bit more complicated to implement, but it would also be 
more efficient, right? Or are there better options?


#2 shouldn't be too hard to implement, but you'll need to catch new 
lines in the initial tokenizer.  I'm not sure about the efficiency; both 
options would require a tokenizer that detects new lines and either 
injects a new term or sets a flag such that the next term gets a 
position increment bump.


Thanks, #2 turned out to be easier to implement than expected. I should 
have clarified that the efficiency I was concerned about was not the 
efficiency of the tokenization, but the impact of having all those 
additional "newline" term positions in the index.
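
For the archive, a minimal sketch of approach #2 as a TokenFilter (this
assumes the tokenizer emits a "\n" marker token for line breaks; the gap of
100 is arbitrary):

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  public class NewlineGapFilter extends TokenFilter {
    private boolean pendingGap = false;

    public NewlineGapFilter(TokenStream in) {
      super(in);
    }

    public Token next() throws IOException {
      Token t;
      while ((t = input.next()) != null) {
        if ("\n".equals(t.termText())) {
          pendingGap = true;  // swallow the line break marker
          continue;
        }
        if (pendingGap) {
          // The gap keeps phrase queries from matching across the break.
          t.setPositionIncrement(t.getPositionIncrement() + 100);
          pendingGap = false;
        }
        return t;
      }
      return null;
    }
  }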





Preventing phrase queries from matching across lines

2006-04-28 Thread Eric Jain
What is the best way to prevent a phrase query such as "eggs white" 
from matching "fried eggs\nwhite snow"?


Two possibilities I have thought about:

1. Replace all line breaks with a special string, e.g. "newline".
2. Have an analyzer somehow increment the position of a term for each line 
break it encounters.


The latter seems a bit more complicated to implement, but it would also be more 
efficient, right? Or are there better options?





Re: Lucene Performance Issues

2006-03-28 Thread Eric Jain

thomasg wrote:

1) By default, Lucene only indexes the first 10,000 words from each
document. When increasing this default, out-of-memory errors can occur. This
implies that documents, or large sections thereof, are loaded into memory.
ISYS has a very small memory footprint which is affected neither by document
size nor by the number of documents.


As far as I know, documents do indeed have to be built in memory prior to 
indexing. But this shouldn't be a problem unless you have only a few 
megabytes of memory, or documents that are hundreds of megabytes in 
size -- and such large documents should probably be split up, anyway.




2) Lucene appears to be slow at indexing, at least by ISYS' standards.
Published performance benchmarks seem to vary between almost acceptable,
down to very poor. ISYS' file readers are already optimized for the fastest
text extraction possible.


Indexing performance is my main concern with Lucene, though there are 
several parameters that can be tuned and I haven't exhausted all of them yet...


Currently I am using:

  // Trade memory for speed: merge segments less often and buffer more
  // documents in RAM before flushing a new segment to disk.
  writer.setMergeFactor(100);
  writer.setMaxBufferedDocs(100);
  // Skip the compound file format: faster, at the cost of more open files.
  writer.setUseCompoundFile(false);

This allows me to build a 3GB index with about 3M documents in 6h on a 
2x2GHz Intel Xeon machine with 1GB of memory and a reasonably fast hard 
disk. There is some other stuff going on besides the indexing, but the 
indexing does seem to take up the greatest amount of time.


Note that Lucene also supports incremental updates.



3) The Lucene documentation suggests it can be slow at searching and can get
slower and slower the larger your indexes get. The tipping point is where
the index size exceeds the amount of free memory in your machine. This also
implies that whole indexes, or large portions of them, are loaded into
memory. The bigger the index, the more powerful the machine required. ISYS'
search speed is always proportional to the size of the result set. Index
size does not materially affect search speed and the index is never loaded
into memory. It also appears that Lucene requires hands-on tuning to keep
its search speed acceptable. ISYS' indexes are self-managing and do not
require any maintenance to keep them searchable at full speed.


Queries on the index mentioned above return results within a few 
milliseconds, with less than 256 MB used by the VM, though some complex 
queries that contain a lot of frequent terms may take up to several 
seconds. I'm not sure how Lucene's search performance can be tuned, but I 
haven't bothered to find out, as it hasn't been a bottleneck so far...





Re: Appending * to each search term

2006-03-17 Thread Eric Jain

Florian Hanke wrote:
I'd like to append an * (create a WildcardQuery) to each search term in 
a query, such that a query that is entered as e.g. term1 AND term2 is 
modified (effectively) to term1* AND term2*.
Parsing the search string is not very elegant (of course). I'm thinking 
that overriding QueryParser#get(Boolean etc.)Query is the way to go, the 
way it's designed. But still, extracting terms and injecting them back 
in while operating on specific Query classes does not seem the way to go.

Can anyone perhaps suggest a nice alternative?


Perhaps you could subclass the QueryParser and override the getFieldQuery 
method:


protected Query getFieldQuery(String field, String queryText) {
  // Note: this bypasses the analyzer and turns every term into a prefix query.
  return new PrefixQuery(new Term(field, queryText));
}
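
For example, as an anonymous subclass (a sketch; the field name and analyzer
are placeholders):

  QueryParser parser = new QueryParser("content", new StandardAnalyzer()) {
    protected Query getFieldQuery(String field, String queryText) {
      return new PrefixQuery(new Term(field, queryText));
    }
  };
  // "term1 AND term2" now behaves (effectively) like "term1* AND term2*".
  Query query = parser.parse("term1 AND term2");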




Re: speed

2006-03-10 Thread Eric Jain

[EMAIL PROTECTED] wrote:

When I run the search I get count = 37.
Maybe I'm doing something incorrectly?


I assume you ran both variants repeatedly, in the same process (to account 
for start-up costs etc.)?





Re: Get only count

2006-03-07 Thread Eric Jain

Anton Potehin wrote:

Currently I create a new search to get the number of results. For example:

IndexSearcher is = ...

Query q = ...

numberOfResults = is.search(q).length();

Can I accelerate this example? And how?


Perhaps something like:

class CountingHitCollector
  extends HitCollector  // HitCollector is an abstract class, not an interface
{
  public int count;

  public void collect(int doc, float score)
  {
    if (score > 0.0f)  // count matching documents without collecting them
      ++count;
  }
}

...

CountingHitCollector c = new CountingHitCollector();
searcher.search(query, c);
int hits = c.count;





Re: sub search

2006-03-07 Thread Eric Jain

Anton Potehin wrote:
After that, I don't want to run a new search, 

 I want to search among the found results...

Perhaps something like this would work:

final BitSet results = toBitSet(hits);  // hits from the first search
searcher.search(newQuery, new Filter() {
  public BitSet bits(IndexReader reader) {
    return results;
  }
});
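
where toBitSet is a small helper along these lines (a sketch):

  static BitSet toBitSet(Hits hits) throws IOException {
    BitSet bits = new BitSet();
    for (int i = 0; i < hits.length(); i++)
      bits.set(hits.id(i));  // mark each matching document number
    return bits;
  }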





Re: MultiPhraseQuery

2006-03-06 Thread Eric Jain

Daniel Naber wrote:

Please try to add this to MultiPhraseQuery and let us know if it helps:

  public List getTerms() {
return termArrays;
  }


That is indeed all I need (the list wouldn't have to be mutable though). 
Any chance this could be committed?
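
With that accessor, the copy function becomes straightforward (a sketch,
assuming getTerms() returns the List of Term[] as in the patch above):

  MultiPhraseQuery copyToField(MultiPhraseQuery query, String newField) {
    MultiPhraseQuery copy = new MultiPhraseQuery();
    for (Iterator i = query.getTerms().iterator(); i.hasNext(); ) {
      Term[] terms = (Term[]) i.next();
      Term[] moved = new Term[terms.length];
      for (int j = 0; j < terms.length; j++)
        moved[j] = new Term(newField, terms[j].text());  // same text, new field
      copy.add(moved);
    }
    return copy;
  }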


Incidentally, it would be helpful if the PrecedenceQueryParser instantiated 
MultiPhraseQueries via a call to an (overridable) getMultiPhraseQuery method.






MultiPhraseQuery

2006-03-05 Thread Eric Jain
I need to write a function that copies a MultiPhraseQuery and changes the 
field the query applies to. Unfortunately the API allows access to neither 
the contained terms nor the field! The other query classes I have so far 
dealt with all seem to allow access to the contained query terms...





QueryParser dropping constraints?

2006-03-05 Thread Eric Jain
I've noticed that while both the default QueryParser and the 
PrecedenceQueryParser refuse to parse


  foo bar) baz

they both seem to interpret

  foo bar( baz

as

  foo bar

Bug or feature?

In any case, it would be great if there were a strict mode, and a more 
lenient mode where incorrect syntax is ignored (as far as possible).





Re: Indexing performance with Lucene 1.9

2006-03-01 Thread Eric Jain

Eric Jain wrote:
I'll rerun the indexing 
procedure with the old version overnight, just to be sure.


Just to confirm: There no longer seems to be any difference in indexing 
performance between the nightly build and 1.4.3.





Re: Solr, the Lucene based Search Server

2006-03-01 Thread Eric Jain

Yonik Seeley wrote:

Solr is a new open-source search server that's based on Lucene, and
has XML/HTTP interfaces for updating and querying, declarative
specification of analyzers and field types via a schema, extensive
caching, replication, and a web admin interface.


Just had a look, quite impressive.

I noticed that you have a WordDelimiterFilter; any chance that this will be 
contributed back to Lucene? This class is really useful! (In fact I was 
just trying to write something similar myself...)





Re: Indexing performance with Lucene 1.9

2006-02-28 Thread Eric Jain

Daniel Naber wrote:
A fix has now been committed to trunk in SVN; it should be part of the next 
1.9 release.


Performance seems to have recovered, more or less, thanks!





Re: Indexing performance with Lucene 1.9

2006-02-28 Thread Eric Jain

Otis Gospodnetic wrote:
Regarding the performance fix - if you can be more precise (is it really
just "more or less" or is it as good as before), that would be great
for those of us itching to use 1.9.

Yes, I can confirm that performance differs by no more than 3.1 fraggles.

;-)





Re: Frequency of phrase

2006-02-25 Thread Eric Jain

Doug Cutting wrote:
If you use a span query then you can get the actual number of phrase 
instances.


Thanks, good to know!

In this case (needing to suggest phrase queries to the user) I've now settled 
on dividing the number of hits for a potential phrase by the number of 
documents that contain all terms in the phrase. This seems to be fast and to 
work well...
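
For reference, a minimal sketch of the span query approach (counting actual
phrase occurrences rather than matching documents; the field and terms are
placeholders):

  SpanQuery phrase = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term("text", "anopheles")),
      new SpanTermQuery(new Term("text", "malaria"))
    }, 0, true);  // slop 0, in order: an exact phrase

  Spans spans = phrase.getSpans(reader);
  int occurrences = 0;
  while (spans.next())
    ++occurrences;  // one span per phrase instance, not per document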






Re: Frequency of phrase

2006-02-24 Thread Eric Jain

Dave Kor wrote:

Not sure if this is what you want, but what I have done is to issue
exact phrase queries to Lucene and counted the number of hits found.


This gives you the number of documents containing the phrase, rather than 
the number of occurrences of the phrase itself, but that may in fact be 
good enough...





Frequency of phrase

2006-02-23 Thread Eric Jain
This is somewhat related to a question sent to this list a while ago: Is 
there an efficient way to count the number of occurrences of a phrase (not 
term) in an index?





QueryPrinter?

2006-02-18 Thread Eric Jain
I need to parse a query string, modify it a bit, and then output the 
modified query string. This works quite well with query.toString(), except 
that when I parse the query I set DEFAULT_OPERATOR_AND, while the output of 
BooleanQuery.toString() assumes DEFAULT_OPERATOR_OR... It would be great if 
this behavior could be changed through a static field, or perhaps someone 
has already written some kind of QueryPrinter that is a bit more flexible?





Re: Generating phrase queries from term queries

2006-01-12 Thread Eric Jain

Chris Hostetter wrote:

(Assuming *I* understand it) what he's talking about is the ability for
his search GUI to display suggested phrase searches you may want to try,
which consist of the words you just typed in, grouped into phrases.


Yes, that's precisely what I am talking about. Sorry for being unclear.



Presumably, if multiple phrases in the source data can be found in the
permutations of the search words, the least common are the ones you'd want
to suggest -- which makes the problem a sort of SIP problem (i.e. given an
extremely limited set of words, find the statistically improbable phrases
in the corpus made using only subsets of those words).


I'd already be happy to get *any* phrases :-)

If the phrases could be ranked, I might prefer to pick the *most frequent* 
phrases. For example:


  anopheles anopheles malaria

("anopheles anopheles" is the Latin name for the common mosquito)

I'd like to be able to suggest quoting this name to eliminate all the other 
mosquito species that also contain "anopheles" in their name.


There are lots of documents with "anopheles anopheles". There may also be a 
document or two where "anopheles" happens to appear next to "malaria", but 
those are less interesting here.





Re: Generating phrase queries from term queries

2006-01-11 Thread Eric Jain

Paul Elschot wrote:

One way that might be better is to provide your own Scorer
that works on the term positions of the three or more terms.
This would be better for performance because it only uses one
term positions object per query term (a, b, and c here).


I'm trying to extract the actual phrases, rather than score documents 
higher when the terms appear in the same order (though that would seem 
like a good idea, too).


The idea is that once I have the phrases, I can suggest something like 
"show only matches where a and b appear next to each other". Not terribly 
important, but nice to have if there were a simple and efficient way to 
accomplish this...





Re: Scoring by number of terms in field

2006-01-10 Thread Eric Jain

Paul Elschot wrote:

In case you prefer to use the maximum score over the clauses you
can use the DisjunctionMaxQuery from the development version.


Yes, that may help! I'll need to have a look...
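
For anyone searching the archives, a minimal sketch of what that might look
like (assuming DisjunctionMaxQuery from the development version at the time;
a tie-breaker of 0.0 means only the best-matching field's score counts):

  DisjunctionMaxQuery query = new DisjunctionMaxQuery(0.0f);
  query.add(new TermQuery(new Term("title", "europe")));
  query.add(new TermQuery(new Term("subtitle", "europe")));
  Hits hits = searcher.search(query);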




Generating phrase queries from term queries

2006-01-10 Thread Eric Jain
Is there an efficient way to determine if two or more terms frequently 
appear next to each other in sequence? For a query like:


a b c

one or more of the following suggestions could be generated:

"a b" c
a "b c"
"a b c"

I could of course just run a search with all possible combinations, but 
perhaps there is a better way?
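
The brute-force approach would be, for each candidate grouping (a sketch; the
field name is a placeholder):

  PhraseQuery candidate = new PhraseQuery();
  candidate.add(new Term("text", "a"));
  candidate.add(new Term("text", "b"));  // the exact phrase "a b"
  int docCount = searcher.search(candidate).length();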





Scoring by number of terms in field

2006-01-09 Thread Eric Jain
Lucene seems to prefer matches in shorter documents. Is it possible to 
influence the scoring mechanism to have matches in shorter fields score 
higher instead? (See the sketch after the examples below.)


For example, a query for "europe" should rank:

1. title:Europe
2. title:History of Europe
3. title:Travel in Europe, Middle East and Africa
4. subtitle:Fairy Tales from Europe
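
One way to push the scoring in that direction is a custom Similarity (a
hedged sketch; the default lengthNorm is 1/sqrt(numTokens), so a steeper
curve favors short fields like titles more strongly; norms are written at
indexing time, so changing this requires reindexing):

  public class ShortFieldSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTokens) {
      // Steeper than the default 1/sqrt(numTokens).
      return numTokens > 0 ? 1.0f / numTokens : 1.0f;
    }
  }

  // Must be set both at indexing time and at search time:
  writer.setSimilarity(new ShortFieldSimilarity());
  searcher.setSimilarity(new ShortFieldSimilarity());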




Re: Scoring by number of terms in field

2006-01-09 Thread Eric Jain

Paul Elschot wrote:

For example, a query for "europe" should rank:

1. title:Europe
2. title:History of Europe
3. title:Travel in Europe, Middle East and Africa
4. subtitle:Fairy Tales from Europe


Perhaps with this query (assuming the default implicit OR):

title:europe subtitle:europe^0.5 body:europe


This will ensure that match 4 appears at the end, but as far as I can see 
this won't help with getting matches 1-3 ordered correctly? Note that match 
1, for example, may have a description field that contains a lot of terms, 
but no mention of the query term.

