QueryParser: open ended range queries

2005-04-05 Thread Yonik Seeley
Was there any later thread on the QueryParser supporting open ended range queries after this: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg07973.html Just curious. I plan on overriding the current getRangeQuery() anyway since it currently doesn't run the endpoints through the

Re: QueryParser: open ended range queries

2005-04-05 Thread Yonik Seeley
On Apr 5, 2005 3:43 PM, Erik Hatcher [EMAIL PROTECTED] wrote: On Apr 5, 2005, at 2:49 PM, Yonik Seeley wrote: Just curious. I plan on overriding the current getRangeQuery() anyway since it currently doesn't run the endpoints through the analyzer. What will you do when multiple tokens

Re: Strange sort error

2005-04-14 Thread Yonik Seeley
I haven't tried it, but I think the fix should be easy... never throw that exception. Either check for null before the loop, or in the loop. Original code for native int sorting: TermEnum termEnum = reader.terms(new Term(field, "")); try { if (termEnum.term() == null)

Re: Update performance/indexwriter.delete()?

2005-04-14 Thread Yonik Seeley
An IndexReader is required to, given a term, find the document number to mark deleted. Yeah, most of the time it makes sense to do deletions off the IndexReader. There are times, however, when it would be nice for deletes to be concurrent with adds. Q: can docids change after an add()

Re: FW: CVS Lucene 2.0

2005-04-26 Thread Yonik Seeley
Term.field is interned, so equals() isn't needed. -Yonik On 4/26/05, Peter Veentjer - Anchor Men [EMAIL PROTECTED] wrote: [...] Term other = (Term) o; return field.equals(other.field) && text.equals(other.text); } Third: if the field values of refer to

Re: CVS Lucene 2.0

2005-04-26 Thread Yonik Seeley
I don't think at this point anything structural has been proposed as different between 1.9 and 2.0. Are any of Paul Elschot's query and scorer changes being considered for 2.0? -Yonik

Re: CVS Lucene 2.0

2005-05-01 Thread Yonik Seeley
I can't say what's actually ready, but I am very interested in sparse filter representations. I'm working on a project that needs dynamic categorization of search results, and this requires caching thousands of filters. http://issues.apache.org/bugzilla/show_bug.cgi?id=32965

Re: expert question: concurrent, asynchronous batch updates and real-time reads on very large, heavily used index

2005-05-10 Thread Yonik Seeley
Once an IndexReader is opened on an index, its view of that index never changes. Reuse the same IndexReader for all query requests and only reopen it after you do your optimize. -Yonik

Re: sanity check - large, long running index updates and concurrent read-only service

2005-05-11 Thread Yonik Seeley
When created, an IndexReader opens all the segment files and hangs onto them. Any updates to the index through an IndexWriter (including commit and optimize) will not affect already open IndexReaders. -Yonik On 5/11/05, Naomi Dushay [EMAIL PROTECTED] wrote: It's my impression that with optimize

Re: FieldCache and Sort

2005-06-06 Thread Yonik Seeley
Why do we keep the lookup array around? The actual field value is needed to sort results from multiple searchers (multisearcher). -Yonik On 6/1/05, John Wang [EMAIL PROTECTED] wrote: Hi: In the current Lucene sorting implementation, FieldCache is used to retrieve 2 arrays, the lookup

Re: Lucene and numerical fields search

2005-07-12 Thread Yonik Seeley
I use ConstantScoreRangeQuery for this purpose: http://issues.apache.org/bugzilla/show_bug.cgi?id=34673 -Yonik On 7/12/05, Rifflard Mickaël [EMAIL PROTECTED] wrote: Hi all, I have been using Lucene as a fulltext search engine for a year now and it works well for this. Now, I want to add

Re: Any problems with a failed IndexWriter optimize call?

2005-08-01 Thread Yonik Seeley
If all segments were flushed to the disk (no adds since the last time the index writer was opened), then it seems like the index should be fine. The big question I have is what happens when there are in-memory segments in the case of an OOM exception during an optimize? Is data loss possible?

Re: max number of documents

2005-08-10 Thread Yonik Seeley
I think it would be 2 billion. There are many places that wouldn't like the overflow to negative docids I think... We have indexes up to 200M documents, so 1/10th the max. 64-bit ids are definitely something to think about for the near future. Who's got Lucene indexes nearing the maximum

Re: QueryParser exception on escaped backslash preceding ) character

2005-08-12 Thread Yonik Seeley
I can verify that bad things are going on with backslashes and the query parser in lucene 1.4.3 foo:hi\\ == foo:hi\ (foo:hi\\) == exception foo:hi\\ == foo:hi\\ foo:hi\\^3 == foo:hi\^3 foo:hi \\ there == foo:hi \\ there foo:'hi there' == foo:'hi foo:\ == exception foo:hi\ == foo:hi So there

intra-word delimiters

2005-08-15 Thread Yonik Seeley
Does anyone have solutions for handling intraword delimiters (case changes, non-alphanumeric chars, and alpha-numeric transitions)? If the source text is Wi-Fi, we want to be able to match the following user queries: wi fi wifi wi-fi wi+fi WiFi One way is to index wi, fi, and wifi. However,
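The splitting described here can be sketched in standalone Java (illustrative code, not Lucene's actual analyzer; the class name and boundary rules are assumptions): break a token on non-alphanumeric characters and on lowercase-to-uppercase case changes, and also emit the concatenation, so Wi-Fi indexes as wi, fi, and wifi.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of intra-word delimiter splitting: subwords plus the
// joined form, lowercased for matching.
class SubwordSplitter {
    static List<String> split(String token) {
        List<String> parts = new ArrayList<String>();
        StringBuilder cur = new StringBuilder();
        char prev = 0;
        for (char c : token.toCharArray()) {
            boolean boundary = !Character.isLetterOrDigit(c)
                || (Character.isUpperCase(c) && Character.isLowerCase(prev));
            if (boundary && cur.length() > 0) {
                parts.add(cur.toString().toLowerCase());
                cur.setLength(0);
            }
            if (Character.isLetterOrDigit(c)) cur.append(c);
            prev = c;
        }
        if (cur.length() > 0) parts.add(cur.toString().toLowerCase());
        if (parts.size() > 1) {            // also emit the concatenation
            StringBuilder joined = new StringBuilder();
            for (String p : parts) joined.append(p);
            parts.add(joined.toString());
        }
        return parts;
    }
}
```

With this rule, "Wi-Fi", "wi-fi", "wi+fi", and "WiFi" all produce the same subword terms.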

Re: intra-word delimiters

2005-08-15 Thread Yonik Seeley
That was the plan, but step (4) really seems problematic. - term expansion this way can lead to a lot of false matches - phrase queries with many bordering words break - setting term positions such that phrase queries work on all combos of subwords is non-trivial. It seems like a better

Re: WhiteSpace Tokenizer question

2005-08-23 Thread Yonik Seeley
It's the QueryParser, not the Analyzer. When the query parser sees multiple tokens from what looks like a single word, it puts them in a phrase query. I think the only way to change that behavior would be to modify the QueryParser. -Yonik On 8/23/05, Dan Armbrust [EMAIL PROTECTED] wrote: I

Re: limit lucene result

2005-09-07 Thread Yonik Seeley
The Hits object retrieves the documents lazily, so just ask it for the first 100. -Yonik On 9/7/05, haipeng du [EMAIL PROTECTED] wrote: The reason that I want to limit returned result is that I do not want to get out of memory problem. I index lucene with 3 million documents. Sometimes,

Re: cancel search

2005-09-08 Thread Yonik Seeley
You could create your own HitCollector that checked a flag on each hit, and throw an exception if it was set. In a separate thread, you could set the flag to cancel the search. -Yonik Now hiring -- http://tinyurl.com/7m67g On 9/8/05, Kunemann Frank [EMAIL PROTECTED] wrote: The problem is
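The collector idea can be modeled without any Lucene dependency (the class below is illustrative; it only mirrors the shape of Lucene's HitCollector.collect(int doc, float score)): check a volatile flag on every hit and throw to abort; a second thread flips the flag to cancel.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of a cancellable collector; not Lucene code.
class CancellableCollector {
    volatile boolean cancelled = false;          // set from another thread
    final List<Integer> docs = new ArrayList<Integer>();

    // Analogous to HitCollector.collect(int doc, float score).
    void collect(int doc, float score) {
        if (cancelled) throw new RuntimeException("search cancelled");
        docs.add(doc);
    }
}
```

The search thread catches the exception and treats the results as abandoned; the per-hit flag check is cheap compared to scoring.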

Re: Weird time results doing wildcard queries

2005-09-08 Thread Yonik Seeley
The Hits class collects the document ids from the query in batches. If you iterate beyond what was collected, the query is re-executed to collect more ids. You can use the expert level search methods on IndexSearcher if this isn't what you want. -Yonik On 9/8/05, Richard Krenek [EMAIL

Re: Weird time results doing wildcard queries

2005-09-08 Thread Yonik Seeley
On 9/8/05, Yonik Seeley [EMAIL PROTECTED] wrote: The Hits class collects the document ids from the query in batches. If you iterate beyond what was collected, the query is re-executed to collect more ids. You can use the expert level search methods on IndexSearcher if this isn't

Re: IndexReader delete doc! delete terms?

2005-09-09 Thread Yonik Seeley
Nope. The IndexReader simply sets a bit in a separate bitvector that marks the doc as deleted. All info associated with the document is removed after an IndexWriter merges the segment containing that doc with another (optimize will merge all segments and hence remove remnants of all deleted
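The bookkeeping described here can be sketched in a few lines (assumed illustrative code, not Lucene's internals): deletion only sets a bit in a side vector, and the postings stay in the segment until a merge rewrites it.

```java
import java.util.BitSet;

// Sketch of per-segment delete tracking: O(1) delete, no data removed
// until a merge drops the marked docs.
class SegmentDeletes {
    final int maxDoc;
    final BitSet deleted = new BitSet();

    SegmentDeletes(int maxDoc) { this.maxDoc = maxDoc; }

    void delete(int docid) { deleted.set(docid); }           // just a bit flip
    boolean isDeleted(int docid) { return deleted.get(docid); }
    int numDocs() { return maxDoc - deleted.cardinality(); } // live doc count
}
```

This is why numDocs() drops immediately after a delete while the index files on disk are unchanged until a merge or optimize.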

Re: IndexReader delete doc! delete terms?

2005-09-12 Thread Yonik Seeley
On 9/10/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Ok... but can i search in documents which are marked for deletion? Bye --- Original message --- From: Yonik Seeley [EMAIL PROTECTED] To: java-user@lucene.apache.org Subject: Re: IndexReader delete doc! delete

term scoring (idf) question

2005-09-15 Thread Yonik Seeley
I'm trying to figure out why idf is multiplied twice into the score of a term query. It sort of makes sense if you have just one term... the original weight is idf*boost, and the normalization factor is 1/(idf*boost), so you multiply in the idf again if you want the final score to contain an

JIRA bug messages

2005-09-16 Thread Yonik Seeley
I just updated a bug via JIRA, http://issues.apache.org/jira/browse/LUCENE-383 and I didn't see it come to any mailing list like it used to with bugzilla. Should it have? Is there a new mailing list to sign up for? -Yonik Now hiring -- http://tinyurl.com/7m67g

Re: Boost value is lost

2005-09-21 Thread Yonik Seeley
You don't get the boost back directly... it's folded into the norm for the field and does affect the score when you search against the index. -Yonik On 9/21/05, Steve Gaunt [EMAIL PROTECTED] wrote: Hi all, I was hoping someone could shed some light on this? When I set a boost for a

Re: Lucene 1.9 and Java 1.4

2005-09-28 Thread Yonik Seeley
I think your best bet for supporting Java 1.3 would be sticking with Lucene 1.4. One of the new classes that I am using is the ConstantScoreQuery. I am not sure if this is going to be included in Lucene 1.9 or not but this does make use of Java 1.4. w.r.t. java.util.BitSet, it's a pain, and I

Re: A very technical question.

2005-09-28 Thread Yonik Seeley
Field length isn't stored... It gets folded into the norm (see Similarity.lengthNorm) along with the boost at indexing time. A couple of approaches: a) index the field twice with two different Similarity implementations b) store term vectors, derive the length from them and store in the

Re: TermDocs.freq()

2005-10-03 Thread Yonik Seeley
See IndexWriter.setMaxFieldLength() -Yonik Now hiring -- http://tinyurl.com/7m67g On 10/3/05, Tricia Williams [EMAIL PROTECTED] wrote: To follow up on my post from Thursday. I have written a very basic test for TermPositions. This test allows me to identify that only the first 10001 tokens

Re: change of document ids accross optimize

2005-10-06 Thread Yonik Seeley
ids can also change as the result of an add(), not just optimize(). An add can trigger a segment merge which can squeeze out deleted docs and thus change the ids. I think everything else you said is pretty much correct. On 10/6/05, Jack McBane [EMAIL PROTECTED] wrote: I know that in general if
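The id shift described here follows from merges squeezing out deleted docs: each surviving doc's new id is its old id minus the number of deleted docs that preceded it. A minimal sketch (assumed illustrative code, not Lucene's merge logic):

```java
import java.util.BitSet;

// Sketch: compute a doc's post-merge id given the set of deleted docids
// in its segment. This is why an add() that triggers a merge can change ids.
class DocIdRemap {
    static int newId(int oldId, BitSet deleted) {
        // number of deleted docs with id < oldId
        int shift = deleted.get(0, oldId).cardinality();
        return oldId - shift;
    }
}
```

For example, with docs 1 and 3 deleted, doc 5 becomes doc 3 after the merge, so any externally held docids are invalidated.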

Re: docMap array in SegmentMergeInfo

2005-10-11 Thread Yonik Seeley
I'm not sure that looks like a safe patch. Synchronization does more than help prevent races... it also introduces memory barriers. Removing synchronization from objects that can change is very tricky business (witness the double-checked locking antipattern). -Yonik Now hiring --

Re: docMap array in SegmentMergeInfo

2005-10-12 Thread Yonik Seeley
Yonik Seeley [EMAIL PROTECTED] wrote: We've been using this in production for a while and it fixed the extremely slow searches when there are deleted documents. Who was the caller of isDeleted()? There may be an opportunity for an easy optimization to grab the BitVector and reuse

Re: docMap array in SegmentMergeInfo

2005-10-12 Thread Yonik Seeley
Here's the patch: http://issues.apache.org/jira/browse/LUCENE-454 It resulted in quite a performance boost indeed! On 10/12/05, Yonik Seeley [EMAIL PROTECTED] wrote: Thanks for the trace Peter, and great catch! It certainly does look like avoiding the construction of the docMap

Re: Non scored results

2005-10-21 Thread Yonik Seeley
It can... By the time the hitcollector is called, the documents are already scored, so you don't save any time there. But since they haven't been sorted yet, you do save the time it would take to put all the hits through the priority queue to find the top n. -Yonik On 10/21/05, Volodymyr

Re: java on 64 bits

2005-10-21 Thread Yonik Seeley
1) make sure the failure was due to an OutOfMemory exception and not something else. 2) if you have enough memory, increase the max JVM heap size (-Xmx) 3) if you don't need more than 1.5G or so of heap, use the 32 bit JVM instead (depending on architecture, it can actually be a little faster

Re: queries and filters

2005-10-21 Thread Yonik Seeley
The closest thing to that is http://issues.apache.org/jira/browse/LUCENE-330 -Yonik Now hiring -- http://forms.cnet.com/slink?231706 On 10/21/05, Rick Hillegas [EMAIL PROTECTED] wrote: I have another newbie question based on a quick glance at some classes in* org.apache.lucene.search.Query*

Re: Improving sort performance

2005-10-22 Thread Yonik Seeley
I'm not sure what type of score you are trying to do, but maybe FunctionQuery would help. http://issues.apache.org/jira/browse/LUCENE-446 -Yonik Now hiring -- http://forms.cnet.com/slink?231706 On 10/22/05, Jeff Rodenburg [EMAIL PROTECTED] wrote: I have a custom sort that completes

Re: Improving sort performance

2005-10-22 Thread Yonik Seeley
On 10/22/05, Jeff Rodenburg [EMAIL PROTECTED] wrote: This is really interesting, I haven't revved our code to this version yet. Does the score returned by FunctionQuery supersede underlying relevance scoring or is it rolled in at some base class? -- j On 10/22/05, Yonik Seeley

Re: score formula in Similarity javadoc

2005-10-26 Thread Yonik Seeley
With respect to different terms in a boolean query, they will contribute to the total score proportional to idf^2, so I think the javadoc as it exists now is probably more correct. A single TermQuery will have a final score with a single idf factor in it, but that's because of the queryweight

Re: Bad explanations

2005-10-26 Thread Yonik Seeley
To be more literal, I actually meant explain(query,hits.id(i)) On 10/26/05, Yonik Seeley [EMAIL PROTECTED] wrote: Typo... try explain(query,doc) instead of (query,i) :-)

Re: Segments file format

2005-10-26 Thread Yonik Seeley
Hi Bill, I can't seem to correctly parse it either... Format = FF FF FF FF Version = 00 00 00 00 00 00 00 28 SegCount = 00 00 00 4E = 00 00 00 04 -Yonik Now hiring -- http://forms.cnet.com/slink?231706 On 10/26/05, Bill Tschumy [EMAIL PROTECTED] wrote: I have been trying to reconstitute

Re: Segments file format

2005-10-26 Thread Yonik Seeley
There is a currently undocumented extra int32. Here's the code for writing the segment file:
    output.writeInt(FORMAT);     // write FORMAT
    output.writeLong(++version); // every write changes the index
    output.writeInt(counter);    // write counter
    output.writeInt(size());     // write infos
    for (int i = 0; i

Re: Scoring formula

2005-11-05 Thread Yonik Seeley
Lucene 1.2 is before my time, but check if the functions are implemented the same as the current version (they probably are). Scores are not naturally <= 1, but for most search methods (including all that return Hits) they are normalized to be between 0 and 1 if the highest score is greater than

Re: Question about scoring normalisation

2005-11-06 Thread Yonik Seeley
On 11/5/05, Sameer Shisodia [EMAIL PROTECTED] wrote: if so the top score should always be 1.0. Isn't so. Or does boosting multiple individual fields wreck that ? sameer The top score is scaled back to 1.0 *only* if it's greater than 1.0 So hits with scores of 4.0,2.0 will be normalized to
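The rule stated here (scale back only when the top score exceeds 1.0, with the other hits scaled proportionally) can be sketched as a standalone function (illustrative code; the class name is an assumption, not Lucene's Hits implementation):

```java
// Sketch of Hits-style score normalization: divide by the top score only
// when it is greater than 1.0; otherwise leave scores untouched.
class ScoreNorm {
    static float[] normalize(float[] scores) {
        float top = 0f;
        for (float s : scores) top = Math.max(top, s);
        if (top <= 1.0f) return scores;      // top score already <= 1.0
        float[] out = new float[scores.length];
        for (int i = 0; i < scores.length; i++) out[i] = scores[i] / top;
        return out;
    }
}
```

So scores of 4.0 and 2.0 come out as 1.0 and 0.5, while 0.8 and 0.4 pass through unchanged, which is why the top hit is not always exactly 1.0.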

Re: RangeQuery over many indexed documents seems to be buggy

2005-11-09 Thread Yonik Seeley
The limited number of terms in a range query should hopefully be addressed before Lucene 1.9 comes out. I'd give you a reference to the bug, but JIRA seems like it's currently down. search for ConstantScoreRangeQuery if interested. -Yonik Now hiring -- http://forms.cnet.com/slink?231706

Re: going from Document - IndexReader's docid

2005-11-09 Thread Yonik Seeley
There really isn't a generic way... you have to search for the document. If you have a unique id field in your document, you can find the document id quickly via IndexReader.termDocs(term) -Yonik Now hiring -- http://forms.cnet.com/slink?231706 On 11/9/05, [EMAIL PROTECTED] [EMAIL PROTECTED]

Re: Sorting: string vs int

2005-11-09 Thread Yonik Seeley
The FieldCache (which is used for sorting), uses arrays of size maxDoc() to cache field values. String sorting will involve caching a String[] (or StringIndex) and int sorting will involve caching an int[]. Unique string values are shared in the array, but the String values plus the String[]

Re: Sorting: string vs int

2005-11-10 Thread Yonik Seeley
Here is a snippet of the current StringIndex class:
    public static class StringIndex {
      /** All the term values, in natural order. */
      public final String[] lookup;
      /** For each document, an index into the lookup array. */
      public final int[] order;
    }
The order field is used for
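How such a cache can be populated is easy to sketch (assumed illustrative code, not Lucene's FieldCache loader): collect the unique field values into a sorted lookup array, then record each document's position in it, so sort comparisons become cheap int compares on order[].

```java
import java.util.Arrays;
import java.util.TreeSet;

// Sketch of building a StringIndex-style structure from per-doc values.
class StringIndexSketch {
    final String[] lookup;   // all the term values, in natural order
    final int[] order;       // for each document, an index into lookup

    StringIndexSketch(String[] valueByDoc) {
        lookup = new TreeSet<String>(Arrays.asList(valueByDoc))
                     .toArray(new String[0]);
        order = new int[valueByDoc.length];
        for (int doc = 0; doc < valueByDoc.length; doc++)
            order[doc] = Arrays.binarySearch(lookup, valueByDoc[doc]);
    }
}
```

Duplicated values share one String in lookup, but both arrays are sized by the index, which is where the memory cost discussed above comes from.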

Re: Performance Question

2005-11-11 Thread Yonik Seeley
The IndexSearcher(MultiReader) will be faster (it's what's used for indices with multiple segments too). -Yonik Now hiring -- http://forms.cnet.com/slink?231706 On 11/11/05, Mike Streeton [EMAIL PROTECTED] wrote: I have several indexes I want to search together. What performs better a single

Re: Performance Question

2005-11-11 Thread Yonik Seeley
Look at IndexReader.open() It actually uses a MultiReader if there are multiple segments. -Yonik Now hiring -- http://forms.cnet.com/slink?231706 On 11/11/05, Charles Lloyd [EMAIL PROTECTED] wrote: You should run your own tests, but I found the MultiReader to be slower than a regular

Re: Optimize vs non optimized index

2005-11-16 Thread Yonik Seeley
Do you have any deletions in the non-optimized version of the index? If so, a bug was fixed recently that made for some very slow queries: http://issues.apache.org/jira/browse/LUCENE-454 You could also try a smaller mergeFactor, which would slow indexing, but decrease the number of segments, and

Re: Field Boosting

2005-11-17 Thread Yonik Seeley
Right. getBoost() is meaningless on retrieved documents (it isn't set when a doc is read from the index). There really should have been a separate class for documents retrieved from an index vs documents added... but that's water way under the bridge. -Yonik On 11/17/05, Erik Hatcher [EMAIL

Re: Spans, appended fields, and term positions

2005-11-20 Thread Yonik Seeley
Does it make sense to add an IndexWriter setting to specify a default position increment gap to use when multiple fields are added in this way? Per-field might be nice... The good news is that Analyzer is an abstract class, and not an Interface, so we could add something to it without

Re: Spans, appended fields, and term positions

2005-11-20 Thread Yonik Seeley
It depends on Document.fields() of a stored and retrieved document: does it return all the appended field parts as separate Fields, or does it only return one Field with all parts appended? Separate fields. Stored fields are returned back to you verbatim. -Yonik Now hiring --

Re: High CPU utilization with sort

2005-11-20 Thread Yonik Seeley
I haven't done measurements, but the first query with a sort on a particular field will involve filling the field-cache and that can take a while (especially for numeric fields). If you haven't already, you should compare the query times of a warmed searcher. Sorted queries will still take

Re: High CPU utilization with sort

2005-11-20 Thread Yonik Seeley
On 11/20/05, Jeff Rodenburg [EMAIL PROTECTED] wrote: Why are numeric fields more onerous in filling the field-cache? Float.parseFloat() or Integer.parseInt() for each unique term. -Yonik Now hiring -- http://forms.cnet.com/slink?231706

Re: Re-Opening IndexSearcher

2005-11-20 Thread Yonik Seeley
Karl, You are opening IndexSearchers in this code but not closing them. If GC finalizers don't happen to run before you run out of file handles, you will get exceptions. You could close the IndexSearcher after every request, but it would lead to very poor performance. Better to keep a single

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Yonik Seeley
This is expected behavior: you are probably quickly becoming CPU bound (which isn't a bad thing). More threads only help when some threads are waiting on IO, or if you actually have a lot of CPUs in the box. -Yonik Now hiring -- http://forms.cnet.com/slink?231706 On 11/21/05, Oren Shir [EMAIL

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Yonik Seeley
On 11/21/05, Oren Shir [EMAIL PROTECTED] wrote: It is rather sad if 10 threads reach the CPU limit. I'll check it and get back to you. It's about performance and throughput though, not about number of threads it takes to reach saturation. In a 2 CPU box, I would say that the ideal situation is

Re: Spans, appended fields, and term positions

2005-11-21 Thread Yonik Seeley
On 11/21/05, Erik Hatcher [EMAIL PROTECTED] wrote: Modifying Analyzer as you have suggested would require DocumentWriter additionally keep track of the field names and note when one is used again. For position increments, it doesn't have to be tracked. The patch to DocumentWriter could also

Re: How does lucene choose a field for sort?

2005-11-21 Thread Yonik Seeley
On 11/21/05, Erik Hatcher [EMAIL PROTECTED] wrote: Neither. It'll throw an exception. Just don't rely on it to throw an exception either though... the checking is not comprehensive. One should treat sorting on a field with more than one value per document as undefined. -Yonik Now hiring --

Re: Lotka's law and Lucene

2005-11-22 Thread Yonik Seeley
And of course Doug still does a lot of work on Lucene, but often leaves the commit to someone else. On 11/22/05, Daniel Naber [EMAIL PROTECTED] wrote: On Dienstag 22 November 2005 19:33, aurora wrote: (http://www.javarants.com/B1823453972/C1460559707/E20051119163857/index. html). Lucene is

Re: reverse sort

2005-11-28 Thread Yonik Seeley
G, I think it's that AUTO sorting again... Check out this bug: http://issues.apache.org/jira/browse/LUCENE-463 If you specify a string sort explicitly, it should work. If you are using a multisearcher, please upgrade to the latest lucene version (there have been some sorting bug fixes).

FunctionQuery

2005-11-30 Thread Yonik Seeley
I finally got around to updating FunctionQuery: http://issues.apache.org/jira/browse/LUCENE-446 Comments and suggestions welcome. -Yonik Now hiring -- http://forms.cnet.com/slink?231706

Re: Lucene performance bottlenecks

2005-12-07 Thread Yonik Seeley
I checked out readVInt() to see if I could optimize it any... For a random distribution of integers < 200 I was able to speed it up a little bit, but nothing to write home about:
                   old    new    percent
    Java14-client: 13547  12468  8%
    Java14-server: 6047   5266   14%

Re: JVM Crash in Lucene

2005-12-08 Thread Yonik Seeley
The only problems I've had with 1.5 JVM crashes and Lucene was related to stack overflow... try increasing the stack size and see of anything different happens. My crashes happened while trying to use Luke to open a 4GB index with thousands of indexed fields. -Yonik

Re: JVM Crash in Lucene

2005-12-11 Thread Yonik Seeley
Sounds like it's a hotspot bug. AFAIK, hotspot doesn't just compile a method once... it can do optimization over time. To work around it, have you tried the previous version: 1.5_05? It's possible it's a fairly new bug. We've been running with that version and Lucene 1.4.3 without problems (on

Re: JVM Crash in Lucene

2005-12-11 Thread Yonik Seeley
You also might try -Xbatch or -Xcomp to see if that fixes it (or reproduces it faster). Here's a great list of JVM options: http://blogs.sun.com/roller/resources/watt/jvm-options-list.html -Yonik On 12/11/05, Yonik Seeley [EMAIL PROTECTED] wrote: Sounds like it's a hotspot bug. AFAIK, hotspot

Re: DistributingMultiFieldQueryParser and DisjunctionMaxQuery

2005-12-14 Thread Yonik Seeley
On 12/14/05, Chuck Williams [EMAIL PROTECTED] wrote: If there is some specific reason it is not deemed suitable to commit, please let me know. It is much harder to use DisjunctionMaxQuery without this parser. Hey Chuck, I committed DisjunctionMaxQuery after I took the time to understand it,

Re: all stop words in exact phrase get 0 hits

2005-12-15 Thread Yonik Seeley
Are you using the same Analyzer for both indexing and querying (or the same StopFilter at least)? -Yonik On 12/15/05, javier muguruza [EMAIL PROTECTED] wrote: Hi, Suppose I have a query like this: +attachments:purpose that returns N hits. If I add another condition +attachments:purpose

Re: all stop words in exact phrase get 0 hits

2005-12-16 Thread Yonik Seeley
I can't reproduce this behavior with the current version of Lucene. +text:solar = 112 docs +text:"a a a" = 0 docs because a is a stop word +text:solar +text:"a a a" = 112 docs -Yonik On 12/15/05, javier muguruza [EMAIL PROTECTED] wrote: Hi, Suppose I have a query like this:

Re: Filtering after Query

2005-12-18 Thread Yonik Seeley
W.r.t. ConstantScoringQuery, it contains a minor bug: it doesn't handle the case where the Filter.bits method would return null. Can Filter.bits() ever return null though? AFAIK, that's not in the contract. The Filter.getBits() javadoc says: Returns a BitSet with true for

Re: ParseQuery with quotes

2005-12-20 Thread Yonik Seeley
On 12/20/05, John Powers [EMAIL PROTECTED] wrote: I would like to be able to search for 19 inches with the quote. So I get a query like this: Line 1: +( (name:19*^4 ld:19*^2 sd:19*^3 kw:19*^1) ) That won't work, so I wanted to escape the quotes.The docs said to use a backslash. So

Re: ParseQuery with quotes

2005-12-20 Thread Yonik Seeley
Here's more on query-parser escaping gotchas: http://www.mail-archive.com/java-user@lucene.apache.org/msg02354.html -Yonik

Re: ParseQuery with quotes

2005-12-20 Thread Yonik Seeley
On 12/20/05, John Powers [EMAIL PROTECTED] wrote: Ok, I understand the .toString() part. But, if I have some 19 in the text of these items, and I do a search with 19, that has been escaped before parsingwhy am I not getting anything? The indexer analyzer took them out? So then to find

Re: Indexing and deleting simultaneously..

2005-12-27 Thread Yonik Seeley
That shouldn't happen. What platform(s) have you seen this on, and with what Lucene versions? -Yonik On 12/27/05, Chris Lu [EMAIL PROTECTED] wrote: This is generally true, most of the time. But my experience is, there can be some FileNotFoundException, if your searcher is opened for a while,

Re: More than 32 required/prohibited clauses in query

2005-12-27 Thread Yonik Seeley
That's a Lucene 1.4 limitation, gone in the latest 1.9 development version. If you want to stick with 1.4, try restructuring your query to avoid this restriction. -Yonik On 12/27/05, Alex Kiselevski [EMAIL PROTECTED] wrote: I got a strange exception More than 32 required/prohibited clauses in

Re: More than 32 required/prohibited clauses in query

2005-12-27 Thread Yonik Seeley
/lucene/java/trunk lucene cd lucene ant -Yonik On 12/27/05, Alex Kiselevski [EMAIL PROTECTED] wrote: I didn't find a mention about 1.9 version in Lucene site -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 27, 2005 4:52 PM To: java-user

Re: Indexing and deleting simultaneously..

2005-12-27 Thread Yonik Seeley
On 12/27/05, Yonik Seeley [EMAIL PROTECTED] wrote: That shouldn't happen. What platform(s) have you seen this on, and with what Lucene versions? -Yonik On 12/27/05, Chris Lu [EMAIL PROTECTED] wrote: This is generally true, most of the time. But my experience

Re: Starts with query?

2006-01-05 Thread Yonik Seeley
Off the top of my head: 1) also index the field untokenized and use a straight prefix query 2) index a magic token at the start of the title and include that in a phrase query: _START_ the quick 3) use a SpanFirst query (but you have to make the Java Query object yourself) -Yonik On 1/5/06,
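Option (2) above can be modeled with plain Java (illustrative helper code; the class name, the _START_ token, and whitespace tokenization are assumptions): prepend a magic token when indexing the title, then a starts-with search is just a phrase anchored by that token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the magic-token trick for starts-with matching.
class StartAnchor {
    static final String START = "_START_";

    // Tokens as they would be indexed: magic token first, then the words.
    static List<String> indexTokens(String title) {
        List<String> toks = new ArrayList<String>();
        toks.add(START);
        toks.addAll(Arrays.asList(title.toLowerCase().split("\\s+")));
        return toks;
    }

    // A starts-with query for "the quick" is the phrase: _START_ the quick
    static boolean startsWith(List<String> indexed, String prefix) {
        List<String> q = new ArrayList<String>();
        q.add(START);
        q.addAll(Arrays.asList(prefix.toLowerCase().split("\\s+")));
        if (q.size() > indexed.size()) return false;
        return indexed.subList(0, q.size()).equals(q);
    }
}
```

Because the phrase must include _START_, it can only match at the beginning of the field, which is exactly what a SpanFirst query would enforce.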

Re: Starts with query?

2006-01-05 Thread Yonik Seeley
Check out PhrasePrefixQuery. -Yonik On 1/5/06, Paul Smith [EMAIL PROTECTED] wrote: first off response to my own post, I meant PhraseQuery instead. But, since we're only tokenizing this field ,and not storing the entire contents of the field, I'm not sure this is ever going to work, is it?

Re: Starts with query?

2006-01-05 Thread Yonik Seeley
That's deprecated now of course... so you want MultiPhraseQuery. -Yonik On 1/5/06, Yonik Seeley [EMAIL PROTECTED] wrote: Check out PhrasePrefixQuery. -Yonik On 1/5/06, Paul Smith [EMAIL PROTECTED] wrote: first off response to my own post, I meant PhraseQuery instead. But, since we're

Re: need some advice/help with negative query.

2006-01-06 Thread Yonik Seeley
Should we detect the case of all negative clauses and throw in a MatchAllDocsQuery? I guess this would be done in the QueryParser, but one could also make a case for doing it in the BooleanQuery. -Yonik On 1/6/06, Erik Hatcher [EMAIL PROTECTED] wrote: With Lucene's trunk, there is a

Re: how to forbid prefetching found Documents?

2006-01-07 Thread Yonik Seeley
The actual fields of found documents are not prefetched, only the ids. And imagine, that user is on fourth page - reading first 100 document is waste of time. As it relates to document ids, you must know what the first 100 are if you are to know which ones follow. If you want more control

Re: need some advice/help with negative query.

2006-01-07 Thread Yonik Seeley
+1 from me. -Yonik On 1/7/06, Erik Hatcher [EMAIL PROTECTED] wrote: +1 to Hoss's suggested enhancement to QueryParser. I'll volunteer to implement this barring any objections in the next day or so. Erik

Re: numDocs() after undeleteAll()

2006-01-08 Thread Yonik Seeley
Are you using the latest version of Lucene (after Dec 8th)? There was a bug fix regarding this: http://issues.apache.org/jira/browse/LUCENE-479 -Yonik On 1/8/06, Koji Sekiguchi [EMAIL PROTECTED] wrote: Hello Luceners! steps: 1. index has 15 docs and has no deleted docs 2. call

Re: Deleting a Document

2006-01-08 Thread Yonik Seeley
Closing the reader that did the deletion causes the deletions to be flushed to the index. After that point, any new readers you open will see the deletions. Any old index readers that were opened before the deleting reader was closed will still see the old version of the index (without the

Re: Lock obtain timed out + IndexSearcher

2006-01-09 Thread Yonik Seeley
Lock files aren't contained in the index directory, but in the standard temp directory. remove the file referenced in the exception: C:\DOCUME~1\harini\LOCALS~1\Temp\lucene-1b92bc48efc5c13ac4ef4ad9fd17c158-commit.lock -Yonik On 1/9/06, Harini Raghavan [EMAIL PROTECTED] wrote: Hi All, All of a

Re: RF and IDF

2006-01-11 Thread Yonik Seeley
Click on Source Repository off of the main Lucene page. Here is a pointer to the search package containing TermQuery/Weight/Scorer http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/search/?sortby=file#dirlist Look in TermQuery for TermWeight (it's an inner class).

Re: Generating phrase queries from term queries

2006-01-11 Thread Yonik Seeley
A phrase query with slop scores matching documents higher when the terms are closer together: "a b c"~1 -Yonik On 1/10/06, Eric Jain [EMAIL PROTECTED] wrote: Is there an efficient way to determine if two or more terms frequently appear next to each other in sequence? For a query like: a b c

Re: BTree

2006-01-12 Thread Yonik Seeley
On 1/12/06, Kan Deng [EMAIL PROTECTED] wrote: Many thanks, Doug. A quick question, which class implements the following logic? It looks to me like org.apache.lucene.index.TermInfosReader -Yonik

Re: non-standard query

2006-01-19 Thread Yonik Seeley
Check out minNrShouldMatch in BooleanQuery in the latest lucene version (1.9 dev version in subversion). -Yonik On 1/19/06, Anton Potehin [EMAIL PROTECTED] wrote: Suppose that the search query contains 20 terms. It is necessary to find all documents which contains at least 5 terms from search

Re: Limiting hits?

2006-01-19 Thread Yonik Seeley
Are you certain? I am quite sure we retrieve a huge amount of data if there are thousands of matches to one query. -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: Thu 2006-01-19 16:45 To: java-user@lucene.apache.org Subject: Re: Limiting hits

Re: Document similarity

2006-01-20 Thread Yonik Seeley
If you didn't want to store term vectors you could also run the document fields through the analyzer yourself and collect the Tokens (you should still have the fields you just indexed... no need to retrieve it again). -Yonik On 1/20/06, Klaus [EMAIL PROTECTED] wrote: In my case, i need to

Re: Sorting by calculated custom score at search time

2006-01-24 Thread Yonik Seeley
It's not in subversion yet though ;-) You have to look here: http://issues.apache.org/jira/browse/LUCENE-446 I haven't committed it, because we may be able to do better (maybe removing the difference between Query and ValueSource so you could freely mix the two and not have to wrap ValueSource

Re: Throughput doesn't increase when using more concurrent threads

2006-01-25 Thread Yonik Seeley
Thanks Peter, that's useful info. Just out of curiosity, what kind of box is this? what CPUs? -Yonik On 1/25/06, Peter Keegan [EMAIL PROTECTED] wrote: This is just fyi - in my stress tests on a 8-cpu box (that's 8 real cpus), the maximum throughput occurred with just 4 query threads. The

Re: Throughput doesn't increase when using more concurrent threads

2006-01-25 Thread Yonik Seeley
On 1/25/06, Peter Keegan [EMAIL PROTECTED] wrote: It's a 3GHz Intel box with Xeon processors, 64GB ram :) Nice! Xeon processors are normally hyperthreaded. On a linux box, if you cat /proc/cpuinfo, you will see 8 processors for a 4 physical CPU system. Are you positive you have 8 physical

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Yonik Seeley
Hmmm, can you run the 64 bit version of Windows (and hence a 64 bit JVM?) We're running with heap sizes up to 8GB (RH Linux 64 bit, Opterons, Sun Java 1.5) -Yonik On 1/26/06, Peter Keegan [EMAIL PROTECTED] wrote: Paul, I tried this but it ran out of memory trying to read the 500Mb .fdt file.

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Yonik Seeley
threads, which is pretty impressive. Another way around the concurrency limit is to run multiple jvms. The throughput of each is less, but the aggregate throughput is higher. Peter On 1/26/06, Yonik Seeley [EMAIL PROTECTED] wrote: Hmmm, can you run the 64 bit version of Windows (and hence
