Re: Lucene 2.0.1 release date

2006-12-19 Thread Doug Cutting
Steven Rowe wrote: 2.1 is much more likely to be the label used for the next release than 2.0.1. The roadmap in Jira shows 21 issues scheduled for 2.0.1. If there is in fact no intent to merge these into the 2.0 branch, these should probably be retargeted for 2.1.0, and the 2.0.1 version

Re: Lucene scoring: coord_q_d factor

2006-12-19 Thread Doug Cutting
Karl Koch wrote: Are there any other papers that regard the combination of coordination level matching and TFxIDF as advantageous? We independently developed coordination-level matching combined with TFxIDF when I worked at Apple. This is documented in:

Re: Oracle and Lucene Integration

2006-11-22 Thread Doug Cutting
Marcelo Ochoa wrote: Then I'll move the code outside the lucene-2.0 code tree to be packed as a subdirectory of the contrib area, for example. Another alternative is to make a small zip file and send it to the list as an attachment, as a preliminary (alpha-alpha version ;) This sounds like great

Re: Searching by bit masks

2006-11-10 Thread Doug Cutting
Erick Erickson wrote: Something like Document doc = new Document(); doc.add(flag1, Y); doc.add(flag2, Y); IndexWriter.add(doc); Fields have overheads. It would be more efficient to implement this as a single field with a different value for each boolean flag (as others have suggested).
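
Note: a minimal sketch of the single-field idea, not code from the thread. It assumes the flags arrive as an int bit mask and that each set bit becomes one term; the term names ("flag0", "flag1", ...) and the class name are hypothetical. The joined string could then be indexed as one whitespace-tokenized field instead of one field per flag.

```java
import java.util.ArrayList;
import java.util.List;

class FlagTerms {
    /** Returns one term per set bit, lowest bit first. Term names are illustrative only. */
    static List<String> termsFor(int mask) {
        List<String> terms = new ArrayList<>();
        for (int bit = 0; bit < 32; bit++) {
            if ((mask & (1 << bit)) != 0) {
                terms.add("flag" + bit);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Bits 0 and 2 are set, so two terms end up in the single field value.
        System.out.println(String.join(" ", termsFor(0b101)));
    }
}
```

A search for documents with a given flag then becomes an ordinary term query on that one field.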

Re: DateTools oddity....

2006-10-18 Thread Doug Cutting
Michael J. Prichard wrote: I get this output: Tue Aug 01 21:15:45 EDT 2006 That's August 2, 2006 at 01:15:45 GMT. 20060802 Huh?! Should it be: 20060801 DateTools uses GMT. Doug
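
Note: the date arithmetic in this exchange can be checked with plain java.time. This sketch does not call Lucene's DateTools; the class name and the use of the America/New_York zone for "EDT" are assumptions made for illustration.

```java
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

class GmtDayDemo {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Formats the same instant as a yyyyMMdd day string in GMT (UTC). */
    static String dayInGmt(ZonedDateTime when) {
        return when.withZoneSameInstant(ZoneOffset.UTC).format(DAY);
    }

    public static void main(String[] args) {
        // Tue Aug 01 21:15:45 EDT 2006, as in the message above.
        ZonedDateTime local =
                ZonedDateTime.of(2006, 8, 1, 21, 15, 45, 0, ZoneId.of("America/New_York"));
        System.out.println(local.format(DAY)); // 20060801 (local calendar day)
        System.out.println(dayInGmt(local));   // 20060802 (GMT calendar day)
    }
}
```

EDT is four hours behind GMT, so any local time after 20:00 already falls on the next GMT day.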

Re: Seeing what's occupying all the space in the index

2006-05-26 Thread Doug Cutting
Rob Staveley (Tom) wrote: Is there a tool I can use to see how much of the index is occupied by the different fields I am indexing? Note that IndexReader has a main() that will list the contents of compound index files. Doug

Re: How are results merged from a multisearcher?

2006-05-22 Thread Doug Cutting
Tom Emerson wrote: Thanks for the clarification. What then is the difference between a MultiSearcher and using an IndexSearcher on a MultiReader? The results should be identical. A MultiSearcher permits use of ParallelMultiSearcher and RemoteSearchable, for parallel and/or distributed

Re: Changing the scoring (newest doc date first)

2006-05-22 Thread Doug Cutting
Marcus Falck wrote: There is however one LARGE problem that we have run into. All search results should be displayed sorted with the newest document at top. We tried to accomplish this using Lucene's sort capabilities but quickly ran into large performance bottlenecks. So I figured since the

Re: Ask for a better solution for the case

2006-04-28 Thread Doug Cutting
hu andy wrote: Hi, I have an application that needs to mark the retrieved documents which have been read. So the next time I needn't read the marked documents again. You could mark the documents as deleted, then later clear deletions. So long as you don't close the IndexReader, the deletions

Re: Lucene search benchmark/stress test tool

2006-04-27 Thread Doug Cutting
Sunil Kumar PK wrote: I want to know is there any possibility or method to merge the weight calculation of index 1 and its search in a single RPC instead of doing the both function in separate steps. To score correctly, weights from all indexes must be created before any can be searched.
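
Note: one likely reason weights cannot be computed per index is that term weights depend on document frequencies summed over all indexes. The sketch below assumes the idf form ln(numDocs / (docFreq + 1)) + 1, which is how I recall Lucene's default similarity of this era; treat that formula, the class name, and the sample counts as assumptions.

```java
class GlobalIdfDemo {
    /** Assumed idf form: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** idf computed from statistics summed over every index. */
    static double globalIdf(int[] docFreqs, int[] numDocs) {
        int totalDocFreq = 0;
        int totalDocs = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            totalDocFreq += docFreqs[i];
            totalDocs += numDocs[i];
        }
        return idf(totalDocFreq, totalDocs);
    }

    public static void main(String[] args) {
        // The same term is rare in index 1 and common in index 2 (made-up counts).
        int[] docFreqs = {1, 500};
        int[] numDocs = {1000, 1000};
        System.out.println("index 1 alone: " + idf(docFreqs[0], numDocs[0]));
        System.out.println("index 2 alone: " + idf(docFreqs[1], numDocs[1]));
        System.out.println("all indexes:   " + globalIdf(docFreqs, numDocs));
    }
}
```

If each index used its own local value, scores for the same term would not be comparable when the hits are merged, which is why the statistics have to be gathered first.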

Re: RAM Directory / querying Performance issue

2006-04-26 Thread Doug Cutting
Is this markedly faster than using an MMapDirectory? Copying all this data into the Java heap (as RAMDirectory does) puts a tremendous burden on the garbage collector. MMapDirectory should be nearly as fast, but keeps the index out of the Java heap. Doug z shalev wrote: I've rewritten
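
Note: the difference described here can be illustrated with the JDK's own memory mapping. This is a sketch using java.nio only, not Lucene's MMapDirectory; the class and method names are hypothetical. A mapped buffer is backed by the operating system's page cache rather than by a byte array on the Java heap, so the garbage collector never has to trace or copy the file contents.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MmapDemo {
    /** Writes text to a temp file, maps it read-only, and returns the first mapped byte. */
    static char firstCharViaMmap(String text) {
        try {
            Path file = Files.createTempFile("mmap-demo", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, text.getBytes(StandardCharsets.US_ASCII));
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                // The mapping lives outside the Java heap; pages are loaded on demand by the OS.
                MappedByteBuffer mapped =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                return (char) mapped.get(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstCharViaMmap("lucene"));
    }
}
```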

Re: Using Lucene for searching tokens, not storing them.

2006-04-14 Thread Doug Cutting
karl wettin wrote: Do I have to worry about passing a null Directory to the default constructor? A null Directory should not cause you problems. Doug

Re: MultiReader and MultiSearcher

2006-04-11 Thread Doug Cutting
Peter Keegan wrote: Oops. I meant to say: Does this mean that an IndexSearcher constructed from a MultiReader doesn't merge the search results and sort the results as if there was only one index? It doesn't have to, since a MultiReader *is* a single index. A quick test indicates that it does

Re: Distributed Lucene.. - clustering as a requirement

2006-04-10 Thread Doug Cutting
Dmitry Goldenberg wrote: For an enterprise-level application, Lucene appears too file-system and too byte-sequence-centric a technology. Just my opinion. The Directory API is just too low-level. There are good reasons why Lucene is not built on top of a RDBMS. An inverted index is not

Re: Lucene Document order not being maintained?

2006-04-05 Thread Doug Cutting
Dan Armbrust wrote: My indexing process works as follows (and some of this is hold-over from the time before lucene had a compound file format - so bear with me) I open up a File based index - using a merge factor of 90, and in my current test, the compound index format. When I have added

Re: Data structure of a Lucene Index

2006-03-30 Thread Doug Cutting
I talked about this a bit in a presentation at Haifa last year: http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf See the section on Seek versus Transfer. Doug Prasenjit Mukherjee wrote: It seems to me that lucene doesn't use B-tree for its indexing storage. Any

Re: Lucene Performance Issues

2006-03-28 Thread Doug Cutting
thomasg wrote: Hi, we are currently intending to implement a document storage / search tool using Jackrabbit and Lucene. We have been approached by a commercial search and indexing organisation called ISYS who are suggesting the following problems with using Lucene. We do have a requirement to

Re: span query scoring vs boolean query scoring

2006-03-27 Thread Doug Cutting
Vincent Le Maout wrote: Am I missing something? Is it intended or is it a bug? Looks like a bug. Can you submit a patch? Doug

Re: span query scoring vs boolean query scoring

2006-03-27 Thread Doug Cutting
Vincent Le Maout wrote: Am I missing something? Is it intended or is it a bug? Looks like a bug. Can you please submit a bug report, and, ideally, attach a patch? Thanks, Doug

Re: Lucene indexing on Hadoop distributed file system

2006-03-27 Thread Doug Cutting
Igor Bolotin wrote: If somebody is interested - I can post our changes in TermInfosWriter and SegmentTermEnum code, although they are pretty trivial. Please submit this as a patch attached to a bug report. I contemplated making this change to Lucene myself, when writing Nutch's FsDirectory,

Re: Lucene indexing on Hadoop distributed file system

2006-03-27 Thread Doug Cutting
Igor Bolotin wrote: Does it make sense to change TermInfosWriter.FORMAT in the patch? Yes. This should be updated for any change to the format of the file, and this certainly constitutes a format change. This discussion should move to [EMAIL PROTECTED] Doug

Re: Multiple threads in Lucene

2006-03-23 Thread Doug Cutting
Olivier Jaquemet wrote: IndexReader.unlock(indexDir); // unlock directory in case of improper shutdown This should be used very carefully. In particular, you should only call it when you are certain that no other applications are accessing the index. Doug

Re: lucene NFS support

2006-03-23 Thread Doug Cutting
Dai, Chunhe wrote: Does anyone know whether Lucene plans to support NFS in a later release (2.0)? We are planning to integrate Lucene into our products and cluster support is definitely needed. We want to check whether NFS support is in the plan or not before implementing a new file locking

Re: Lookup Issues

2006-03-22 Thread Doug Cutting
The Hits-based search API is optimized for returning earlier hits. If you want the lowest-scoring matches, then you could reverse-sort the hits, so that these are returned first. Or you could use the TopDocs-based API to retrieve hits up to your toHits. (Hits-based search is implemented

Re: Throughput doesn't increase when using more concurrent threads

2006-03-17 Thread Doug Cutting
Peter Keegan wrote: I did some additional testing with Chris's patch and mine (based on Doug's note) vs. no patch and found that all 3 produced the same throughput - about 330 qps - over a longer period. Was CPU utilization 100%? If not, where do you think the bottleneck now is? Network? Or

Re: Lucene job

2006-03-17 Thread Doug Cutting
Michael Wechner wrote: Maybe it would make sense to sort it alphabetically [ ... ] +1 This should be sorted alphabetically by business name or last name. That's what it says on the page, although a few entries are out of place. Please feel free to fix this. Doug

Re: Can Lucene load more then 2GB into RAM memory?

2006-03-16 Thread Doug Cutting
is basically the same if not better!!! if anyone is interested let me know Doug Cutting [EMAIL PROTECTED] wrote: RAMDirectory is indeed currently limited to 2GB. This would not be too hard to fix. Please file a bug report. Better yet, attach a patch. I assume you're running a 64bit JVM. If so

Re: TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Doug Cutting
Erick Erickson wrote: Could you point me to any explanation of *why* range queries expand this way? It's just what they do. They were contributed a long time ago, before things like RangeFilter or ConstantScoreRangeQuery were written. The latter are relatively recent additions to Lucene

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Doug Cutting
Are you changing the default mergeFactor or other settings? If so, how? Large mergeFactors are generally a bad idea: they don't make things faster in the long run and they chew up file handles. Are all searches reusing a single IndexReader? They should. This is the other most common

Re: PhraseQuery and edit distance slightly confusing.

2006-03-15 Thread Doug Cutting
Dawid Weiss wrote: I get the concept implemented in PhraseQuery but isn't calling it an edit distance a little bit far fetched? Yes, it should probably be called edit-distance-like or something. Only the marginal elements (minimum and maximum distance from their respective query positions)

Re: Can Lucene load more then 2GB into RAM memory?

2006-03-13 Thread Doug Cutting
RAMDirectory is indeed currently limited to 2GB. This would not be too hard to fix. Please file a bug report. Better yet, attach a patch. I assume you're running a 64bit JVM. If so, then MMapDirectory might also work well for you. Doug z shalev wrote: this is in continuation of a

Re: Lucene version 1.9

2006-03-07 Thread Doug Cutting
WATHELET Thomas wrote: I've created an index with Lucene version 1.9 and when I try to open this index I always get this error message: java.lang.ArrayIndexOutOfBoundsException. If I use an index built with Lucene version 1.4.3 it works. What's wrong? Are you perhaps trying to open

Re: Throughput doesn't increase when using more concurrent threads

2006-03-07 Thread Doug Cutting
Peter Keegan wrote: I ran a query performance tester against 8-cpu and 16-cpu Xeon servers (16/32 cpu hyperthreaded) on Linux. Here are the results: 8-cpu: 275 qps 16-cpu: 305 qps (the dual-core Opteron servers are still faster) Here is the stack trace of 8 of the 16 query threads during the

Re: Hacking proximity search: looking for feedback

2006-03-01 Thread Doug Cutting
Jeff Rodenburg wrote: Following on the Range Query approach, how is performance? I found the range approach (albeit with the exact values) to be slower than the parsed-string approach I posited. Note that Hoss suggested RangeFilter, not RangeQuery. Or perhaps ConstantScoreRangeQuery, which

Lucene 1.9-final release available

2006-03-01 Thread Doug Cutting
Release 1.9-final of Lucene is now available from: http://www.apache.org/dyn/closer.cgi/lucene/java/ This release has many improvements since release 1.4.3, including new features, performance improvements, bug fixes, etc. For details, see:

Re: Indexing speed

2006-02-24 Thread Doug Cutting
revati joshi wrote: hi all, I just wanted to know how to increase the speed of indexing of files. I tried it by using a multithreading approach but couldn't get much better performance. It was the same as in usual sequential indexing. Is there any other approach to get better

Re: Frequency of phrase

2006-02-24 Thread Doug Cutting
Eric Jain wrote: This gives you the number of documents containing the phrase, rather than the number of occurrences of the phrase itself, but that may in fact be good enough... If you use a span query then you can get the actual number of phrase instances. Doug

Lucene 1.9 RC1 release available

2006-02-22 Thread Doug Cutting
Release 1.9 RC1 of Lucene is now available from: http://www.apache.org/dyn/closer.cgi/lucene/java/ This release candidate has many improvements since release 1.4.3, including new features, performance improvements, bug fixes, etc. For details, see:

Re: BM25 Similarity implementation

2006-02-16 Thread Doug Cutting
Trieschnigg, R.B. (Dolf) wrote: I would like to implement the Okapi BM25 weighting function using my own Similarity implementation. Unfortunately BM25 requires the document length in the score calculation, which is not provided by the Scorer. How do you want to measure document length? If

Re: Boosting

2006-02-13 Thread Doug Cutting
Sebastian Menge wrote: Or, to put it more simply, what does a boost of 2 or 10 _mean_ in contrast to a boost of 0.5 or 0.1 !? Boosts are simply multiplied into scores. So they only mean something in the context of the rest of the scoring mechanism.
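
Note: a toy illustration of "multiplied into scores", assuming made-up raw scores and ignoring every other factor in the real scoring formula (normalization, coord, and so on). The class name is hypothetical. It shows why a boost has no absolute meaning: the same boost can change the ranking or leave it untouched, depending on the scores it competes with.

```java
class BoostDemo {
    /** A boost is a plain multiplier applied to a raw score. */
    static float boosted(float rawScore, float boost) {
        return rawScore * boost;
    }

    public static void main(String[] args) {
        float docA = 0.40f; // raw score of document A (made-up value)
        float docB = 0.30f; // raw score of document B (made-up value)

        // A boost of 2 on B is enough to lift it above A.
        System.out.println(boosted(docB, 2.0f) > docA);
        // A boost of 1.2 on B is not enough, so the ranking stays as it was.
        System.out.println(boosted(docB, 1.2f) > docA);
    }
}
```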

Re: Performance tips?

2006-01-27 Thread Doug Cutting
Daniel Pfeifer wrote: We are sporting Solaris 10 on a Sun Fire-machine with four cores and 12GB of RAM and mirrored Ultra 320-disks. I guess I could try switching to FSDirectory and hope for the best. Or, since you're on a 64-bit platform, try MMapDirectory, which supports greater parallelism

Re: Performance tips?

2006-01-27 Thread Doug Cutting
Daniel Pfeifer wrote: Are we both talking about Lucene? I am using Lucene 1.4.3 and can't find a class called MapDirectory or MMapDirectory. It is post-1.4. You can download a nightly build of the current trunk at: http://cvs.apache.org/dist/lucene/java/nightly/ Doug

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Doug Cutting
Doug Cutting wrote: A 64-bit JVM with NioDirectory would really be optimal for this. Oops. I meant MMapDirectory, not NioDirectory. Doug

Re: Throughput doesn't increase when using more concurrent threads

2006-01-25 Thread Doug Cutting
Peter Keegan wrote: This is just fyi - in my stress tests on a 8-cpu box (that's 8 real cpus), the maximum throughput occurred with just 4 query threads. The query throughput decreased with fewer than 4 or greater than 4 query threads. The entire index was most likely in the file system cache,

Re: AW: Boolean Query

2006-01-12 Thread Doug Cutting
Klaus wrote: I have tried to study the Lucene scoring in the default similarity. Can anyone explain to me how this similarity was designed? I have read a lot of IR literature, but I have never seen an equation like the one used in Lucene. Why is this better than the normal cosine-measure? It

Re: BTree

2006-01-12 Thread Doug Cutting
B-trees are best for random, incremental updates. They require log_b(N) disk accesses for inserts, deletes and accesses, where b is the number of entries per page, and N is the total number of entries in the tree. But that's too slow for text indexing. Rather Lucene uses a combination of
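
Note: the log_b(N) figure can be made concrete with a small calculation. This sketch computes the smallest number of levels h with b^h >= N using integer arithmetic; the class name and the sample values of b and N are assumptions chosen for illustration, not numbers from the message.

```java
class BTreeCost {
    /** Smallest h such that b^h >= n, i.e. the ceiling of log_b(n). */
    static int levels(long n, int b) {
        if (b < 2 || n < 1) {
            throw new IllegalArgumentException("need b >= 2 and n >= 1");
        }
        int h = 0;
        long capacity = 1;
        while (capacity < n) {
            capacity *= b; // one more level multiplies the reachable entries by b
            h++;
        }
        return h;
    }

    public static void main(String[] args) {
        // One billion entries with 100 or 1000 entries per page (made-up values).
        System.out.println(levels(1_000_000_000L, 100));
        System.out.println(levels(1_000_000_000L, 1000));
    }
}
```

Each level corresponds to roughly one page read, so the per-operation cost is small, but it is paid again for every term occurrence added during indexing.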

Re: Merging with IndexWriter.addIndexes(...)

2005-12-08 Thread Doug Cutting
J.J. Larrea wrote: So... I notice that both IndexWriter.addIndexes(...) merge methods start and end with calls to optimize() on the target index. I'm not sure whether that is causing the unpacking and repacking I observe, but it does make me wonder whether they truly need to be there: I don't

Re: Lucene performance bottlenecks

2005-12-07 Thread Doug Cutting
Paul Elschot wrote: Querying the host field like this in a web page index can be dangerous business. For example when term1 is wikipedia and term2 is org, the query will match at least all pages from wikipedia.org. Note that if you search for wikipedia.org in Nutch this is interpreted as an

Re: Lucene performance bottlenecks

2005-12-07 Thread Doug Cutting
Andrzej Bialecki wrote: It's nice to have these couple percent... however, it doesn't solve the main problem; I need 50 or more percent increase... :-) and I suspect this can be achieved only by some radical changes in the way Nutch uses Lucene. It seems the default query structure is too

Re: Lucene performance bottlenecks

2005-12-02 Thread Doug Cutting
Andrzej Bialecki wrote: For a simple TermQuery, if the DF(term) is above 10%, the response time from IndexSearcher.search() is around 400ms (repeatable, after warm-up). For such complex phrase queries the response time is around 1 sec or more (again, after warm-up). Are you specifying

Re: IndexReader locking

2005-11-28 Thread Doug Cutting
IndexReader locks the index while opening it to prohibit an IndexWriter from deleting any of the files in that index until all are opened. Lock files are not stored in the index directory since write access to an index should not be required to lock it while opening an IndexReader. Doug

Re: Memory Usage

2005-11-17 Thread Doug Cutting
Daniel Noll wrote: I actually did throw a lot of terms in, and eventually chose one for the tests because it was the slowest query to complete of them all (hence I figured it was already spending some fairly long time in I/O, and would be penalised the most.) Every other query was around 7ms

Re: Memory Usage

2005-11-16 Thread Doug Cutting
Daniel Noll wrote: Timings were obtained by performing the same search 1,000 times and averaging the total time. This was then performed five times in a row to get the range that's displayed below. Memory usage was obtained using a 20-second sleep after loading the index, and then using the

Re: Filtering on a SpanQuery without losing spans

2005-11-16 Thread Doug Cutting
Greg K wrote: Now, however, I'd like to be able to restrict the search to certain documents in the index, so I don't have to stream through a couple of thousand spans to produce the 10 excerpts on a subset of the documents. I've tried adding a term to the SpanNearQueries that targets a keyword

Re: Memory Usage

2005-11-14 Thread Doug Cutting
Marvin Humphrey wrote: You *can't* set it on the reader end. If you could set it, the reader would get out of sync and break. The value is set per-segment at write time, and the reader has to be able to adapt on the fly. It would actually not be too hard to change things so that there was

Re: Sentence boundary storage

2005-10-30 Thread Doug Cutting
Chris Hostetter wrote: : One thing that I know has bogged me is when matching a phrase where I : would expect mathematical formula (which is just a subphrase). I : would have liked the phrase-query to extend as far as it wishes but not : passed a given token... would this be possible ? :

Re: query across fields?

2005-10-11 Thread Doug Cutting
Marc Hadfield wrote: In the SpanNear (or for that matter PhraseQuery), one can set a slop value where 0 (zero) means one following after the other. How can one differentiate between Terms at the **same** position vs. one after the other? The following queries only match x and y at the same

Re: query across fields?

2005-10-10 Thread Doug Cutting
Marc Hadfield wrote: I actually mention your option in my email: In principle I could store the full text in two fields with the second field containing the types without incrementing the token index. Then, do a SpanQuery for Johnson and name with a distance of 0. The resulting match

Re: query across fields?

2005-10-10 Thread Doug Cutting
Marc Hadfield wrote: I'll give Span Query's a try as they can handle the 0 increment issue. Note that PhraseQuery can now handle this too. Doug

Re: IllegalArgumentException: attempt to access a deleted document

2005-10-06 Thread Doug Cutting
Peter Kim wrote: I noticed one way to get around this is to use IndexReader.isDeleted() to check if it's deleted or not. The problem with that is I only have access to a MultiSearcher in my HitCollector which doesn't give me access to the underlying IndexReader. I don't want to have to open an

Re: Performance Improvments?

2005-10-04 Thread Doug Cutting
Palmer, Andrew MMI Woking wrote: I am looking at changing the value BufferedIndexOutput.BUFFER_SIZE from 1024 to maybe 8192. Has anyone done anything similar and did they get any performance improvements. I doubt this will speed things much. Generally I am looking to reduce the time it

Re: A very technical question.

2005-09-28 Thread Doug Cutting
Dawid Weiss wrote: I have a very technical question. I need to alter document score (or in fact: document boosts) for an existing index, but for each query. In other words, I'd like these to have pseudo-queries of the form: 1. civil war PREFER:shorter 2. civil war PREFER:longer for these two

Re: Is Lucene right for my app?

2005-09-18 Thread Doug Cutting
Jeff Rodenburg wrote: My suggestion to you: pick up a copy of Lucene in Action. [ ...] The authors lurk on this list. They're pretty chatty for lurkers. http://en.wikipedia.org/wiki/Lurker But good advice nonetheless! Cheers, Doug

Re: OutOfMemoryError on addIndexes()

2005-08-18 Thread Doug Cutting
Tony Schwartz wrote: I think you're jumping into the conversation too late. What you have said here does not address the problem at hand. That is, in TermInfosReader, all terms in the segment get loaded into three very large arrays. That's not true. Only 1/128th of the terms are loaded by
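
Note: a self-contained sketch of the sampling idea, not Lucene's TermInfosReader. It assumes a sorted term list (standing in for the on-disk dictionary) from which every 128th term is kept in memory; a lookup binary-searches the sample and then scans at most one interval. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SparseTermIndex {
    private final String[] allTerms;                        // stand-in for the sorted on-disk dictionary
    private final List<String> sampled = new ArrayList<>(); // every interval-th term, held in memory
    private final int interval;

    SparseTermIndex(String[] sortedTerms, int interval) {
        this.allTerms = sortedTerms;
        this.interval = interval;
        for (int i = 0; i < sortedTerms.length; i += interval) {
            sampled.add(sortedTerms[i]);
        }
    }

    int inMemoryTerms() {
        return sampled.size();
    }

    /** Returns the dictionary position of term, or -1 if it is absent. */
    int lookup(String term) {
        // Binary search for the last sampled term that is <= the target.
        int lo = 0, hi = sampled.size() - 1, block = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sampled.get(mid).compareTo(term) <= 0) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (block < 0) {
            return -1;
        }
        // Sequential scan of one interval; in a real index this is the only part that touches disk.
        int start = block * interval;
        int end = Math.min(start + interval, allTerms.length);
        for (int i = start; i < end; i++) {
            if (allTerms[i].equals(term)) {
                return i;
            }
        }
        return -1;
    }

    /** Builds n zero-padded, already sorted demo terms. */
    static String[] demoTerms(int n) {
        String[] terms = new String[n];
        for (int i = 0; i < n; i++) {
            terms[i] = String.format("term%05d", i);
        }
        return terms;
    }

    public static void main(String[] args) {
        SparseTermIndex index = new SparseTermIndex(demoTerms(1000), 128);
        System.out.println(index.inMemoryTerms() + " of 1000 terms held in memory");
        System.out.println(index.lookup("term00300"));
    }
}
```

The trade-off is memory against scan length: a larger interval keeps fewer terms in memory but lengthens the sequential scan per lookup.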

Re: OutOfMemory error when searching

2005-08-18 Thread Doug Cutting
Fredrik wrote: Opening the index with Luke, I can see the following: Number of fields: 17 Number of documents: 1165726 Number of terms: 6721726 The size of the index is approx 5,3 GB. Lucene version is 1.4.3. The index contains Norwegian terms, but lots of inline HTML, etc is probably

Re: Indexing document instances and retrieving instance attributes

2005-08-18 Thread Doug Cutting
Chris D wrote: Well in my case field order is important, but the order of the individual fields isn't. So I can speed up getFields to roughly O(1) by implementing Document as follows. Have you actually found getFields to be a performance bottleneck in your application? I'd be surprised if it

Re: OutOfMemoryError on addIndexes()

2005-08-18 Thread Doug Cutting
Tony Schwartz wrote: What about the TermInfosReader class? It appears to read the entire term set for the segment into 3 arrays. Am I seeing double on this one? p.s. I am looking at the current sources. see TermInfosReader.ensureIndexIsRead(); The index only has 1/128 of the terms, by

Re: Why is Hits.java not Serializable?

2005-08-10 Thread Doug Cutting
Ali Rouhi wrote: I can think of 3 reasons why search methods returning Hits objects are not exposed in Searchable: 1) Someone forgot to declare Hits Serializable 2) There is a fundamental reason the forms of search which return Hits objects cannot be called remotely, some non optimal form of

Re: docMap array in SegmentMergeInfo

2005-07-13 Thread Doug Cutting
Lokesh Bajaj wrote: For a very large index where we might want to delete/replace some documents, this would require a lot of memory (for 100 million documents, this would need 381 MB of memory). Is there any reason why this was implemented this way? In practice this has not been an issue. A
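
Note: the 381 MB figure is consistent with one 4-byte int per document, which is an assumption about the docMap layout rather than something stated in the excerpt. A quick check (class name hypothetical):

```java
class DocMapMemory {
    /** Bytes needed for an int[] with one entry per document (array header ignored). */
    static long bytesFor(long maxDoc) {
        return maxDoc * Integer.BYTES;
    }

    static double mebibytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long bytes = bytesFor(100_000_000L);
        System.out.printf("%d bytes = %.1f MB%n", bytes, mebibytes(bytes));
    }
}
```

100,000,000 x 4 = 400,000,000 bytes, which is about 381.5 when divided by 1024 x 1024.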

Re: Queries boost and scoring problems

2005-06-15 Thread Doug Cutting
The method Similarity.queryNorm() normalizes query term weights. To disable this you could define it to return 1.0 in your own Similarity implementation. http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#queryNorm(float) Doug Robichaud, Jean-Philippe wrote:

Re: Indexing multiple languages

2005-06-03 Thread Doug Cutting
Tansley, Robert wrote: What if we're trying to index multiple languages in the same site? Is it best to have: 1/ one index for all languages 2/ one index for all languages, with an extra language field so searches can be constrained to a particular language 3/ separate indices for each

Re: managing docids for ParallelReader

2005-06-03 Thread Doug Cutting
Sebastian Marius Kirsch wrote: I took up your suggestion to use a ParallelReader for adding more fields to existing documents. I now have two indexes with the same number of documents, but different fields. Does search work using the ParallelReader? One field is duplicated (the id field.)

Re: Preserving original HTML file offsets for highlighting, need HTMLTokenizer?

2005-06-03 Thread Doug Cutting
Fred Toth wrote: I'm thinking we need something like HTMLTokenizer which bridges the gap between StandardAnalyzer and an external HTML parser. Since so many of us are dealing with HTML, I would think this would be generally useful for many problems. It could work this way: Given this input:

Re: managing docids for ParallelReader (was Augmenting an existing index)

2005-05-31 Thread Doug Cutting
Matt Quail wrote: I have a similar problem, for which ParallelReader looks like a good solution -- except for the problem of creating a set of indices with matching document numbers. I have wondered about this as well. Are there any *sure fire* ways of creating (and updating) two indices so

Re: Distribution Strategies?

2005-05-10 Thread Doug Cutting
Steven J. Owens wrote: A friend just asked me for advice about synchronizing lucene indexes across a very large number of servers. I haven't really delved that deeply into this sort of stuff, but I've seen a variety of comments here about similar topics. Are there any well-known

Re: Indexing in multi-threaded environment

2005-05-10 Thread Doug Cutting
Chris Lamprecht wrote: I've done exactly what you describe, using N threads where N is the number of processors on the machine, plus one more thread that writes to the file system index (since that is I/O-bound anyway). Since most of the CPU time is tokenizing/stemming/etc, the method works well.
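
Note: the layout described here (N analysis threads feeding one writer thread) can be sketched with the JDK's concurrency utilities. No Lucene calls are used; lower-casing stands in for tokenizing and stemming, and appending to a list stands in for writing to the file system index. Every name below is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class IndexingPipeline {
    private static final String END = "\u0000END"; // sentinel telling the writer to stop

    /** Analyzes docs on `workers` threads; one writer thread consumes the results. */
    static List<String> run(List<String> docs, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> written = new ArrayList<>();

        // Single writer: the only thread that touches the (simulated) index.
        Thread writer = new Thread(() -> {
            try {
                for (String item = queue.take(); !item.equals(END); item = queue.take()) {
                    written.add(item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        // CPU-bound analysis runs in parallel on the worker pool.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (String doc : docs) {
            pool.execute(() -> queue.add(doc.toLowerCase(Locale.ROOT)));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
            queue.add(END);
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return written;
    }

    public static void main(String[] args) {
        int workers = Runtime.getRuntime().availableProcessors();
        System.out.println(run(List.of("Alpha", "Beta", "Gamma"), workers).size());
    }
}
```

The order in which items reach the writer is not deterministic, so anything that depends on document order needs extra coordination.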

Re: Results ranking on filtered multi-field query

2005-05-02 Thread Doug Cutting
Chuck Williams wrote: I found this to be a problem as well and created alternative classes, DistributedMultiFieldQueryParser and MaxDisjunctionQuery, which are available here: http://issues.apache.org/bugzilla/show_bug.cgi?id=32674 You might check these out and see if they provide the ranking

Re: Hungarian notation analyzer and phrase queries

2005-04-14 Thread Doug Cutting
Paul Smith wrote: So it sounds like there isn't a perfect solution, but I think the best tradeoff for me is to put them all in the same position unless anyone has more input on the subject? If they're all at the same position you can still use slop to match the phrase. So if 'power', 'query'

Re: Reverting QueryParser ?

2005-04-14 Thread Doug Cutting
Paul Libbrecht wrote: I am currently evaluating the need for an elaborate query data-structure (to be exchanged over XML-RPC) as opposed to working with plain strings. I'd opt for both. For example: search boolean-query required query-parser analyzer=...java based

Re: Update performance/indexwriter.delete()?

2005-04-14 Thread Doug Cutting
Roy Klein wrote: I think this is a better way of asking my original questions: Why was this designed this way? In order to optimize updates. Can it be changed to optimize updates? Updates are fastest when additions and deletions are separately batched. That is the design. Doug

Re: Seeking advice on index parameter settings for large index

2005-03-30 Thread Doug Cutting
Chuck Williams wrote: index.setMaxBufferedDocs(10); // Buffer 10 documents at a time in memory (they could be big) You might use a larger value here for the index with the small documents. I've successfully used values as high as 1000 when indexing documents that average a few

Re: pre computing possible search results narrowing and hit counts on those

2005-03-30 Thread Doug Cutting
Antony Sequeira wrote: A user does a search for say condominium, and I show him the 50,000 properties that meet that description. I need two other pieces of information for display - 1. I want to show a select box on the UI, which contains all the cities that appear in those 50,000 documents 2.

Re: searcher question

2005-03-30 Thread Doug Cutting
Omar Didi wrote: I have a large index (100GB) and when I run the following code: String indexLocation = servlet.getServletContext().getInitParameter("com.lucene.index"); logger.log(Level.INFO, "got the index location from: " + indexLocation); searcher = new IndexSearcher(indexLocation);

Re: Problem with memory utilisation during Lucene search

2005-03-23 Thread Doug Cutting
Daniel Naber wrote: If that doesn't help: are you sure you're using Lucene the right way, e.g. having only one IndexReader/Searcher and using it for all searches? That's my first suggestion too. Memory consumption should not primarily grow per query, rather per IndexSearcher. You're seeing

Re: NumberTools

2005-03-22 Thread Doug Cutting
Chuck Williams wrote: If there is going to be any generalization to built-in sorting representations, I'd like to suggest two things be included: 1. Fix issue 34028 (delete the one word final) Done. 2. Include a provision for query-time parameters Can you provide a proposal? Doug

Re: QueryParser refactoring

2005-03-08 Thread Doug Cutting
sergiu gordea wrote: So .. here is an example of how I parse a simple query string provided by a user ... the user checks a few flags and writes test ko AND NOT bo and the resulting query.toString() is saved in the database: +(+(subject:test description:test keywordsTerms:test koProperties:test