Re: Cache index in RAMDirectory and evict

2006-01-12 Thread Kan Deng
Thanks, Otis. I also appreciate your wonderful book, Lucene in Action. The book is so well written that it makes me very curious about the low-level design of the system, in addition to how to use it. Back to the cache problem, I agree that the native OS file system can do most of the job for me.

Re: BTree

2006-01-12 Thread Kan Deng
I have a similar question about the internal indexing data structure. According to Paolo Ferragina of Univ Pisa, B+tree with cluster is best for sorting. However, referring to the implementation of org.apache.lucene.search.IndexSearcher, it looks like the impl doesn't use a B+tree, never mention

Re: How to check, whether Index is optimized or not?

2006-01-12 Thread Dave Kor
Do we need to check if any documents are marked for deletion? On 1/12/06, Otis Gospodnetic [EMAIL PROTECTED] wrote: I don't think we have a public API for that, but the index is considered optimized when it contains only a single segment. Then, we could add the following to IndexReader:

Re: Generating phrase queries from term queries

2006-01-12 Thread Eric Jain
Chris Hostetter wrote: (Assuming *I* understand it) what he's talking about is the ability for his search GUI to display suggested phrase searches you may want to try which consist of the words you just typed in grouped into phrases. Yes, that's precisely what I am talking about. Sorry for

Re: How to check, whether Index is optimized or not?

2006-01-12 Thread Erik Hatcher
A fully optimized index has only a single segment. If you're using the non-compound index format, you will be able to tell by looking at the segments file in the index, where only one segment would be listed. There are certainly programmatic ways of telling too, but I don't have that

Re: How to check, whether Index is optimized or not?

2006-01-12 Thread Otis Gospodnetic
I don't think so. It's still a single segment. Close the reader, and you still have only one segment. You only have gaps from deleted docs, but I think that doesn't make the index unoptimized, even though optimizing such an index will remove the gaps. Otis

AW: Boolean Query

2006-01-12 Thread Klaus
Hi, I have tried to study the Lucene scoring in the default similarity. Can anyone explain to me how this similarity was designed? I have read a lot of IR literature, but I have never seen an equation like the one used in Lucene. Why is this better than the normal cosine measure? Thanks, Klaus

Re: Cache index in RAMDirectory and evict

2006-01-12 Thread John Haxby
Kan Deng wrote: 1. Performance. Since all the cached disk data resides outside JVM heap space, the access efficiency from Java object to those cached data cannot be too high. True, but you need to compare the relative speeds. If data has to be pulled from a file, then you're talking

about the wordnet program.

2006-01-12 Thread jason
Hi, I am trying to use the Lucene WordNet program for my application. However, I ran into some problems. When I incorporate these files, Syns2Index.java, SynLookup.java, and SynExpand.java, I find some variables are not defined. For instance, in Syns2Index.java, writer.setMergeFactor(

Re: AW: Boolean Query

2006-01-12 Thread Doug Cutting
Klaus wrote: I have tried to study the Lucene scoring in the default similarity. Can anyone explain to me how this similarity was designed? I have read a lot of IR literature, but I have never seen an equation like the one used in Lucene. Why is this better than the normal cosine measure? It

Re: Cache index in RAMDirectory and evict

2006-01-12 Thread Kan Deng
John, thanks a lot for your excellent reply. In particular, I find this sentence very convincing: "Well, you _can_ be a lot better since you know what you're doing. You can also be a _lot_ worse when you get it wrong." With such a high risk, I should probably try other tricks to improve the

Re: BTree

2006-01-12 Thread Daniel Naber
On Thursday 12 January 2006 05:47, shailesh kumar wrote: I had looked at the document you listed and also used a hex editor to look at the segment files. That is how I came to know about the lexicographic sorting, but I was not sure whether a BTree is used. If I understand correctly a

Re: BTree

2006-01-12 Thread Kan Deng
After reading the source code, I think Lucene doesn't use a B+tree or any other tree structure for its index. A possible reason is that, since Lucene aims at handling gigabytes of data, it has to be cautious about the index file's size. A B+tree may grow rapidly when the number of leaves grows. Hence,

Re: about the wordnet program.

2006-01-12 Thread Daniel Naber
On Thursday 12 January 2006 16:25, jason wrote: When i incorporate these files, Syns2Index.java, SynLookup.java, and SynExpand.java, I find some variables are not defined. It depends on Lucene in SVN; some things in the Lucene API have changed since Lucene 1.4. So you need to get the latest

Re: Generating phrase queries from term queries

2006-01-12 Thread Chris Hostetter
: (Assuming *I* understand it) what he's talking about is the ability for : his search GUI to display suggested phrase searches you may want to try : which consist of the words you just typed in grouped into phrases. : : Yes, that's precisely what I am talking about. Sorry for being unclear.

Re: BTree

2006-01-12 Thread Doug Cutting
B-Trees are best for random, incremental updates. They require log_b(N) disk accesses for inserts, deletes and accesses, where b is the number of entries per page, and N is the total number of entries in the tree. But that's too slow for text indexing. Rather Lucene uses a combination of
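To get a feel for the log_b(N) figure quoted above, here is a minimal Java sketch that computes ceil(log_b(N)) with integer arithmetic. The class name BTreeAccessEstimate and the method name levels are made up for illustration and are not part of Lucene; the result is only a rough estimate of the number of page levels, not a measurement of any real B-Tree.

```java
// Hypothetical helper, not part of Lucene: estimates ceil(log_b(N)).
final class BTreeAccessEstimate {

    /**
     * Returns the smallest h such that entriesPerPage^h >= totalEntries,
     * i.e. ceil(log_b(N)) for b = entriesPerPage and N = totalEntries.
     */
    static int levels(long entriesPerPage, long totalEntries) {
        if (entriesPerPage < 2 || totalEntries < 1) {
            throw new IllegalArgumentException("need b >= 2 and N >= 1");
        }
        int h = 0;
        long capacity = 1; // entries addressable with h levels
        while (capacity < totalEntries) {
            if (capacity > Long.MAX_VALUE / entriesPerPage) {
                // One more level would exceed the long range, so it covers N.
                return h + 1;
            }
            capacity *= entriesPerPage;
            h++;
        }
        return h;
    }

    public static void main(String[] args) {
        // 100 entries per page, one million entries in total
        System.out.println(levels(100, 1_000_000)); // prints 3
    }
}
```

With 100 entries per page and a million entries the estimate is three levels, so every random insert or lookup would touch about three pages if none of them were cached.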

Re: BTree

2006-01-12 Thread Kan Deng
Many thanks, Doug. A quick question: which class implements the following logic? org.apache.lucene.search.IndexSearcher? "For access, Lucene is equivalent to a B-Tree with all but the leaves cached in memory, so that accesses require only a single disk access." Thanks, Kan --- Doug
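The access pattern in the quoted sentence can be illustrated with a small Java sketch. This is not Lucene's code: the class SparseTermIndex, its fields, and the interval parameter are hypothetical, and an in-memory array stands in for the sorted term file on disk. The idea shown is a sparse in-memory index over a sorted term list, so that a lookup needs one binary search in memory followed by one block read.

```java
import java.util.Arrays;

// Hypothetical illustration, not Lucene's implementation.
final class SparseTermIndex {
    private final String[] sortedTerms; // stands in for the sorted on-disk term file
    private final int interval;         // one index entry per `interval` terms
    private final String[] indexTerms;  // every interval-th term, held in memory
    int blockReads = 0;                 // counts simulated disk accesses

    SparseTermIndex(String[] sortedTerms, int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        this.sortedTerms = sortedTerms;
        this.interval = interval;
        int n = (sortedTerms.length + interval - 1) / interval;
        this.indexTerms = new String[n];
        for (int i = 0; i < n; i++) {
            indexTerms[i] = sortedTerms[i * interval];
        }
    }

    /** Returns the position of term in the sorted file, or -1 if absent. */
    int lookup(String term) {
        // In-memory step: find the last index entry <= term.
        int pos = Arrays.binarySearch(indexTerms, term);
        int block = pos >= 0 ? pos : -pos - 2;
        if (block < 0) {
            return -1; // term sorts before the first indexed term
        }
        // "Disk" step: scan exactly one block of at most `interval` terms.
        blockReads++;
        int start = block * interval;
        int end = Math.min(start + interval, sortedTerms.length);
        for (int i = start; i < end; i++) {
            int c = sortedTerms[i].compareTo(term);
            if (c == 0) return i;
            if (c > 0) break; // passed the place where term would be
        }
        return -1;
    }
}
```

Each successful lookup increments blockReads once, which mirrors the "single disk access" in the quote; the trade-off is that the in-memory index grows with the number of terms divided by the interval.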

Re: BTree

2006-01-12 Thread Yonik Seeley
On 1/12/06, Kan Deng [EMAIL PROTECTED] wrote: Many thanks, Doug. A quick question, which class implements the following logic? It looks to me like org.apache.lucene.index.TermInfosReader -Yonik

Re: BTree

2006-01-12 Thread Kan Deng
Thanks, Yonik. TermInfosReader is exactly the class I am looking for. Kan --- Yonik Seeley [EMAIL PROTECTED] wrote: On 1/12/06, Kan Deng [EMAIL PROTECTED] wrote: Many thanks, Doug. A quick question, which class implements the following logic? It looks to me like