Thanks, Otis.
Also appreciate your wonderful book, Lucene in
Action. The book is so well written that it makes me
very curious about the low level design of the system,
in addition to how to use it.
Back to the cache problem: I agree that the native OS
file system can do most of the job for me.
I have a similar question about the internal indexing
data structure.
According to Paolo Ferragina of Univ. Pisa, a B+tree with
clustering is best for sorting. However, referring to the
implementation of
org.apache.lucene.search.IndexSearcher, it looks like
the implementation doesn't use a B+tree, never mention
Do we need to check if any documents are marked for deletion?
On 1/12/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:
I don't think we have a public API for that, but the index is considered
optimized when it contains only a single segment.
Then, we could add the following to IndexReader:
Chris Hostetter wrote:
(Assuming *I* understand it) what he's talking about is the ability for
his search GUI to display suggested phrase searches you may want to try
which consist of the words you just typed in grouped into phrases.
Yes, that's precisely what I am talking about. Sorry for
A fully optimized index has only a single segment. If you're using
the non-compound index format you will be able to tell by looking at
the segments file in the index where only one segment would be
listed. There are certainly programmatic ways of telling too, but I
don't have that
I don't think so. It's still a single segment. Close the reader, and you
still have only one segment. You only have gaps from deleted docs, but I think
that doesn't make the index unoptimized, even though optimizing such an index
will remove the gaps.
Otis
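The distinction drawn above (one segment means optimized, whether or not deleted documents have left gaps) can be illustrated with a small toy model. This is only a sketch: ToySegment and ToyIndex are hypothetical names invented for illustration and are not part of Lucene's API, which, as noted in the thread, may not expose this check publicly.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical classes, NOT Lucene's API.
final class ToySegment {
    final int maxDoc;      // document slots in the segment, including deleted ones
    final int deletedDocs; // slots marked as deleted (the "gaps")

    ToySegment(int maxDoc, int deletedDocs) {
        this.maxDoc = maxDoc;
        this.deletedDocs = deletedDocs;
    }

    int liveDocs() {
        return maxDoc - deletedDocs;
    }
}

final class ToyIndex {
    private final List<ToySegment> segments = new ArrayList<>();

    void addSegment(int maxDoc, int deletedDocs) {
        segments.add(new ToySegment(maxDoc, deletedDocs));
    }

    // "Optimized" in the sense used in the thread: exactly one segment.
    boolean isOptimized() {
        return segments.size() == 1;
    }

    // Deletions are tracked separately and do not affect isOptimized().
    boolean hasDeletions() {
        for (ToySegment s : segments) {
            if (s.deletedDocs > 0) {
                return true;
            }
        }
        return false;
    }

    // Total document slots across all segments, gaps included.
    int maxDoc() {
        int total = 0;
        for (ToySegment s : segments) {
            total += s.maxDoc;
        }
        return total;
    }

    // Merge everything into one segment; deleted slots are dropped on the way.
    void optimize() {
        int live = 0;
        for (ToySegment s : segments) {
            live += s.liveDocs();
        }
        segments.clear();
        segments.add(new ToySegment(live, 0));
    }
}
```

In this model a single segment holding 10 slots with 3 deletions still reports isOptimized() as true; calling optimize() on it only removes the gaps, leaving 7 slots.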
- Original Message
Hi,
I have tried to study the Lucene scoring in the default similarity. Can
anyone explain to me how this similarity was designed? I have read a lot of IR
literature, but I have never seen an equation like the one used in Lucene.
Why is this better than the normal cosine measure?
Thanks,
Klaus
Kan Deng wrote:
1. Performance.
Since all the cached disk data resides outside the JVM
heap space, access from Java objects to
that cached data cannot be very efficient.
True, but you need to compare the relative speeds. If data has to be
pulled from a file, then you're talking
Hi,
I am trying to use the Lucene WordNet program for my application. However, I
ran into some problems.
When I incorporate these files, Syns2Index.java, SynLookup.java, and
SynExpand.java, I find some variables are not defined.
For instance, in Syns2Index.java,
writer.setMergeFactor(
Klaus wrote:
I have tried to study the Lucene scoring in the default similarity. Can
anyone explain to me how this similarity was designed? I have read a lot of IR
literature, but I have never seen an equation like the one used in Lucene.
Why is this better than the normal cosine measure?
It
John, thanks a lot for your excellent reply.
Especially, I think this sentence is very convincing,
Well, you _can_ be a lot better since you know what
you're
doing. You can also be a _lot_ worse when you get it
wrong.
With such a high risk, probably I should try other
tricks to improve the
On Donnerstag 12 Januar 2006 05:47, shailesh kumar wrote:
I had looked at the document you had listed as well as used a Hex
editor to look at the segment files. That is how I came to know about
the lexicographic sorting. But was not sure if BTree is used. If I
understand correctly a
After reading the source code, I think Lucene
doesn't use a B+tree or other tree structure for its index.
A possible reason is that, since Lucene aims at
handling gigabytes, it has to be cautious about the
index file's size. A B+tree may grow rapidly when the
number of leaves grows. Hence,
On Donnerstag 12 Januar 2006 16:25, jason wrote:
When i incorporate these files, Syns2Index.java, SynLookup.java, and
SynExpand.java, I find some variables are not defined.
It depends on Lucene in SVN; some things in the Lucene API have changed
since Lucene 1.4. So you need to get the latest
: (Assuming *I* understand it) what he's talking about is the ability for
: his search GUI to display suggested phrase searches you may want to try
: which consist of the words you just typed in grouped into phrases.
:
: Yes, that's precisely what I am talking about. Sorry for being unclear.
B-Trees are best for random, incremental updates. They require
log_b(N) disk accesses for inserts, deletes and accesses, where b is the
number of entries per page, and N is the total number of entries in the
tree. But that's too slow for text indexing. Rather Lucene uses a
combination of
Many thanks, Doug.
A quick question, which class implements the following
logic?
org.apache.lucene.search.IndexSearcher?
For access, Lucene is equivalent to a B-Tree
with all but the leaves cached in memory, so
that accesses require only a single disk access.
thanks,
Kan
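The quoted idea (a B-Tree with all but the leaves cached in memory, so a lookup costs a single disk access) can be sketched as a sorted term list plus a sparse in-memory index that holds every interval-th term. This is a minimal illustration under stated assumptions, not Lucene's code: SparseTermIndex, its fields, and the blockReads counter are hypothetical, and an in-memory array stands in for the on-disk term file.

```java
import java.util.Arrays;

// Hypothetical sketch, NOT Lucene code: a sorted term "file" with a sparse
// in-memory index over it.
final class SparseTermIndex {
    private final String[] sortedTerms; // stands in for the sorted on-disk term file
    private final String[] indexTerms;  // every interval-th term, kept in memory
    private final int interval;
    int blockReads = 0;                 // counts simulated disk accesses

    SparseTermIndex(String[] sortedTerms, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.sortedTerms = sortedTerms.clone();
        this.interval = interval;
        int n = (sortedTerms.length + interval - 1) / interval;
        this.indexTerms = new String[n];
        for (int i = 0; i < n; i++) {
            indexTerms[i] = this.sortedTerms[i * interval];
        }
    }

    // Returns the position of the term in the sorted file, or -1 if absent.
    int lookup(String term) {
        // Step 1 (memory only): find the last index term <= the query term.
        int pos = Arrays.binarySearch(indexTerms, term);
        int block = pos >= 0 ? pos : -pos - 2; // insertion point minus one
        if (block < 0) {
            return -1; // sorts before the first term; no disk access needed
        }
        // Step 2 (one simulated disk access): scan a single block of terms.
        blockReads++;
        int start = block * interval;
        int end = Math.min(start + interval, sortedTerms.length);
        for (int i = start; i < end; i++) {
            int cmp = sortedTerms[i].compareTo(term);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break; // passed the place where the term would be
            }
        }
        return -1;
    }
}
```

The design trade-off is the interval: a smaller interval means a larger in-memory index but a shorter scan per lookup, while a larger interval saves memory at the cost of a longer sequential scan inside the one block that is read.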
--- Doug
On 1/12/06, Kan Deng [EMAIL PROTECTED] wrote:
Many thanks, Doug.
A quick question, which class implements the following
logic?
It looks to me like org.apache.lucene.index.TermInfosReader
-Yonik
Thanks, Yonik.
TermInfosReader is exactly the class I am looking for.
Kan
--- Yonik Seeley [EMAIL PROTECTED] wrote:
On 1/12/06, Kan Deng [EMAIL PROTECTED] wrote:
Many thanks, Doug.
A quick question, which class implements the
following
logic?
It looks to me like
19 matches