Not that I can think of, other than the in-memory term index (.tii) and the few kilobytes of
But if you have any cached field data or norms arrays, those could be huge.
Would be interested in hearing from others on this topic as well.
Jian
On 5/29/08, Alex [EMAIL PROTECTED] wrote:
Hi, Lucene gurus,
I have a question regarding RAMDirectory usage. Can the IndexWriter keep
adding documents to the index while an IndexReader is open on this
RAMDirectory and searches are going on?
I know that in the FSDirectory case, the IndexWriter can add documents to the
index while an IndexReader
I have seen two different designs for incremental index updates.
1) Have two copies of the index, A and B. The incremental updates happen on
the A index while the B index is being used for search. Then, hot-swap the
two indexes, bring the B index up to date, and perform incremental updates
thereafter. In this
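The hot-swap in design (1) can be sketched in plain Java. This is a minimal sketch, not real Lucene code: the SearchIndex class here is a hypothetical stand-in for a Lucene index over one directory, and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of design (1): searches read the "live" index while updates
// accumulate in the "staging" index; swap() exchanges the two atomically.
public class HotSwapIndex {
    // Hypothetical stand-in for a Lucene index over one directory.
    static class SearchIndex {
        final List<String> docs = new ArrayList<>();
        void add(String doc) { docs.add(doc); }
        boolean contains(String term) {
            for (String d : docs) if (d.contains(term)) return true;
            return false;
        }
    }

    private final AtomicReference<SearchIndex> live =
            new AtomicReference<>(new SearchIndex());
    private SearchIndex staging = new SearchIndex();

    // Incremental updates go to the staging copy only.
    public void addDocument(String doc) { staging.add(doc); }

    // Searches always read the live copy.
    public boolean search(String term) { return live.get().contains(term); }

    // Atomically publish the staging index; the old live index becomes
    // the new staging copy and must then be brought up to date.
    public void swap() {
        staging = live.getAndSet(staging);
    }
}
```

Searchers never see a half-updated index: they either get the old live copy or the new one, which is the point of the swap.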
For reading word document as text, you can try AntiWord.
I have written a simplified Lucene variant that does max-words matching.
For example, if you are searching for aa, bb, cc, then a document that
contains all the words (aa, bb, cc) will definitely be ranked higher than
documents containing either aa,
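A minimal sketch of that max-words-match ranking in plain Java (no Lucene). It assumes "max words match" means ranking documents by how many distinct query terms they contain, which is my reading of the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxWordsMatch {
    // Count how many distinct query terms appear in the document.
    static int matchedTerms(Set<String> query, String doc) {
        Set<String> docTerms = new HashSet<>(Arrays.asList(doc.split("\\s+")));
        int n = 0;
        for (String q : query) if (docTerms.contains(q)) n++;
        return n;
    }

    // Rank documents so those matching more distinct query terms come first;
    // a document containing all of (aa, bb, cc) beats one containing only aa.
    static List<String> rank(Set<String> query, List<String> docs) {
        List<String> ranked = new ArrayList<>(docs);
        ranked.sort((a, b) -> matchedTerms(query, b) - matchedTerms(query, a));
        return ranked;
    }
}
```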
Hi, Karl,
There have been quite a few discussions regarding the "too many open files"
problem. From my understanding, it is due to Lucene trying to open multiple
segments at the same time (during search/segment merging), and the
operating system won't allow opening that many file handles.
If
Hi,
In case you are using StandardAnalyzer, there is a stop word list. I have
used StandardAnalyzer.STOP_WORDS, which is a String[].
Cheers,
Jian
On 10/31/05, Rob Young [EMAIL PROTECTED] wrote:
Hi,
Is there an easy way to list stop words that were removed from a string?
I'm using the
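A sketch of listing the removed stop words. In the Lucene 1.x API, StandardAnalyzer.STOP_WORDS is a String[], as mentioned above; the tiny stop set here is just an illustrative stand-in for it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RemovedStopWords {
    // Stand-in for StandardAnalyzer.STOP_WORDS (a String[] in Lucene 1.x).
    static final String[] STOP_WORDS = { "a", "an", "and", "the", "of" };

    // Return the stop words that would be removed from the input text,
    // in the order they occur.
    static List<String> removed(String text) {
        Set<String> stops = new HashSet<>(Arrays.asList(STOP_WORDS));
        List<String> result = new ArrayList<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (stops.contains(tok)) result.add(tok);
        }
        return result;
    }
}
```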
Hi,
It seems what you want to achieve could be implemented using the Cover
Density algorithm. I am not sure whether any existing query class in the
Lucene distribution does this already. But in case not, this is what I am
thinking about:
Make a custom query class, called CoverDensityQuery, which is
Hi,
Also, I think you could try increasing the indexInterval. It is set to 128,
but making it larger makes the .tii files smaller. Since the .tii files are
loaded into memory as a whole, your memory usage might be smaller as well.
However, this change might affect your search speed, so be careful
Hi, Trond,
It should be no problem for Lucene to handle 6 million documents.
For your query, it seems you want to do a disjunctive (or'ed) query over
multiple terms, 10 terms or 1 term for example. The worst case I can
think of is, you can very easily write your own query class to handle
Hi, Koji,
I think you are right, the max num of documents should be Integer.MAX_VALUE.
Some more points below:
1) I double-checked the Lucene documentation. The file format section
mentions that SegSize is a UInt32. I don't think this is accurate, as a
UInt32 can hold values up to around 4 billion, but
Well, certainly you can serialize it into a byte stream and encode it using
base64.
Jian
On 9/20/05, Mordo, Aviran (EXP N-NANNATEK) [EMAIL PROTECTED] wrote:
I can't think of a way you can use serialization, since Lucene only
works with strings.
-Original Message-
From: Tricia
Hi,
I am playing with Lucene source code and have this somewhat stupid question,
so please bear with me ;-)
Basically, I want to implement a custom ranking algorithm. That is,
iterating through the documents that contain all the search keywords and,
for each document, retrieving its inverted
Hi,
I think Lucene expands a prefix query into sub-queries: searching for a
prefix results in a search over all the terms that begin with that prefix.
For a postfix (suffix) match, I think you need to do more work than relying
on Lucene's query parser.
You can iterate over the
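The "iterate over the terms" idea for suffix matching can be sketched like this. A plain sorted list stands in for Lucene's term dictionary (a TermEnum walk in the 1.x API); the point is simply collecting every indexed term that ends with the given suffix.

```java
import java.util.ArrayList;
import java.util.List;

public class SuffixMatch {
    // Collect all terms ending with the given suffix. In Lucene this
    // would be a full walk over the term dictionary (TermEnum), since
    // the dictionary is sorted by prefix, not by suffix.
    static List<String> termsEndingWith(List<String> terms, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String t : terms) {
            if (t.endsWith(suffix)) hits.add(t);
        }
        return hits;
    }
}
```

The matched terms would then be or'ed together into one disjunctive query, the same way a prefix query is expanded.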
Hi,
It seems to me that in theory, Lucene storage code could use true UTF-8 to
store terms. Maybe it is just a legacy issue that the modified UTF-8 is
used?
Cheers,
Jian
On 8/26/05, Marvin Humphrey [EMAIL PROTECTED] wrote:
Greets,
[crossposted to java-user@lucene.apache.org and [EMAIL
Hi,
It seems this problem only happens when the index files get really large.
Could it be because Java has trouble handling very large files on Windows
machines (I guess there is a max file size on Windows)?
In Lucene, I think there is a maxDoc kind of parameter that you can use to
specify, when
Hi, Erik,
Some time ago I played with the Lucene 1.2 source code and made some
modifications to it, trying to add my own ranking algorithm. I am not sure
if, license-wise, it is permissible to modify the earlier source code, or
whether it is allowed to publish the modified version or the description
Hi,
I don't think it does so by default. But you can certainly serialize
the Java object and use base64 to encode it into a text string; then
you can store it as a field.
Cheers,
Jian
On 8/25/05, Kevin L. Cobb [EMAIL PROTECTED] wrote:
I just had a thought this morning. Does Lucene have the
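That serialize-then-base64 round trip looks like this in plain Java. Note java.util.Base64 is JDK 8+; on the JVMs of that era you would need a third-party base64 codec instead, so this is a sketch of the idea rather than period-accurate code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;

public class ObjectField {
    // Serialize any Serializable object into a base64 text string,
    // suitable for storing as a Lucene stored field.
    static String toField(Serializable obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            return Base64.getEncoder().encodeToString(bytes.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode the field text back into the original object.
    static Object fromField(String field) {
        byte[] raw = Base64.getDecoder().decode(field);
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```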
Hi,
I am also interested in that. I haven't used Derby before, but it
seems to be the Java database of choice, as it is open source and a full
relational database.
I plan to learn the simple usage of Derby and then think about
integrating Derby with Lucene.
Maybe we should post our progress for the
I asked the kind people on derby-users, and they say there is no
solution for this yet.
I guess we can ask the people on the -developer list
On 8/13/05, jian chen [EMAIL PROTECTED] wrote:
Hi,
I am also interested in that. I haven't used Derby before, but it
seems the java
Well, I think the good practice is to decouple the backend from the
front end as much as possible. You might have different versions of
Java running on each end, and there might also be code compatibility
issues between the different versions.
Jian
On 8/10/05, Andrew Boyd [EMAIL PROTECTED] wrote:
Hi, Dan,
I think the problem you mentioned is one that has been discussed a
lot of times on this mailing list.
The bottom line is that you'd better use the compound file format to store
the indexes. I am not sure Lucene 1.3 has that available, but, if
possible, can you upgrade to Lucene 1.4.3?
Cheers,
able to manually edit with a hex editor.
Otis
--- jian chen [EMAIL PROTECTED] wrote:
Hi,
I know Lucene does not have transaction support at this stage.
However, I want to know what will happen if there is an operating
system crash during the indexing process, will the Lucene
, they are irrelevant.
Otis
P.S.
Did you ask about locking in Lucene the other day?
--- jian chen [EMAIL PROTECTED] wrote:
Hi, Otis,
Thanks for your email. As this is very important for using Lucene in
our production system, I looked at the code to try to understand.
Here
is my observation
Yeah, an RDBMS makes sense. In this case, would it be better to simply
store those in a relational database and just use Lucene to do the
indexing for the text?
Cheers,
Jian
On 7/7/05, Leos Literak [EMAIL PROTECTED] wrote:
I know the answer, but just out of curiosity:
have you guys ever thought
Well, I guess Lucene's Span query uses the Cover Density based
(proximity) model. However, it is within the TF*IDF framework
as well.
Jian
On 7/4/05, Dave Kor [EMAIL PROTECTED] wrote:
Quoting [EMAIL PROTECTED]:
Hi everybody,
which kind of retrieval model is lucene using? Is
to retrieve the original document sometimes. I did not quite understand
your second suggestion.
Can you please help me understand better, a pointer to some web resource will
also help.
jian chen [EMAIL PROTECTED] wrote:
Hi,
Depending on the operating system, there might be a hard limit
Hi,
I am looking at and trying to understand more about Lucene's
reader/writer synchronization. Does anyone know when the commit.lock
is released? I could not find it anywhere in the source code.
I did see the write.lock is released in IndexWriter.close().
Thanks,
Hi, Naimdjon,
I have some suggestions as well along the lines of Mark Harwood.
As an example, suppose for each hotel room there is a description, and
you want the user to do free text search on the description field.
You could do the following:
1) store hotel room reservation info as rows in
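The suggested split (reservation rows in the database, free-text descriptions in Lucene) can be sketched with plain maps standing in for both sides. The room ids are what tie a search hit back to its reservation row; all class and method names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HotelSearch {
    // Stand-in for the RDBMS: room id -> reservation row.
    static final Map<Integer, String> rows = new HashMap<>();
    // Stand-in for the Lucene index: room id -> description text.
    static final Map<Integer, String> descriptions = new HashMap<>();

    static void addRoom(int id, String row, String description) {
        rows.put(id, row);
        descriptions.put(id, description);
    }

    // Free-text search over the descriptions yields room ids, which are
    // then used to fetch the full reservation rows from the database.
    static List<String> search(String term) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<Integer, String> e : descriptions.entrySet()) {
            if (e.getValue().contains(term)) {
                result.add(rows.get(e.getKey()));
            }
        }
        return result;
    }
}
```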
Hi,
I would use a pure span or cover-density based ranking algorithm that
does not take document length into consideration (tweaking whatever is
currently in the standard Lucene distribution?).
For example, searching for the keywords beautiful house, span/cover
ranking will treat a long document and a
Hi,
Depending on the operating system, there might be a hard limit on the
number of files in one directory (windoze versions). Even with
operating systems that don't have a hard limit, it is still better not
to put too many files in one directory (linux).
Typically, the file system won't be
Hi,
Recently I looked at the locking mechanism of Lucene. If I am correct,
the process of grabbing the lock file will time out by default after
10 seconds. When the process times out, it will print out
an IOException.
The Lucene locking mechanism is not limited to threads within the same JVM. It
Hi,
I haven't heard anything back. Probably this email got lost along the way
or something.
Anyway, could anyone enlighten me on this?
Thanks,
Jian
-- Forwarded message --
From: jian chen [EMAIL PROTECTED]
Date: Jun 26, 2005 12:59 PM
Subject: when is the commit.lock released
Hi,
I am looking at and trying to understand more about Lucene's
reader/writer synchronization. Does anyone know when the commit.lock
is released? I could not find it anywhere in the source code.
I did see the write.lock is released in IndexWriter.close().
Thanks,
Jian
Hi,
I think a Span query in general has to do more work than a simple Phrase
query. A Phrase query, in its simplest form, just tries to find all the
terms adjacent to each other. Meanwhile, the terms in a Span query do not
necessarily have to be adjacent to each other; there can be other words in
between. Therefore,
Hi,
I have a stupid question regarding the transient nature of the document ids.
As I understand it, documents obtain new doc ids during an index
merge. Suppose you do a search and get the Hits object. When you
iterate through the documents by id, an index merge happens. How does the
merge
Hi,
You may look at this website
http://www.zilverline.org
Cheers,
Jian
On 6/21/05, Markus Atteneder [EMAIL PROTECTED] wrote:
I am looking for a search engine for our intranet, and so I am dealing with
Lucene. I have read the FAQ and some postings and have gotten some first
experience with it, and now I have
Hi,
Interesting topic. I thought about this as well. I wanted to index
Chinese text with English, i.e., I want to treat the English text
inside Chinese text as English tokens rather than Chinese text tokens.
Right now I think maybe I have to write a special analyzer that takes
the text input,
characters into separate tokens also.
Erik
On May 31, 2005, at 5:49 PM, jian chen wrote:
Hi,
Interesting topic. I thought about this as well. I wanted to index
Chinese text with English, i.e., I want to treat the English text
inside Chinese text as English tokens rather