Re: lucene memory consumption

2008-05-29 Thread jian chen
Not that I can think of. But if you have any cached field data or norms arrays, those could be huge. I would be interested in hearing from others on this topic as well. Jian On 5/29/08, Alex [EMAIL PROTECTED] wrote: Hi, other than the in memory terms (.tii), and the few kilobytes of

simultaneous read and writes to the RAMDirectory

2008-05-16 Thread jian chen
Lucene gurus, I have a question regarding RAMDirectory usage. Can the IndexWriter keep adding documents to the index while an IndexReader is open on this RAMDirectory and searches are going on? I know that in the FSDirectory case, the IndexWriter can add documents to the index while the IndexReader

two copies of indexes vs. master/slave indexes

2008-05-16 Thread jian chen
I have seen two different designs for incremental index updates. 1) Have two copies of the index, A and B. The incremental updates happen on index A while index B is being used for search. Then, hot swap the two indexes. Bring index B up to date and perform incremental updates thereafter. In this

Re: Build vs. Buy?

2006-02-10 Thread jian chen
For reading a Word document as text, you can try AntiWord. I have written a simplified Lucene that does max-words matching. For example, if you are searching for aa, bb, cc, then a document that contains all the words (aa, bb, cc) will definitely be ranked higher than documents containing either aa,

Re: Urgent - File Lock in Lucene 1.2

2005-11-21 Thread jian chen
Hi, Karl, There have been quite a few discussions regarding the too many open files problem. From my understanding, it is due to Lucene trying to open multiple segments at the same time (during search/merging segments), and the operating system not allowing that many open file handles. If

Re: List of removed stop words?

2005-10-31 Thread jian chen
Hi, In case you are using StandardAnalyzer, there is a stop word list. I have used StandardAnalyzer.STOP_WORDS, which is a String[]. Cheers, Jian On 10/31/05, Rob Young [EMAIL PROTECTED] wrote: Hi, Is there an easy way to list stop words that were removed from a string? I'm using the
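As a rough illustration of the question quoted above (listing which stop words were dropped from a string), here is a minimal plain-Java sketch. The class name, the `removed` method, and the short word list are all hypothetical; the list only stands in for a `String[]` such as `StandardAnalyzer.STOP_WORDS`, and the simple split on non-word characters is an assumption that does not reproduce the analyzer's real tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class RemovedStopWords {
    // Assumption: a small stand-in list; in practice a String[] such as
    // StandardAnalyzer.STOP_WORDS would be passed in instead.
    static final String[] STOP_WORDS = {"a", "and", "of", "the", "to"};

    // Returns the stop words found in the text, in order of appearance
    // (duplicates are kept).
    static List<String> removed(String text, String[] stopWords) {
        Set<String> stops = new HashSet<>(Arrays.asList(stopWords));
        List<String> removed = new ArrayList<>();
        // Naive tokenization: lowercase, then split on runs of non-word characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (stops.contains(token)) {
                removed.add(token);
            }
        }
        return removed;
    }
}
```

Calling `RemovedStopWords.removed(input, RemovedStopWords.STOP_WORDS)` gives the dropped words; results will differ from what an analyzer removes wherever its tokenization differs from this split.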

Re: trying to boost a phrase higher than its individual words

2005-10-27 Thread jian chen
Hi, It seems what you want to achieve could be implemented using the Cover Density algorithm. I am not sure if any existing query class in the Lucene distribution does this already. But in case not, this is what I am thinking about: Make a custom query class, called CoverDensityQuery, which is

Re: java on 64 bits

2005-10-21 Thread jian chen
Hi, Also, I think you may try increasing the indexInterval. It is set to 128; if you make it larger, the .tii files will be smaller. Since .tii files are loaded into memory as a whole, your memory usage might be smaller. However, this change might affect your search speed. So, be careful

Re: Large queries

2005-10-16 Thread jian chen
Hi, Trond, It should be no problem for Lucene to handle 6 million documents. For your query, it seems you want to do a disjunctive (or'ed) query for multiple terms, 10 terms or 1 terms for example. The worst case I can think of is, you can very easily write your own query class to handle

Re: maximum number of documents

2005-10-12 Thread jian chen
Hi, Koji, I think you are right, the max num of documents should be Integer.MAX_VALUE. Some more points below: 1) I double checked the Lucene documentation. It mentioned in the file format that SegSize is UInt32. I don't think this is accurate, as UInt32 is around 4 billion, but

Re: Storing HashMap as an UnIndexed Field

2005-09-20 Thread jian chen
well, certainly you can serialize into a byte stream and encode it using base64. Jian On 9/20/05, Mordo, Aviran (EXP N-NANNATEK) [EMAIL PROTECTED] wrote: I can't think of a way you can use serialization, since lucene only works with strings. -Original Message- From: Tricia

storing inverted document as a field

2005-09-19 Thread jian chen
Hi, I am playing with the Lucene source code and have this somewhat stupid question, so please bear with me ;-) Basically, I want to implement a custom ranking algorithm. That is, iterating through the documents that contain all the search keywords, for each document, retrieve its inverted

Re: Small problem in searching

2005-09-15 Thread jian chen
Hi, I think Lucene transforms a prefix match query into sub-queries, so that searching for a prefix results in a search for all terms that begin with that prefix. For postfix match, I think you need to do more work than relying on Lucene's query parser. You can iterate over the

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
Hi, It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used? Cheers, Jian On 8/26/05, Marvin Humphrey [EMAIL PROTECTED] wrote: Greets, [crossposted to java-user@lucene.apache.org and [EMAIL

Re: read past EOF

2005-08-27 Thread jian chen
Hi, It seems this problem only happens when the index files get really large. Could it be because Java has trouble handling very large files on Windows machines (I guess there is a max file size on Windows)? In Lucene, I think there is a maxDoc kind of parameter that you can use to specify, when

Re: Books about Lucene?

2005-08-26 Thread jian chen
Hi, Erik, Some time ago I played with the Lucene 1.2 source code and made some modifications to it, trying to add my own ranking algorithm. I am not sure if, license-wise, it is permissible to modify the earlier source code, or whether it is allowed to put the modified version or the description

Re: Serialized Java Objects

2005-08-25 Thread jian chen
Hi, I don't think it does so by default. But you can certainly serialize the Java object and use base64 to encode it into a text string; then you can store it as a field. Cheers, Jian On 8/25/05, Kevin L. Cobb [EMAIL PROTECTED] wrote: I just had a thought this morning. Does Lucene have the
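The serialize-then-base64 idea from the reply above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Base64FieldCodec`, `encode`, and `decode` are hypothetical names, not Lucene APIs, and `java.util.Base64` requires Java 8 or later, which postdates this thread. The resulting string is what would be stored as an unindexed field value; the Lucene call itself is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.HashMap;

// Hypothetical helper, not part of Lucene.
class Base64FieldCodec {
    // Serializes the object with standard Java serialization and
    // returns the bytes as a base64 text string.
    static String encode(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverses encode(): base64 text back to bytes, then to an object.
    static Object decode(String text) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(text);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, Integer> map = new HashMap<>();
        map.put("rooms", 3);
        String fieldValue = encode(map);       // text suitable for a stored field
        System.out.println(decode(fieldValue)); // round-trips to an equal map
    }
}
```

Base64 output is roughly a third larger than the raw bytes, and deserializing is only safe for data the application wrote itself, so this suits small, trusted values.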

Re: Integrate Lucene with Derby

2005-08-13 Thread jian chen
Hi, I am also interested in that. I haven't used Derby before, but it seems to be the Java database of choice, as it is open source and a full relational database. I plan to learn the simple usage of Derby and then think about integrating Derby with Lucene. Maybe we should post our progress for the

Re: Integrate Lucene with Derby

2005-08-13 Thread jian chen
asked the kind people of Derby-users, and they say there is no solution for this yet. I guess we can ask the people on the -developer list On 8/13/05, jian chen [EMAIL PROTECTED] wrote: Hi, I am also interested in that. I haven't used Derby before, but it seems the java

Re: DOM or XML representation of a query?

2005-08-10 Thread jian chen
Well, the good practice I think is to decouple the backend from the front end as much as possible. You might have different versions of java running for each end and also, there might be code compatibility issues with different versions. Jian On 8/10/05, Andrew Boyd [EMAIL PROTECTED] wrote:

Re: Too many open files error using tomcat and lucene

2005-07-20 Thread jian chen
Hi, Dan, I think the problem you mentioned is one that has been discussed many times on this mailing list. The bottom line is that you'd better use the compound file format to store indexes. I am not sure Lucene 1.3 has that available, but, if possible, can you upgrade to Lucene 1.4.3? Cheers,

Re: Lucene index integrity during a system crash

2005-07-16 Thread jian chen
able to manually edit with a hex editor. Otis --- jian chen [EMAIL PROTECTED] wrote: Hi, I know Lucene does not have transaction support at this stage. However, I want to know what will happen if there is an operating system crash during the indexing process, will the Lucene

Re: Lucene index integrity during a system crash

2005-07-16 Thread jian chen
, they are irrelevant. Otis P.S. Did you ask about locking in Lucene the other day? --- jian chen [EMAIL PROTECTED] wrote: Hi, Otis, Thanks for your email. As this is very important for using Lucene in our production system, I looked at the code to try to understand. Here is my observation

Re: non-lexical comparisons

2005-07-07 Thread jian chen
Yeah, an RDBMS makes sense. In this case, would it be better to simply store those in a relational database and just use Lucene to do indexing for the text? Cheers, Jian On 7/7/05, Leos Literak [EMAIL PROTECTED] wrote: I know the answer, but just for curiosity: have you guys ever thought

Re: Retrieval model used by Lucene

2005-07-04 Thread jian chen
Well, I guess Lucene's Span query uses the Cover Density based model (proximity model). However, it is within the framework of the TF*IDF as well. Jian On 7/4/05, Dave Kor [EMAIL PROTECTED] wrote: Quoting [EMAIL PROTECTED]: Hi everybody, which kind of retrieval model is lucene using? Is

Re: No.of Files in Directory

2005-06-30 Thread jian chen
to retrieve the original document sometimes. I did not quite understand your second suggestion. Can you please help me understand better, a pointer to some web resource will also help. jian chen [EMAIL PROTECTED] wrote: Hi, Depending on the operating system, there might be a hard limit

question regarding the commit.lock

2005-06-29 Thread jian chen
Hi, I am looking at and trying to understand more about Lucene's reader/writer synchronization. Does anyone know when the commit.lock is released? I could not find it anywhere in the source code. I did see the write.lock is released in IndexWriter.close(). Thanks,

Re: Design question [too many fields?]

2005-06-29 Thread jian chen
Hi, Naimdjon, I have some suggestions as well along the lines of Mark Harwood. As an example, suppose for each hotel room there is a description, and you want the user to do free text search on the description field. You could do the following: 1) store hotel room reservation info as rows in

Re: Strategy for making short documents not bubble to the top?

2005-06-29 Thread jian chen
Hi, I would use a pure span or cover density based ranking algorithm, which does not take document length into consideration. (tweaking whatever is currently in the standard Lucene distribution?) For example, searching for the keywords beautiful house, span/cover ranking will treat a long document and a

Re: No.of Files in Directory

2005-06-29 Thread jian chen
Hi, Depending on the operating system, there might be a hard limit on the number of files in one directory (windoze versions). Even with operating systems that don't have a hard limit, it is still better not to put too many files in one directory (linux). Typically, the file system won't be

Re: Lock File exceptions

2005-06-27 Thread jian chen
Hi, Recently I looked at the locking mechanism of Lucene. If I am correct, I think the process for grabbing the lock file will time out by default in 10 seconds. When the process times out, it will print out the IOException. The Lucene locking mechanism is not within threads in the same JVM. It

Fwd: when is the commit.lock released?

2005-06-27 Thread jian chen
Hi, I haven't heard anything back. Probably this email got lost on the way or whatsoever. Anyway, could anyone enlighten me on this? Thanks, Jian -- Forwarded message -- From: jian chen [EMAIL PROTECTED] Date: Jun 26, 2005 12:59 PM Subject: when is the commit.lock released

when is the commit.lock released?

2005-06-26 Thread jian chen
Hi, I am looking at and trying to understand more about Lucene's reader/writer synchronization. Does anyone know when the commit.lock is released? I could not find it anywhere in the source code. I did see the write.lock is released in IndexWriter.close(). Thanks, Jian

Re: Span query performance issue

2005-06-24 Thread jian chen
Hi, I think a Span query in general has to do more work than a simple Phrase query. A Phrase query, in its simplest form, just tries to find all terms that are adjacent to each other. Meanwhile, the terms of a Span query are not necessarily adjacent to each other, but may have other words in between. Therefore,

document ids cached in Hits and index merge

2005-06-24 Thread jian chen
Hi, I have a stupid question regarding the transient nature of the document ids. As I understand it, documents will obtain new doc ids during the index merge. Suppose you do a search and get the Hits object. While you iterate through the documents by id, the index merge happens. How the merge

Re: Updateing Documents:

2005-06-21 Thread jian chen
Hi, You may look at this website http://www.zilverline.org Cheers, Jian On 6/21/05, Markus Atteneder [EMAIL PROTECTED] wrote: I am looking for a SearchEngine for our Intranet and so i deal with Lucene. I have read the FAQ and some Postings and i got first experiences with it and now i have

Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi, Interesting topic. I thought about this as well. I wanted to index Chinese text with English, i.e., I want to treat the English text inside Chinese text as English tokens rather than Chinese text tokens. Right now I think maybe I have to write a special analyzer that takes the text input,

Re: Indexing multiple languages

2005-05-31 Thread jian chen
characters into separate tokens also. Erik On May 31, 2005, at 5:49 PM, jian chen wrote: Hi, Interesting topic. I thought about this as well. I wanted to index Chinese text with English, i.e., I want to treat the English text inside Chinese text as English tokens rather