Re: lucene memory consumption

2008-05-29 Thread jian chen
Not that I can think of. But if you have any cached field data or norms
arrays, those could be huge.
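As a rough back-of-the-envelope (my own illustrative numbers, not from your setup): norms cost about one byte per document per indexed field, and they stay in memory once a reader has touched that field, so something like:

// Rough estimate only: norms are about one byte per document per indexed field.
long numDocs = 50000000L;   // hypothetical: 50 million documents
int indexedFields = 8;      // hypothetical: 8 indexed fields carrying norms
long normsBytes = numDocs * indexedFields;
System.out.println("approx norms memory: " + (normsBytes / (1024 * 1024)) + " MB");
// prints roughly: approx norms memory: 381 MB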

I'd be interested in hearing from others on this topic as well.

Jian

On 5/29/08, Alex [EMAIL PROTECTED] wrote:

 Hi,
 other than the in-memory terms (.tii) and the few kilobytes of open file
 buffers, what are some other sources of significant memory consumption
 when searching on a large index (> 100GB)? The queries are just normal
 term queries.



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




simultaneous read and writes to the RAMDirectory

2008-05-16 Thread jian chen
Lucene gurus,

I have a question regarding RAMDirectory usage. Can the IndexWriter keep
adding documents to the index while an IndexReader is open on the same
RAMDirectory and searches are going on?

I know that in the FSDirectory case, the IndexWriter can add documents to the
index while an IndexReader reads from it. This is because the IndexWriter
writes new index files rather than modifying existing ones. The only place
(as far as I remember) where the new and old indexes conflict is the segments
file. Once the IndexWriter commits the change (by calling the close() method),
segments.new is renamed to segments atomically. Since the old segments file is
cached in memory by the IndexReader, it is not a problem for the IndexReader
to keep serving search requests. The old segments file is cached in memory,
and the other files it points to are kept open by Linux anyway, or are not
removed by Windows because they are still in use.

Anyway, back to the RAMDirectory case: will having an IndexReader open while
the IndexWriter keeps adding documents cause any issues?
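To make the question concrete, here is a minimal sketch of the pattern I mean (assuming a 2.x-era API; the field name and texts are made up, and whether the concurrent reader is safe is exactly what I am asking):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class RamDirReadWriteSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    writer.addDocument(makeDoc("first doc"));
    writer.close(); // initial commit so a reader has something to open

    IndexWriter appender = new IndexWriter(dir, new StandardAnalyzer(), false);
    IndexSearcher searcher = new IndexSearcher(dir); // reader on the same RAMDirectory

    // searches keep running against the snapshot the reader was opened on...
    Hits hits = searcher.search(new TermQuery(new Term("body", "first")));
    System.out.println("hits: " + hits.length());

    // ...while the writer keeps adding documents to the same RAMDirectory
    appender.addDocument(makeDoc("second doc"));

    appender.close();
    searcher.close();
  }

  private static Document makeDoc(String text) {
    Document doc = new Document();
    doc.add(new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED));
    return doc;
  }
}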

Thanks,

Jian


two copies of indexes vs. master/slave indexes

2008-05-16 Thread jian chen
I have seen two different designs for incremental index updates.

1) Have two copies of the index, A and B. The incremental updates happen on
index A while index B is being used for search. Then hot-swap the two indexes,
bring index B up to date, and perform incremental updates on it thereafter.
In this scenario, searches are performed on index A or B alternately.

2) Have a master index where the incremental updates are applied. Then the
slave indexes get synced up with the master index. Searches are performed
only on the slave indexes.

So, I want to know: what are the trade-offs between the two approaches, and
which one is better for scalability?

Thanks,

Jian


Re: Build vs. Buy?

2006-02-10 Thread jian chen
For reading Word documents as text, you can try Antiword.

I have written a simplified Lucene variant that does max-words matching.

For example, if you are searching for aa, bb, cc, then a document that
contains all the words (aa, bb, cc) will definitely be ranked higher than
documents containing only aa, bb or aa, cc or bb, cc.

I am going to put up the code as open source. If you are interested, you can
email me directly.

Jian


On 2/9/06, P. Alex. Salamanca R. [EMAIL PROTECTED] wrote:

 On the other hand, if you want the cheapest option, why not give the
 Google Search Appliance a chance?




Re: Urgent - File Lock in Lucene 1.2

2005-11-21 Thread jian chen
Hi, Karl,

There have been quite a few discussions regarding the "too many open files"
problem. From my understanding, it is due to Lucene trying to open multiple
segments at the same time (during search/segment merging), and the
operating system won't allow opening that many file handles.

If you have a lot of fields, each will have its own file (or set of files,
I don't remember exactly). This could make the above issue worse.

The way to fix this is to combine all the files for each segment into one
physical file. When that physical file is opened, multiple streams are read
from within it. This fix (the compound file format) went into Lucene 1.4, I
think, but is not available in Lucene 1.2.
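For reference, in 1.4 the compound format is just a switch on the writer (a minimal sketch against the 1.4 API; the "index" path is a placeholder, and this obviously does not apply to 1.2 until the feature is ported):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CompoundFileSketch {
  public static void main(String[] args) throws Exception {
    // false = open an existing index at the placeholder path "index"
    IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
    // Store each segment as a single .cfs file instead of many per-field
    // files, which keeps the number of open file handles down.
    writer.setUseCompoundFile(true);
    writer.optimize(); // rewrites the segments using the compound format
    writer.close();
  }
}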

Currently I am trying to find some spare time so that I could port the
compound file format (.cfs) feature from Lucene 1.4 to Lucene 1.2.

Hope this information could help you.

Cheers,

Jian


On 11/20/05, Karl Koch [EMAIL PROTECTED] wrote:

 Hello group,

 I am running Lucene 1.2 and I have the following error message. I got this
 message when performing a search:

 Failed to obtain file lock on /tmp/qcop-msg-qpe

 I am running Lucene 1.2 on a Sharp Zaurus PDA with embedded Linux.

 When I look through the exceptions I got before that, I can see that I have
 an IOException "Too many open files" happening somewhere in the
 FSDirectory...


 Regards,
 Karl



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: List of removed stop words?

2005-10-31 Thread jian chen
Hi,

If you are using StandardAnalyzer, there is a stop word list:
StandardAnalyzer.STOP_WORDS, which is a String[].
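For example, a quick way to report which words were dropped from the user's string (just a sketch; the query text is made up, and this simply re-checks the raw words against the stop list rather than hooking into the analyzer):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class RemovedStopWords {
  public static void main(String[] args) {
    Set stopWords = new HashSet(Arrays.asList(StandardAnalyzer.STOP_WORDS));
    String userQuery = "the house on the hill";

    // collect the words from the raw query string that are in the stop list
    List removed = new ArrayList();
    StringTokenizer st = new StringTokenizer(userQuery);
    while (st.hasMoreTokens()) {
      String word = st.nextToken().toLowerCase();
      if (stopWords.contains(word)) {
        removed.add(word);
      }
    }
    System.out.println("Stop words removed: " + removed); // [the, on, the]
  }
}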

Cheers,

Jian

On 10/31/05, Rob Young [EMAIL PROTECTED] wrote:

 Hi,

 Is there an easy way to list stop words that were removed from a string?
 I'm using the standard analyzer on users' search strings and I would like
 to let them know when stop words have been removed (a la Google). Any
 ideas?

 Cheers
 Rob

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: trying to boost a phrase higher than its individual words

2005-10-27 Thread jian chen
Hi,

It seems what you want to achieve could be implemented using the cover
density algorithm. I am not sure whether any existing query class in the
Lucene distribution does this already, but in case not, this is what I am
thinking about:

Make a custom query class, called CoverDensityQuery, which is modeled after
PhraseQuery.

The CoverDensityQuery could accept two arguments in its constructor: the
terms and numOfTermsMatched.

For example, to search for "classical music", you would first construct a
CoverDensityQuery like:
new CoverDensityQuery(new String[]{"classical", "music"}, 2);

This should return all documents that contain both "classical" and "music".
The ranking will be based on covers; each cover is a span with the two terms
at its ends. The shorter the cover, the higher the rank; the more covers,
the higher the rank.

If not enough documents are returned, then do another query like:
new CoverDensityQuery(new String[]{"classical", "music"}, 1);

This should return documents containing either "classical" or "music", but
not both.

The detailed algorithm would be constructed similarly to PhraseQuery.

I will write such a query class in the future, just as a proof of concept
for the cover density algorithm.

Cheers,

Jian

On 10/27/05, Andy Lee [EMAIL PROTECTED] wrote:

 I have a situation where I want to search for individual words in a
 phrase as well as the phrase itself. For example, if the user enters
 ["classical music"] (with quotes) I want to find documents that
 contain "classical music" (the phrase) *and* the individual words
 "classical" and "music".

 Of course, I could just search for the individual words and the
 phrase would get found as a consequence. But I want documents
 containing the phrase to appear first in the search results, since
 the phrase is the user's primary interest.

 I've constructed the following query, using boost values...

 [+(content:"classical music"^5.0 content:classical^0.1
 content:music^0.1)]

 ...but the boost values don't seem to affect the order of the search
 results.

 Am I misunderstanding the purpose or proper usage of boosts, and if
 so, can someone explain (at least roughly) how to achieve the desired
 result?

 --Andy


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: java on 64 bits

2005-10-21 Thread jian chen
Hi,

Also, I think you can try increasing the indexInterval. It is set to 128 by
default; making it larger makes the .tii file smaller. Since the .tii file is
loaded into memory as a whole, your memory usage might go down. However, this
change may affect your search speed, so be careful about the value you set
and don't go too high.
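If you are on the 1.9-dev codebase, I believe the interval is settable on the writer (a sketch under that assumption; the "index" path is a placeholder, and in 1.4 the 128 is, as far as I remember, a constant inside TermInfosWriter that you would have to patch in the source):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TermIndexIntervalSketch {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
    // Default is 128; a larger interval means a smaller .tii file in memory
    // at the cost of slightly slower term lookups.
    writer.setTermIndexInterval(256);
    // ... add documents as usual ...
    writer.close();
  }
}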

Just my thoughts, hope it helps.

Jian

On 10/21/05, Aigner, Thomas [EMAIL PROTECTED] wrote:

 I have seen quite a few posts on using the 1.9 dev version for
 production uses. How stable is it? Is it really ready for production?
 I would like to use it.. but I never ever put beta packages in
 procution.. but then again.. I'm always dealing with Microsoft :)

 Tom

 -Original Message-
 From: Yonik Seeley [mailto:[EMAIL PROTECTED]
 Sent: Friday, October 21, 2005 9:28 AM
 To: java-user@lucene.apache.org
 Subject: Re: java on 64 bits

 1) make sure the failure was due to an OutOfMemory exception and not
 something else.
 2) if you have enough memory, increase the max JVM heap size (-Xmx)
 3) if you don't need more than 1.5G or so of heap, use the 32 bit JVM
 instead (depending on architecture, it can actually be a little faster
 because more references fit in the CPU cache).
 4) see how many indexed fields you have and if you can consolidate any of them
 4.5) if you don't have too many indexed fields, and have enough spare file
 descriptors, try using the non-compound file format instead.
 5) run with the latest version of lucene (1.9 dev version) which may have
 better memory usage during optimizes & segment merges.
 6) If/when optional norms http://issues.apache.org/jira/browse/LUCENE-448
 makes it into lucene, you can apply it to any indexed fields for which you
 don't need index-time boosting or length normalization.

 As for getting rid of your current intermediate files, I'd rebuild from
 scratch just to ensure things are OK.

 -Yonik
 Now hiring -- http://tinyurl.com/7m67g

 On 10/21/05, Roxana Angheluta [EMAIL PROTECTED] wrote:
 
  Thank you, Yonik, it seems this is the case.
  What can we do in this case? Would running the program with java -d32
 be
  a solution?
 
  Thanks again,
  roxana
   One possibility: if lucene runs out of memory while adding or optimizing,
   it can leave unused files behind that increase the size of the index. A 64
   bit JVM will require more memory than a 32 bit one due to the size of all
   references being doubled.
   
   If you are using the compound file format (the default - check for .cfs
   files), then it's easy to check if you have this problem by seeing if
   there are any *.f* files in the index directory. These are intermediate
   files and shouldn't exist for long in a compound-file index.
  
  -Yonik
  Now hiring -- http://tinyurl.com/7m67g
 
 

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Large queries

2005-10-16 Thread jian chen
Hi, Trond,

It should be no problem for Lucene to handle 6 million documents.

For your query, it seems you want to do a disjunctive (OR'ed) query over
multiple terms, 10 terms or 1 terms for example. In the worst case, you can
fairly easily write your own query class to handle this, utilizing the
TermDocs iterator class.

Say you want documents that have one of 10 docIDs: you get 10 TermDocs, each
corresponding to one term. Then you can do a multi-way (in this case, 10-way)
merge of these 10 TermDocs and generate the final list of doc ids.

I suggest you look at PhraseQuery and PhraseScorer to see how they do the
conjunctive merge to find the docs that contain all the terms. In your case,
instead of doing an intersection, you are doing a union of all the term
docs, right?

Maybe there is already some query class that comes with the Lucene package
that does this. However, the method I described should also help just in
case.
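A bare-bones version of that union would look roughly like this (a sketch against the 1.4-era API; docID is the field name from your message, everything else is made up):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class DocIdUnionSketch {
  // Collect every document matching any of the given docID values.
  public static BitSet union(IndexReader reader, String[] docIds) throws IOException {
    BitSet result = new BitSet(reader.maxDoc());
    for (int i = 0; i < docIds.length; i++) {
      TermDocs termDocs = reader.termDocs(new Term("docID", docIds[i]));
      try {
        while (termDocs.next()) {
          result.set(termDocs.doc()); // mark this document as a match
        }
      } finally {
        termDocs.close();
      }
    }
    return result;
  }
}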

Cheers,

Jian


On 10/16/05, Trond Aksel Myklebust [EMAIL PROTECTED] wrote:

 How is Lucene handling very large queries? I have 6million documents,
 which
 each has a docID field. There is a total of 2 distinct docID's, so
 many documents got the same docID which consists of a filename (only name,
 not path).

 Sometimes, I must get all documents that has one of 10 docID's, and
 sometimes I need to get all documents that has one of 1 docIDs. Is
 there
 any other way than doing a query: docID:(file1 file2 file3 file4..) ?



 Trond A Myklebust








Re: maximum number of documents

2005-10-12 Thread jian chen
Hi, Koji,

I think you are right, the max num of documents should be Integer.MAX_VALUE.


Some more points below:

1) I double-checked the Lucene documentation. The file formats document says
that SegSize is a UInt32. I don't think this is accurate: a UInt32 goes up to
around 4 billion, but Integer.MAX_VALUE is half of that, around 2 billion.

In Java there is no notion of an unsigned integer, and since Lucene uses an
int to store doc ids, the max you can get is therefore about 2 billion.

Maybe the documentation could mention this in more detail? Specifically, the
actual maximum document id, 2147483647, could be mentioned.

2) I think in theory, if you need to index 8 billion docs, you can use 4
indexes, and when you do the search, just search all 4 indexes and combine
the result sets (see the sketch after point 4 below).

3) Looking at the Lucene source code, it seems it would not be that difficult
to change the doc id to use a long instead. It occurs to me that
OutputStream's writeVInt and writeVLong use exactly the same code, so there
should be no performance penalty in switching to long.

4) However, if you have 8 billion docs to index, just changing the doc id to
long is probably not enough. You may also need to adjust other parameters,
such as the indexInterval (for the term info index). Because the term info
index (.tii) is loaded into memory in its entirety, instead of leaving it at
128 you may have to change it to 256 or larger to avoid out-of-memory issues.
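Going back to point 2: combining the result sets is roughly what MultiSearcher already does. A sketch (the index paths and field are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;

public class FourIndexSearchSketch {
  public static void main(String[] args) throws Exception {
    // each index can hold up to ~2 billion docs; four of them cover ~8 billion
    Searchable[] shards = {
      new IndexSearcher("/index/part1"),
      new IndexSearcher("/index/part2"),
      new IndexSearcher("/index/part3"),
      new IndexSearcher("/index/part4")
    };
    MultiSearcher searcher = new MultiSearcher(shards);
    Hits hits = searcher.search(new TermQuery(new Term("body", "lucene")));
    System.out.println("total hits: " + hits.length());
    searcher.close();
  }
}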

Cheers,

Jian

On 10/12/05, Koji Sekiguchi [EMAIL PROTECTED] wrote:

 Hello,

 Is the maximum number of documents in an index Integer.MAX_VALUE? (approx
 2
 billion)
 If so, if I want to have 8 billion docs indexed, like Google,
 can I do it with having four indices, theoretically?

 Koji




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Storing HashMap as an UnIndexed Field

2005-09-20 Thread jian chen
Well, you can certainly serialize it into a byte stream and encode it using
Base64.
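Something along these lines (a sketch using commons-codec; class and method names are my own, and error handling is left out):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import org.apache.commons.codec.binary.Base64;

public class HashMapFieldCodec {
  // HashMap -> serialized bytes -> Base64 text, safe to store in a Lucene field
  public static String encode(HashMap map) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(map);
    out.close();
    return new String(Base64.encodeBase64(bytes.toByteArray()), "US-ASCII");
  }

  // Base64 text read back from the stored field -> the original HashMap
  public static HashMap decode(String stored) throws Exception {
    byte[] bytes = Base64.decodeBase64(stored.getBytes("US-ASCII"));
    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
    return (HashMap) in.readObject();
  }
}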

Jian

On 9/20/05, Mordo, Aviran (EXP N-NANNATEK) [EMAIL PROTECTED] wrote:
 
 I can't think of a way you can use serialization, since lucene only
 works with strings.
 
 -Original Message-
 From: Tricia Williams [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, September 20, 2005 3:30 PM
 To: java-user@lucene.apache.org
 Subject: RE: Storing HashMap as an UnIndexed Field
 
 Do you think there is anyway that I could use the serialization already
 built into the HashMap data structure?
 
 On Tue, 20 Sep 2005, Mordo, Aviran (EXP N-NANNATEK) wrote:
 
  You can store the values as a comma-separated string (which then you'll
  need to parse manually back into a HashMap)
 
  -Original Message-
  From: Tricia Williams [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, September 20, 2005 3:14 PM
  To: java-user@lucene.apache.org
  Subject: Storing HashMap as an UnIndexed Field
 
  Hi,
 
  I'd like to store a HashMap for some extra data to be used when a
  given document is retrieved as a Hit for a query. To add an UnIndexed
 
  Field to an index takes only Strings as parameters. Does anyone have
  any suggestions on how I might convert the HashMap to a String that is
 
  efficiently recomposed into the desired HashMap on the other end?
 
  Thanks,
  Tricia
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 



storing inverted document as a field

2005-09-19 Thread jian chen
Hi, 

I am playing with Lucene source code and have this somewhat stupid question, 
so please bear with me ;-)

Basically, I want to implement a custom ranking algorithm. That is, iterate
through the documents that contain all the search keywords, and for each
document, retrieve its inverted document and rank the document based on the
inverted document as a whole.

Because of this thought, I want to store the inverted document for each 
document as a field. 

My question is, is this kind of data structure fast enough for searching, 
compared to the current Lucene approach where the proximity data is stored 
in the .prx files?

I know Lucene has (sloppy) phrase queries and span queries, but I am trying
to become more familiar with Lucene by implementing a custom query.

Thanks in advance for any suggestion or enlightenment!

Jian


Re: Small problem in searching

2005-09-15 Thread jian chen
Hi,

I think Lucene transforms a prefix match query into sub-queries: searching
for a prefix results in a search over all terms that begin with that prefix.

For postfix match, I think you need to do more work than relying on 
Lucene's query parser. 

You can iterate over the terms and do an endsWith() call, and if there is 
a match, then, perform a normal Lucene search for that term. 

So, effectively, you do the same thing as a prefix match: conceptually, loop
over all available terms in your dictionary and collect the matching terms
to be used for the actual search.
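In Lucene API terms the loop looks roughly like this (a sketch against the 1.4-era API; the field name and suffix are made up, and note that BooleanQuery's default limit of 1024 clauses still applies):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class SuffixSearchSketch {
  public static Hits search(IndexReader reader, String field, String suffix) throws Exception {
    BooleanQuery query = new BooleanQuery();
    TermEnum terms = reader.terms(new Term(field, ""));
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(field)) break; // past this field's terms
        if (t.text().endsWith(suffix)) {
          query.add(new TermQuery(t), false, false); // optional (OR) clause
        }
      } while (terms.next());
    } finally {
      terms.close();
    }
    return new IndexSearcher(reader).search(query);
  }
}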

This might be slow. What you might do to speed up performance is store all
the available terms in memory, so that looping through all unique terms is a
breeze. This is what Google used for their prototype search engine back in
1998 (I guess :-).

Cheers,

Jian

On 9/15/05, tirupathi reddy [EMAIL PROTECTED] wrote:
 
 Hi guys,
 
  I have some problem while searching using Lucene. Say I have something
  like tirupathireddy or venkatreddy in the index. When I search for the
  string reddy I have to get those things (i.e. tirupathireddy and
  venkatreddy). I have read in the query syntax of Lucene that * cannot be
  given at the start of the search string. So how can I achieve that? I am
  in very much need of that. So please help me out.
 
 
 WIth Regards,
 TirupatiReddy Manyam.
 
 
 Tirupati Reddy Manyam
 24-06-08,
 Sundugaullee-24,
 79110 Freiburg
 GERMANY.
 
 Phone: 00497618811257
 cell : 004917624649007
 
 



Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
Hi,

It seems to me that in theory, Lucene storage code could use true UTF-8 to 
store terms. Maybe it is just a legacy issue that the modified UTF-8 is 
used?

Cheers,

Jian

On 8/26/05, Marvin Humphrey [EMAIL PROTECTED] wrote:
 
 Greets,
 
 [crossposted to java-user@lucene.apache.org and [EMAIL PROTECTED]
 
 I've delved into the matter of Lucene and UTF-8 a little further, and
 I am discouraged by what I believe I've uncovered.
 
 Lucene should not be advertising that it uses standard UTF-8 -- or
 even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8. The
 two distinguishing characteristics of Modified UTF-8 are the
 treatment of codepoints above the BMP (which are written as surrogate
  pairs), and the encoding of null bytes as 1100 0000 1000 0000 rather
  than 0000 0000. Both of these became illegal as of Unicode 3.1
 (IIRC), because they are not shortest-form and non-shortest-form
 UTF-8 presents a security risk.
 
 The documentation should really state that Lucene stores strings in a
 Java-only adulteration of UTF-8, unsuitable for interchange. Since
 Perl uses true shortest-form UTF-8 as its native encoding, Plucene
 would have to jump through two efficiency-killing hoops in order to
 write files that would not choke Lucene: instead of writing out its
 true, legal UTF-8 directly, it would be necessary to first translate
 to UTF-16, then duplicate the Lucene encoding algorithm from
 OutputStream. In theory.
 
 Below you will find a simple Perl script which illustrates what
 happens when Perl encounters malformed UTF-8. Run it (you need Perl
 5.8 or higher) and you will see why even if I thought it was a good
 idea to emulate the Java hack for encoding Modified UTF-8, trying
 to make it work in practice would be a nightmare.
 
 If Plucene were to write legal UTF-8 strings to its index files, Java
 Lucene would misbehave and possibly blow up any time a string
 contained either a 4-byte character or a null byte. On the flip
 side, Perl will spew warnings like crazy and possibly blow up
 whenever it encounters a Lucene-encoded null or surrogate pair. The
 potential blowups are due to the fact that Lucene and Plucene will
 not agree on how many characters a string contains, resulting in
 overruns or underruns.
 
 I am hoping that the answer to this will be a fix to the encoding
 mechanism in Lucene so that it really does use legal UTF-8. The most
 efficient way to go about this has not yet presented itself.
 
 Marvin Humphrey
 Rectangular Research
 http://www.rectangular.com/
 
 #
 
 #!/usr/bin/perl
 use strict;
 use warnings;
 
 # illegal_null.plx -- Perl complains about non-shortest-form null.
 
  my $data = "foo\xC0\x80\n";
  
  open (my $virtual_filehandle, "+<:utf8", \$data);
  print <$virtual_filehandle>;
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 



Re: read past EOF

2005-08-27 Thread jian chen
Hi,

It seems this problem only happens when the index files get really large.
Could it be because Java has trouble handling very large files on Windows
machines (I guess there is a max file size on Windows)?

In Lucene, I think there is a maxDoc kind of parameter you can use to specify
that when the index gets really large, containing more than that many
documents, it will not try to merge the index files into one. Could this be
used to stop the index files from growing forever?
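The knob I have in mind is maxMergeDocs; a sketch, assuming the 1.4.3 API where it is a public field on IndexWriter (later releases use a setter), with a placeholder path and value:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class MaxMergeDocsSketch {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
    // Segments with more than this many documents are not merged further,
    // so no single index file keeps growing without bound.
    writer.maxMergeDocs = 1000000; // public field in 1.4.3; setMaxMergeDocs(...) later
    // ... add documents as usual ...
    writer.close();
  }
}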

Cheers,

Jian

On 8/27/05, Ouyang, Hui [EMAIL PROTECTED] wrote:
 
 Hi,
  I had lots of "docs out of order" issues when the index was optimized. I
  made the changes based on the suggestion in this link
  http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23650
  
  It seems this issue is solved. But some indexes get "read past EOF" when I
  do optimization. The index is over 2G and there are some documents deleted
  from the index. It is based on Lucene 1.4.3 on Windows.
  Does anyone know how to avoid this issue? Thx.
 
 Regards,
 hui
 
 
 
 
  merging segments _1ny5 (38708 docs) _1ot0 (1000 docs) _1t2m (4810 docs)
  java.io.IOException: read past EOF
          at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:218)
          at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
          at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
          at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:323)
          at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:429)
          at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:94)
          at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:510)
          at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:370)
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 



Re: Books about Lucene?

2005-08-26 Thread jian chen
Hi, Erik,

Some time ago I played with the Lucene 1.2 source code and made some
modifications to it, trying to add my own ranking algorithm. I am not sure
whether, licence-wise, it is permissible to modify the earlier source code,
and whether it is allowed to put the modified version, or a description of
what I have done, on the wiki?

Thanks for your reply.

Jian


On 8/26/05, Erik Hatcher [EMAIL PROTECTED] wrote:
 
 I appreciate the vote of confidence on this, but I am not afraid to
 admit that I do not consider myself an expert on the deep innards of
 Lucene. I understand the concepts, and a bit of the internals, but I
 certainly do not live up to the hype you just bestowed upon me. *blush*
 
 Regarding JDK 1.2 - I came to Java at 1.3, and have never used a JDK
 earlier than that. All the apps I build now are currently on JDK 1.5
 (err... 5.0). I do not currently know what would be involved in
 running Lucene on a 1.2 VM. The first question to ask is whether an
 earlier version of Lucene is sufficient for the needs of those
 constrained to JDK 1.2. If not, then we move forward to defining
 what needs to be changed - a simple compilation of the trunk source
 code with a 1.2 VM would give away most of the details.
 
 As with open source in general, it is about scratching our own
 itches. If you're using Lucene (or need to use Lucene) in a 1.2 VM,
 that is your itch to scratch and I would happily support your efforts
 in some way in documenting this (either on the wiki or embedded in
 Lucene's own built-in documentation) or in providing an alternative
 version of Lucene that is suitable for 1.2 (perhaps by having
 alternative code in a separate directory within our code
 repository). If you create such documentation, perhaps you'd be
 willing to donate it with full attribution to the 2nd edition of
 LIA. But please don't wait for me to do it, as it really is not
 something I need personally for any project - all my projects are at
 JDK 1.5 currently.
 
 Erik
 



Re: Serialized Java Objects

2005-08-25 Thread jian chen
Hi,

I don't think it does so by default. But you can certainly serialize
the Java object and use Base64 to encode it into a text string; then
you can store it as a field.

Cheers,

Jian

On 8/25/05, Kevin L. Cobb [EMAIL PROTECTED] wrote:
 I just had a thought this morning. Does Lucene have the ability to store
 serialized Java objects for return during a search? I was thinking that
 this would be a nifty way to package up all of the return values for a
 search. Of course, I wouldn't expect the serialized objects to be
 searchable.
 
 Thanks,
 
 -Kevin
 
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Integrate Lucene with Derby

2005-08-13 Thread jian chen
Hi, 

I am also interested in that. I haven't used Derby before, but it
seems to be the Java database of choice, as it is open source and a
full relational database.

I plan to learn the basic usage of Derby and then think about
integrating Derby with Lucene.

Maybe we should post our progress on the integration, and the various
integration schemes, in this thread or somewhere else?

Thanks,

Jian

On 8/13/05, Mag Gam [EMAIL PROTECTED] wrote:
 Are there any documens or plans to integrate Lucene With Apache Derby
 (database)?
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Integrate Lucene with Derby

2005-08-13 Thread jian chen
I just downloaded a copy of the Derby binary and successfully ran the
simple example Java program. It seems Derby is extremely easy to use
as an embedded Java database engine.

This gave me some confidence that I could integrate Lucene with Derby,
and possibly the Jetty server, to make a complete Java-based solution
for a hobby search project.

I will post more regarding this integration as I go along.

Cheers,

Jian
www.jhsystems.net

On 8/13/05, Mag Gam [EMAIL PROTECTED] wrote:
 yes. I have been looking for solutions for a while now. I am not too
 good with Java but I am learning it...
 
 I have asked the kind people of Derby-users, and they say there is no
 solution for this yet.
 
 I guess we can ask the people on the -developer list
 
 
 On 8/13/05, jian chen [EMAIL PROTECTED] wrote:
  Hi,
 
  I am also interested in that. I haven't used Derby before, but it
  seems the java database of choice as it is open source and a full
  relational database.
 
  I plant to learn the simple usage of Derby and then think about
  integrating Derby with Lucene.
 
  May we should post our progress for the integration and various
  schemes of integration in this thread or somewhere else?
 
  Thanks,
 
  Jian
 
  On 8/13/05, Mag Gam [EMAIL PROTECTED] wrote:
   Are there any documens or plans to integrate Lucene With Apache Derby
   (database)?
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: DOM or XML representation of a query?

2005-08-10 Thread jian chen
Well, good practice, I think, is to decouple the backend from the
front end as much as possible. You might have different versions of
Java running on each end, and there might also be code compatibility
issues between different versions.

Jian

On 8/10/05, Andrew Boyd [EMAIL PROTECTED] wrote:
 Query is Serializable  why not use that?
 
 -Original Message-
 From: Roy Klein [EMAIL PROTECTED]
 Sent: Aug 10, 2005 10:08 AM
 To: java-user@lucene.apache.org
 Subject: DOM or XML representation of a query?
 
 Hi,
 
 The front-end guys working on my application need a way to pass me complex
 queries. I was thinking that it'd be pretty straightforward to hand them a
 package which helps them to create a DOM object which describes a query
 (i.e. nested Booleans combined with phrases and keyword searches, sort by
 field, etc).   I did a few searches in the archive of this list, but didn't
 find any examples, however, I suspect it's a common requirement amongst
 members of this list.
 
 Can anybody point be at an example of the above?
 
 Thanks!
 
 Roy
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 Andrew Boyd
 Software Architect
 Sun Certified J2EE Architect
 BB Technical Services Inc.
 205.422.2557
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Too many open files error using tomcat and lucene

2005-07-20 Thread jian chen
Hi, Dan,

I think the problem you mentioned is one that has been discussed a
lot of times on this mailing list.

The bottom line is that you'd better use the compound file format to store
your indexes. I am not sure Lucene 1.3 has that available, but if
possible, can you upgrade to Lucene 1.4.3?

Cheers,

Jian

On 7/20/05, Dan Pelton [EMAIL PROTECTED] wrote:
 We are getting the following error in our tomcat error log.
 /dsk1/db/lucene/journals/_clr.f7 (Too many open files)
 java.io.FileNotFoundException: /dsk1/db/lucene/journals/_clr.f7 (Too many 
 open files)
  at java.io.RandomAccessFile.open(Native Method)
 
 We are using the following
 lucene-1.3-final
 SunOS thor 5.8 Generic_117350-21 sun4u sparc SUNW,Ultra-250
 tomcat 4.1.34
 Java 1.4.2
 
 
 Does any one have any idea how to resolve this. Is it an OS, java or tomcat
 problem.
 
 thanks,
 Dan P.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene index integrity during a system crash

2005-07-16 Thread jian chen
Hi, Otis,

Thanks for your email. As this is very important for using Lucene in
our production system, I looked at the code to try to understand it. Here
is my observation on why the index won't be corrupted during a system
crash.

In the IndexWriter.java mergeSegments(...) method, there are two lines:
segmentInfos.write(directory);    // commit before deleting
deleteSegments(segmentsToDelete); // delete unused segments

The segmentInfos.write(...) call writes the new segments file as
"segments.new"; once the write is complete, it renames "segments.new"
to "segments".

I guess the rename operation is atomic, as guaranteed by the operating
system. Otherwise, the segments file would be left in an inconsistent
state after a system crash.

It also appears to me that the segments file is the single point of
switching from the old set of index segments to the new ones. In case of a
system failure, the old segments file will be used anyway, so there is no
corruption.

Is this understanding correct and thorough?

Thanks a lot,

Jian

On 7/16/05, Otis Gospodnetic [EMAIL PROTECTED] wrote:
 The only corruption that I've seen mentioned on this list so far was
 the corruption of the segments file, and even that people have been
 able to manually edit with a hex editor.
 
 Otis
 
 
 --- jian chen [EMAIL PROTECTED] wrote:
 
  Hi,
 
  I know Lucene does not have transaction support at this stage.
  However, I want to know what will happen if there is an operating
  system crash during the indexing process, will the Lucene index got
  corrupted?
 
  Thanks,
 
  Jian
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene index integrity during a system crash

2005-07-16 Thread jian chen
Thanks, Otis and Nikhil, for your confirmation. I am now more confident
about the Lucene index integrity.

Nikhil, I don't see the reason why there is a corrupted .fdx file.
Could it be caused by multi-threaded access to the index?

Otis, I don't remember asking about locking questions the other day.
I think that must have been someone else.

Thanks all,

Jian

On 7/16/05, Otis Gospodnetic [EMAIL PROTECTED] wrote:
 Hi Jian,
 
 Yes, I think what you describe is correct.  You may end up with some
 junk index segments in the index directory, but as long as they are
 not recorded in the segments file, they are irrelevant.
 
 Otis
 P.S.
 Did you ask about locking in Lucene the other day?
 
 
 --- jian chen [EMAIL PROTECTED] wrote:
 
  Hi, Otis,
 
  Thanks for your email. As this is very important for using Lucene in
  our production system, I looked at the code to try to understand.
  Here
  is my observation why the index won't be corrupted during a system
  crash.
 
  In the IndexWriter.java mergeSegments(...) method, there are two
  lines:
  segmentInfos.write(directory);  // commit before deleting
  deleteSegments(segmentsToDelete);//delete unused segments
 
  The sgementInfos.write(...) writes the new segments file as
  segments.new, once the write is complete, it renames segments.new
  to segments.
 
  I guess the rename operation is atomic as guaranteed by the operating
  system. Otherwise, the segments file will be left in an
  inconsistent
  state during the system crash.
 
  It also appears to me that the segments file is the single point to
  switch from old set of index segments to new ones. In case of a
  system
  failure, the old segments file will be used anyway, so, no
  corruption.
 
  Is this understanding correct and thorough?
 
  Thanks a lot,
 
  Jian
 
  On 7/16/05, Otis Gospodnetic [EMAIL PROTECTED] wrote:
   The only corruption that I've seen mentioned on this list so far
  was
   the corruption of the segments file, and even that people have been
   able to manually edit with a hex editor.
  
   Otis
  
  
   --- jian chen [EMAIL PROTECTED] wrote:
  
Hi,
   
I know Lucene does not have transaction support at this stage.
However, I want to know what will happen if there is an operating
system crash during the indexing process, will the Lucene index
  got
corrupted?
   
Thanks,
   
Jian
   
   
  -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   
   
  
  
  
  -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: non-lexical comparisons

2005-07-07 Thread jian chen
Yeah, an RDBMS makes sense. In this case, would it be better to simply
store those values in a relational database and just use Lucene to do
the indexing for the text?

Cheers,

Jian

On 7/7/05, Leos Literak [EMAIL PROTECTED] wrote:
 I know the answer, but just out of curiosity:
 
 have you guys ever thought about non-lexical comparison
 support? For example, I started to index the number of replies
 in a discussion, so I can find questions without an answer,
 with one reply, two comments etc. But I cannot simply
 express that I want to find questions with more than five
 comments (there are ways using regexps, but I don't consider
 them simple).
 
 Probably such a feature belongs to an RDBMS rather than to a fulltext
 library .. I am just interested in your opinion. (I expect
 that my users will raise the question of why they cannot use
 such a condition, so I ask in advance.)
 
 Leos
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Retrieval model used by Lucene

2005-07-04 Thread jian chen
Well, I guess Lucene's Span query uses the Cover Density based model
(proximity model). However, it is within the framework of the TF*IDF
as well.

Jian

On 7/4/05, Dave Kor [EMAIL PROTECTED] wrote:
 Quoting [EMAIL PROTECTED]:
 
  Hi everybody,
 
  which kind of retrieval model is lucene using? Is it a simple vector model,
  a extended boolean model or another model? A reliable source with
  information about it would be fine, cause every source i found is telling
  something different. :)
 
 
 Lucene uses the standard vector space model, basically TF*IDF.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: No.of Files in Directory

2005-06-30 Thread jian chen
Hi,

My second suggestion is basically to store the user documents (Word
docs) directly in the Lucene index.

1) If you are using Lucene 1.4.3, you can do something like this:

// suppose the word docs are now in byte array
byte[] wordDoc = getUploadedWordDoc();

// add the byte array to lucene index
Document doc = new Document();
doc.add(Field.UnIndexed("originalDoc", getBase64(wordDoc)));

The getBase64 method basically transforms the bytes into ASCII text, as follows:
String getBase64(byte[] wordDoc) throws UnsupportedEncodingException
{
  byte[] chars = Base64.encodeBase64(wordDoc);
  String encodedStr = new String(chars, "US-ASCII");
  return encodedStr;
}

You can get the Base64.java from 
http://jakarta.apache.org/commons/codec/apidocs/org/apache/commons/codec/binary/Base64.html

2) Correct me if I am wrong, but I think the latest Lucene dev codebase has
the capability to add binary content directly to the Lucene index.

Looking at 
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/document/Field.java?view=markup
It has:
/**
   * Create a stored field with binary value. Optionally the value may
be compressed.
   * 
   * @param name The name of the field
   * @param value The binary value
   * @param store How <code>value</code> should be stored (compressed or not.)
   */
  public Field(String name, byte[] value, Store store) {
.

So, I guess if you use the latest Lucene dev codebase, you can do:
byte[] wordDoc = getUploadedWordDoc();
Document doc = new Document();
doc.add(new Field("originalDoc", wordDoc, Store.YES));

I think Lucene index is pretty good in terms of storing millions of
small documents. However, there are two concerns that you might
address:

1) There is no transaction support for index manipulation. I am not sure
what happens if the machine gets shut down while the program is storing the
original Word document. Will the index be corrupted?

2) Since a Lucene index is basically files in a physical directory, an
index file's size could eventually hit a hard limit, and then you have to
find another way around it (split the index into two indexes, or perhaps
configure Lucene's IndexWriter.DEFAULT_MAX_MERGE_DOCS?).

For example, I think some version of windoze (e.g., using FAT file
system), has a file size limit of 2GB.

Let me know if this helps.

Cheers,

Jian


On 6/29/05, bib_lucene bib [EMAIL PROTECTED] wrote:
 Thanks Jian
 
 I need to retrive the original document sometimes. I did not quite understand 
 your second suggestion.
 Can you please help me understand better, a pointer to some web resource will 
 also help.
 
 jian chen [EMAIL PROTECTED] wrote:
 Hi,
 
 Depending on the operating system, there might be a hard limit on the
 number of files in one directory (windoze versions). Even with
 operating systems that don't have a hard limit, it is still better not
 to put too many files in one directory (linux).
 
 Typically, the file system won't be very efficient in terms of file
 retrieval if there are nore than couple thousand files in one
 directory.
 
 There are some ways to tackle this issue.
 
 1) Use a hash function to distribute the files to different sub
 directories based on the file name. For example, use the MD5 algorithm
 in Java or CRC algorithm in java, hash the file name to a number, use
 this number to construct directory. For example, if the number you
 hashed is 123456, then, you can make 123 as a sub-dir name, and 456 as
 the sub-sub dir name, so forth.
 
 I think the SQUID web proxy server uses this approach to do the file caching.
 
 2) Why not use Lucene's indexing algorithm and store binary files with
 lucene index?! I love the indexing algorithm, in that, you don't need
 to manage the free space like that in a typical file system. Because
 the merge process will take care of reclaiming the free space
 automatically.
 
 Should these two advices be good?
 
 Jian
 
 On 6/29/05, bib_lucene bib wrote:
  Hi All
 
  In my webapp i have people uploading their documents. My server is 
  windows/tomcat. I am thinking there will be a limit on the no of files in a 
  directory. Typically apllication users will load 3-5 page word docs.
 
  1. How does one design the system such that there will not be any problem 
  as the users keep uploading their files, even if a million files are 
  reached.
  2. Is there a sample application that does this.
  3. Should I have lucene update index after each upload or should I do it 
  like once a day.
 
  Thanks
  Bib
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 

question regarding the commit.lock

2005-06-29 Thread jian chen
Hi,

I am looking at and trying to understand more about Lucene's
reader/writer synchronization. Does anyone know when the commit.lock
is released? I could not find it anywhere in the source code.

I did see the write.lock is released in IndexWriter.close().

Thanks,

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Design question [too many fields?]

2005-06-29 Thread jian chen
Hi, Naimdjon,

I have some suggestions as well along the lines of Mark Harwood. 

As an example, suppose for each hotel room there is a description, and
you want the user to  do free text search on the description field.
You could do the following:

1) store hotel room reservation info as rows in a relational database
create table reservation (
id int,
room_no int,
reservation_start_date timestamp,
reservation_end_date timestamp,
primary key (id)
)

2) store description for each hotel room in Lucene index with two
fields, i.e., room_no, description

3) provide the user with free text search in room description as well
as availability info like the following:
--do a full text search on the description using the Lucene index
--get the room numbers from the matching documents
--using these room numbers, look up the reservation table to check that the
user-specified start date and end date are not already reserved
--the top several rooms that rank high in the full text search and are also
not reserved are returned to the user (see the sketch below)
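A rough sketch of that flow (the JDBC table is the reservation table above, the Lucene fields room_no and description follow point 2, and everything else, names, parsing, limits, is just illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class RoomSearchSketch {
  public static List findRooms(IndexSearcher searcher, Connection db,
                               String freeText, Timestamp start, Timestamp end,
                               int maxRooms) throws Exception {
    // 1) free text search on the description field
    Query q = QueryParser.parse(freeText, "description", new StandardAnalyzer());
    Hits hits = searcher.search(q);

    // 2) for each candidate room, check the reservation table for an overlap
    PreparedStatement overlaps = db.prepareStatement(
        "select count(*) from reservation where room_no = ? " +
        "and reservation_start_date < ? and reservation_end_date > ?");
    List available = new ArrayList();
    for (int i = 0; i < hits.length() && available.size() < maxRooms; i++) {
      String roomNo = hits.doc(i).get("room_no");
      overlaps.setInt(1, Integer.parseInt(roomNo));
      overlaps.setTimestamp(2, end);
      overlaps.setTimestamp(3, start);
      ResultSet rs = overlaps.executeQuery();
      rs.next();
      if (rs.getInt(1) == 0) {      // no overlapping reservation found
        available.add(roomNo);
      }
      rs.close();
    }
    overlaps.close();
    return available;               // top rooms by text relevance, not reserved
  }
}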

How does this sound?

Jian

On 6/29/05, Erik Hatcher [EMAIL PROTECTED] wrote:
 I second Mark's suggestion over the alternative I posted.  My
 alternative was merely to invert the field structure originally
 described, but using a Filter for the volatile information is wiser.
 
  Erik
 
 On Jun 29, 2005, at 9:58 AM, mark harwood wrote:
 
  Presumably there is also a free-text element to the
  search or you wouldn't be using Lucene.
 
  Multiple fields is not the way to go.
  A single Lucene field could contain multiple terms (
  the available dates) but I still don't think that's
  the best solution.
  The availability info is likely to be pretty volatile
  and you always want up-to-date info so I would prefer
  to hit a database for this. If you keep a DB primary
  key to Lucene doc id look-up cached in memory you can
  quickly construct a Lucene filter from the database
  results and therefore only show Lucene results for
  available rooms.
 
  Cheers
  Mark
 
 
 
  ___
  How much free photo storage do you get? Store your holiday
  snaps for FREE with Yahoo! Photos http://uk.photos.yahoo.com
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Strategy for making short documents not bubble to the top?

2005-06-29 Thread jian chen
Hi,

I would use a pure span or cover density based ranking algorithm, which does
not take document length into consideration (tweaking whatever is currently
in the standard Lucene distribution?).

For example, when searching for the keywords "beautiful house", span/cover
ranking will give a long document and a short document the same ranking as
long as they have the same number of spans/covers (for example, "beautiful
xx house" is one cover) and, within each span/cover, the edit distance
between the keywords is the same.

Just my 2 cents, 

Cheers,

Jian

On 29 Jun 2005 20:30:49 -, [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
 Hi,
 
 Short documents bubble to the top of the results because the field
 length is short.  Does anyone have a good strategy for working around this?
  Will doing something like log(document length) flatten out my results while
 still making them meaningful?  I'm going to try some different approaches
 but any advice is appreciated.
 
 Thanks.
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: No.of Files in Directory

2005-06-29 Thread jian chen
Hi,

Depending on the operating system, there might be a hard limit on the
number of files in one directory (windoze versions). Even with
operating systems that don't have a hard limit, it is still better not
to put too many files  in one directory (linux).

Typically, the file system won't be very efficient in terms of file
retrieval if there are more than a couple thousand files in one
directory.

There are some ways to tackle this issue.

1) Use a hash function to distribute the files into different sub-directories
based on the file name. For example, use the MD5 algorithm or a CRC algorithm
in Java: hash the file name to a number and use this number to construct the
directory. For example, if the number you hashed is 123456, you can use 123
as the sub-dir name and 456 as the sub-sub-dir name, and so forth (see the
sketch below).

I think the SQUID web proxy server uses this approach to do the file caching.
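Here is a small sketch of the hashing idea (hex buckets instead of the decimal 123456 example, but the same structure; the class and method names are made up):

import java.io.File;
import java.security.MessageDigest;

public class HashedDirectorySketch {
  // Map a file name to a two-level sub-directory, e.g. uploads/3f/a2/report.doc
  public static File locate(File root, String fileName) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(fileName.getBytes("UTF-8"));

    String hex = toHex(digest);
    File dir = new File(root, hex.substring(0, 2) + File.separator + hex.substring(2, 4));
    dir.mkdirs(); // make sure the bucket directories exist
    return new File(dir, fileName);
  }

  private static String toHex(byte[] bytes) {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < bytes.length; i++) {
      int b = bytes[i] & 0xff;
      if (b < 16) sb.append('0');
      sb.append(Integer.toHexString(b));
    }
    return sb.toString();
  }
}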

2) Why not use Lucene's indexing algorithm and store the binary files within
the Lucene index?! I love the indexing algorithm in that you don't need to
manage free space the way a typical file system does, because the merge
process takes care of reclaiming free space automatically.

Are these two suggestions any good?

Jian

On 6/29/05, bib_lucene bib [EMAIL PROTECTED] wrote:
 Hi All
 
  In my webapp I have people uploading their documents. My server is
  windows/tomcat. I am thinking there will be a limit on the number of files
  in a directory. Typically application users will upload 3-5 page Word docs.
 
 1. How does one design the system such that there will not be any problem as 
 the users keep uploading their files, even if a million files are reached.
 2. Is there a sample application that does this.
 3. Should I have lucene update index after each upload or should I do it like 
 once a day.
 
 Thanks
 Bib
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lock File exceptions

2005-06-27 Thread jian chen
Hi,

Recently I looked at the locking mechanism of Lucene. If I am correct, the
process of grabbing the lock file times out by default after 10 seconds.
When it times out, it throws the IOException you are seeing.

The Lucene locking mechanism is not limited to threads within the same JVM.
It uses lock files so that other processes (even a Perl program) can also be
synchronized when accessing the index.

The current Lucene locking implementation uses polling, i.e., it repeatedly
checks whether the lock file can be obtained. It would be better if a
wait/notify mechanism were used rather than polling.

If you don't care about access from other JVMs or processes, maybe you can
use the Java 1.5 reader/writer lock mechanism for synchronizing between
multiple readers and one writer, as sketched below.
Cheers,

Jian

On 6/27/05, Yousef Ourabi [EMAIL PROTECTED] wrote:
 Hello:
 I get this lock-file exception on both Windows and Linux, my app is
 running inside tomcat 5.5.9, jvm 1.5.03...has anyone seen this before?
 
 If I delete the LOCK file it works, but obviously I shouldn't do
 that...Just wondering what's up?
 
 IOException caught here: Lock obtain timed out:
 Lock@/usr/local/java/jakarta-tomcat-5.5.9/temp/lucene-4f978fb745a946b4dbce87bf411caa25-write.lock
 
 Thanks in advance for any help.
 
 -Yousef
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Fwd: when is the commit.lock released?

2005-06-27 Thread jian chen
Hi, 

I haven't heard anything back. Probably this email got lost along the way
somehow.

Anyway, could anyone enlighten me on this?

Thanks,

Jian

-- Forwarded message --
From: jian chen [EMAIL PROTECTED]
Date: Jun 26, 2005 12:59 PM
Subject: when is the commit.lock released?
To: java-user@lucene.apache.org


Hi,

I am looking at and trying to understand more about Lucene's
reader/writer synchronization. Does anyone know when the commit.lock
is released? I could not find it anywhere in the source code.

I did see the write.lock is released in IndexWriter.close().

Thanks,

Jian

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



when is the commit.lock released?

2005-06-26 Thread jian chen
Hi,

I am looking at and trying to understand more about Lucene's
reader/writer synchronization. Does anyone know when the commit.lock
is released? I could not find it anywhere in the source code.

I did see the write.lock is released in IndexWriter.close().

Thanks,

Jian

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Span query performance issue

2005-06-24 Thread jian chen
Hi,

I think a span query in general does more work than a simple phrase query. A
phrase query, in its simplest form, just tries to find all terms that are
adjacent to each other, whereas a span query does not require the terms to
be adjacent; there can be other words in between.

Therefore, I think a span query can be expected to be slower than a phrase
query. That said, span queries are way more powerful than phrase queries.
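To make the comparison concrete, here are the two kinds of query side by side (a sketch; the field name and terms are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class NearVsPhraseSketch {
  public static void main(String[] args) {
    // phrase query with slop: terms must occur near each other
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("body", "classical"));
    phrase.add(new Term("body", "music"));
    phrase.setSlop(3);

    // span query: a similar proximity constraint, but it also tracks the
    // matching spans themselves, which is extra bookkeeping at search time
    SpanQuery near = new SpanNearQuery(
        new SpanQuery[] {
          new SpanTermQuery(new Term("body", "classical")),
          new SpanTermQuery(new Term("body", "music"))
        },
        3,      // slop: up to 3 positions apart
        true);  // must appear in order

    System.out.println(phrase + "  vs  " + near);
  }
}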

Jian

On 25 Jun 2005 00:00:18 -, [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
 Hi,
 
 I'm comparing SpanNearQuery to PhraseQuery results and noticing about
 an 8x difference on Linux.  Is a SpanNearQuery doing 8x as much work?
 
 
 I'm considering diving into the code if the results sounds unusual to people.
  But if its really doing that much more work, I won't spend time optimizing
 something that can't get much faster.
 
 Thanks.
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



document ids in cached in Hits and index merge

2005-06-24 Thread jian chen
Hi,

I have a stupid question regarding the transient nature of document ids.

As I understand it, documents obtain new doc ids during an index merge.
Suppose you do a search and get a Hits object, and while you iterate through
the documents by id, an index merge happens. How do the merge and the newly
created ids not mess up the retrieval of the Hits documents?

Could anyone please enlighten me on this synchronization issue?

Thanks a lot,

Jian

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Updateing Documents:

2005-06-21 Thread jian chen
Hi,

You may look at this website
http://www.zilverline.org

Cheers,

Jian

On 6/21/05, Markus Atteneder [EMAIL PROTECTED] wrote:
 I am looking for a SearchEngine for our Intranet and so i deal with Lucene.
 I  have read the FAQ and some Postings and i got first experiences with it
 and now i have some questions.
 1. Is lucene a suitable SearchEngine for a Intranetsearch? I've experienced
 with poi and pdfbox for indexing Word/Excel/PDF files.
 2. Files are changing frequently, so the indexing should run at least daily.
 Is there a possibility out of the box to delete changed files from the index
 and readd them to the index? I've read that documents only can be deleted if
 you know the ID of the document in the index and that could change after a
  optimization of the index. Is there a best practice for that? I think a
  full reindex every day is not a good solution because of the data volume.
 3. Does anyone know a project based on lucene that offers a complete
 solution for a Intranetsearch?
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi,

Interesting topic. I thought about this as well. I wanted to index
Chinese text with English, i.e., I want to treat the English text
inside Chinese text as English tokens rather than Chinese text tokens.

Right now I think maybe I have to write a special analyzer that takes the
text input and detects whether each character is an ASCII char; if it is,
assemble the run of ASCII characters into one token, and if not, emit the
character as a Chinese word token.

So, the bottom line is: one analyzer for all the text, with the if/else
logic inside the analyzer, roughly as sketched below.
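Here is a rough sketch of that if/else idea against the old Token-returning TokenStream API (the class name is made up, lowercasing and the CJK block check are my own simplifications, and it only illustrates the tokenizer, not a full analyzer):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

// Runs of ASCII letters/digits become one token, every CJK character becomes
// its own token, everything else is treated as a separator.
public class AsciiCjkTokenizer extends Tokenizer {
  private int offset = 0;       // characters consumed so far
  private int pushback = -1;    // one-character pushback buffer

  public AsciiCjkTokenizer(Reader input) {
    super(input);
  }

  public Token next() throws IOException {
    int c;
    do {                        // skip separator characters
      c = read();
      if (c == -1) return null;
    } while (!isAscii((char) c) && !isCjk((char) c));

    int start = offset - 1;
    if (isCjk((char) c)) {      // one token per Chinese character
      return new Token(String.valueOf((char) c), start, offset);
    }

    StringBuffer word = new StringBuffer();   // assemble an ASCII run
    word.append((char) c);
    while ((c = read()) != -1) {
      if (isAscii((char) c)) {
        word.append((char) c);
      } else {
        pushback = c;           // give the character back for the next call
        offset--;
        break;
      }
    }
    return new Token(word.toString().toLowerCase(), start, start + word.length());
  }

  private int read() throws IOException {
    if (pushback >= 0) {
      int c = pushback;
      pushback = -1;
      offset++;
      return c;
    }
    int c = input.read();
    if (c != -1) offset++;
    return c;
  }

  private static boolean isAscii(char c) {
    return c < 128 && Character.isLetterOrDigit(c);
  }

  private static boolean isCjk(char c) {
    return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
  }
}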

I would like to learn more thoughts about this!

Thanks,

Jian

On 5/31/05, Tansley, Robert [EMAIL PROTECTED] wrote:
 Hi all,
 
 The DSpace (www.dspace.org) currently uses Lucene to index metadata
 (Dublin Core standard) and extracted full-text content of documents
 stored in it.  Now the system is being used globally, it needs to
 support multi-language indexing.
 
 I've looked through the mailing list archives etc. and it seems it's
 easy to plug in analyzers for different languages.
 
 What if we're trying to index multiple languages in the same site?  Is
 it best to have:
 
 1/ one index for all languages
 2/ one index for all languages, with an extra language field so searches
 can be constrained to a particular language
 3/ separate indices for each language?
 
 I don't fully understand the consequences in terms of performance for
 1/, but I can see that false hits could turn up where one word appears
 in different languages (stemming could increase the changes of this).
 Also some languages' analyzers are quite dramatically different (e.g.
 the Chinese one which just treats every character as a separate
 token/word).
 
 On the other hand, if people are searching for proper nouns in metadata
 (e.g. DSpace) it may be advantageous to search all languages at once.
 
 
 I'm also not sure of the storage and performance consequences of 2/.
 
 Approach 3/ seems like it might be the most complex from an
 implementation/code point of view.
 
 Does anyone have any thoughts or recommendations on this?
 
 Many thanks,
 
  Robert Tansley / Digital Media Systems Programme / HP Labs
   http://www.hpl.hp.com/personal/Robert_Tansley/
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi, Erik,

Thanks for your info. 

No, I haven't tried it yet. I will give it a try and maybe produce
some Chinese/English text search demo online.

Currently I use Lucene as the indexing engine for a Velocity mailing
list search. I have a demo at www.jhsystems.net.

It is yet another mailing list search for Velocity, but I combine date
search and full text search together.

I only use Lucene for indexing the textual content, and combine the
database search with the Lucene search when returning results.

The other interesting thought I have is that maybe it is possible to use
Lucene's segment-merging mechanism to write a simple Java-based file
system, which, of course, would not require a constant compaction
operation. The file system would be based on one file only, where segments
are just parts of the big file. It might be really efficient for
adding/deleting objects all the time.

Lastly, any comments on the www.jhsystems.net Velocity search are welcome.

Thanks,

Jian
www.jhsystems.net

On 5/31/05, Erik Hatcher [EMAIL PROTECTED] wrote:
 Jian - have you tried Lucene's StandardAnalyzer with Chinese?  It
 will keep English as-is (removing stop words, lowercasing, and such)
 and separate CJK characters into separate tokens also.
 
  Erik
 
 
 On May 31, 2005, at 5:49 PM, jian chen wrote:
 
  Hi,
 
  Interesting topic. I thought about this as well. I wanted to index
  Chinese text with English, i.e., I want to treat the English text
  inside Chinese text as English tokens rather than Chinese text tokens.
 
  Right now I think maybe I have to write a special analyzer that takes
  the text input, and detect if the character is an ASCII char, if it
  is, assembly them together and make it as a token, if not, then, make
  it as a Chinese word token.
 
  So, bottom line is, just one analyzer for all the text and do the
  if/else statement inside the analyzer.
 
  I would like to learn more thoughts about this!
 
  Thanks,
 
  Jian
 
  On 5/31/05, Tansley, Robert [EMAIL PROTECTED] wrote:
 
  Hi all,
 
  The DSpace (www.dspace.org) currently uses Lucene to index metadata
  (Dublin Core standard) and extracted full-text content of documents
  stored in it.  Now the system is being used globally, it needs to
  support multi-language indexing.
 
  I've looked through the mailing list archives etc. and it seems it's
  easy to plug in analyzers for different languages.
 
  What if we're trying to index multiple languages in the same
  site?  Is
  it best to have:
 
  1/ one index for all languages
  2/ one index for all languages, with an extra language field so
  searches
  can be constrained to a particular language
  3/ separate indices for each language?
 
  I don't fully understand the consequences in terms of performance for
  1/, but I can see that false hits could turn up where one word
  appears
  in different languages (stemming could increase the changes of this).
  Also some languages' analyzers are quite dramatically different (e.g.
  the Chinese one which just treats every character as a separate
  token/word).
 
  On the other hand, if people are searching for proper nouns in
  metadata
  (e.g. DSpace) it may be advantageous to search all languages at
  once.
 
 
  I'm also not sure of the storage and performance consequences of 2/.
 
  Approach 3/ seems like it might be the most complex from an
  implementation/code point of view.
 
  Does anyone have any thoughts or recommendations on this?
 
  Many thanks,
 
   Robert Tansley / Digital Media Systems Programme / HP Labs
http://www.hpl.hp.com/personal/Robert_Tansley/
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]