Re: Lucene memory consumption
Not that I can think of. But if you have any cached field data or norms arrays, those could be huge. I would be interested in hearing from others on this topic as well. Jian On 5/29/08, Alex [EMAIL PROTECTED] wrote: Hi, other than the in-memory term index (.tii) and the few kilobytes of open file buffers, what are some other sources of significant memory consumption when searching on a large index (100GB or more)? The queries are just normal term queries. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
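The norms arrays mentioned above are a good first thing to budget for: in Lucene of this era, loaded norms take roughly one byte per document per indexed field with norms. A back-of-the-envelope sketch (pure Java, figures hypothetical):

```java
// Rough estimate of heap used by norms: about one byte per document per
// indexed field once the norms are loaded. The document counts below are
// made-up illustration values, not taken from the thread.
public class NormsMemoryEstimate {
    // Approximate bytes of heap consumed by norms.
    static long normsBytes(long numDocs, int numIndexedFields) {
        return numDocs * numIndexedFields;
    }

    public static void main(String[] args) {
        // Hypothetical: 50 million docs, 10 indexed fields with norms.
        long bytes = normsBytes(50_000_000L, 10);
        System.out.println(bytes / (1024 * 1024) + " MB of norms");
    }
}
```

For a 100GB index with tens of millions of documents, this is often where "mystery" heap goes.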
simultaneous read and writes to the RAMDirectory
Lucene gurus, I have a question regarding RAMDirectory usage. Can the IndexWriter keep adding documents to the index while an IndexReader is open on the same RAMDirectory and searches are going on? I know that in the FSDirectory case, the IndexWriter can add documents to the index while an IndexReader reads from it. This is because the IndexWriter only writes new index files rather than modifying existing ones. The only place (that I remember) where the new and old indexes could conflict is the segments file. And once the IndexWriter commits the change (by calling the close() method), segments.new is renamed to segments atomically. Since the old segments file is cached in memory by the IndexReader, it is not a problem for the IndexReader to keep serving search requests. While the old segments file is cached in memory, the other files it points to are cached by Linux anyway, or not removed by Windows because they are still in use. Anyway, back to the RAMDirectory case: will having an IndexReader open while the IndexWriter is adding documents cause any issues? Thanks, Jian
two copies of indexes vs. master/slave indexes
I have seen two different designs for incremental index updates. 1) Keep two copies of the index, A and B. Incremental updates happen on index A while index B is being used for search. Then hot-swap the two indexes, bring index B up to date, and perform incremental updates on it thereafter. In this scenario, searches are served by index A and B alternately. 2) Keep a master index where the incremental updates are applied, and sync the slave indexes up with the master. Searches are performed only on the slave indexes. So, what are the trade-offs between the two approaches? For scalability, which is the better approach? Thanks, Jian
Re: Build vs. Buy?
For reading Word documents as text, you can try Antiword. I have written a simplified Lucene that does max-words matching. For example, if you are searching for aa, bb, cc, then a document that contains all three words (aa, bb, cc) will definitely be ranked higher than documents containing only aa, bb or aa, cc or bb, cc. I am going to put up the code as open source; if you are interested, you can email me directly. Jian On 2/9/06, P. Alex. Salamanca R. [EMAIL PROTECTED] wrote: On the other hand, if you want the cheapest option, why not give the Google Search Appliance a chance?
Re: Urgent - File Lock in Lucene 1.2
Hi, Karl, There have been quite a few discussions about the "too many open files" problem. From my understanding, it is caused by Lucene trying to open multiple segments at the same time (during search or segment merging), while the operating system won't allow that many open file handles. If you have a lot of fields, each will have its own file (or set of files, maybe; I can't remember). This could cause the above issue. The way to fix it is to combine all the files for each segment into one physical file; when the physical file is opened, multiple streams are read from it. This fix went into Lucene 1.4, I think, but is not available in Lucene 1.2. Currently I am trying to find some spare time to port the compound file format (.cfs) feature from Lucene 1.4 back to Lucene 1.2. Hope this information helps. Cheers, Jian On 11/20/05, Karl Koch [EMAIL PROTECTED] wrote: Hello group, I am running Lucene 1.2 and I got the following error message when performing a search: Failed to obtain file lock on /tmp/qcop-msg-qpe I am running Lucene 1.2 on a Sharp Zaurus PDA with embedded Linux. When I look through the exceptions before that, I can see an IOException "Too many open files" happening somewhere in the FSDirectory... Regards, Karl
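The idea behind the compound file format described above can be sketched in pure Java: many per-segment files are packed into one byte stream plus a directory of (name, offset, length), so only one OS file handle is ever needed. This is only an illustration of the concept, not Lucene's actual .cfs layout.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual compound file: pack named byte blobs into one stream and
// record where each one starts, then read sub-ranges back out by name.
public class CompoundFileSketch {
    // Packs the files, filling dirOut with {offset, length} per name.
    static byte[] pack(Map<String, byte[]> files, Map<String, int[]> dirOut) throws IOException {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            dirOut.put(e.getKey(), new int[]{data.size(), e.getValue().length});
            data.write(e.getValue());
        }
        return data.toByteArray();
    }

    // Reads one logical file back out of the packed stream.
    static byte[] read(byte[] packed, Map<String, int[]> dir, String name) {
        int[] entry = dir.get(name);
        return Arrays.copyOfRange(packed, entry[0], entry[0] + entry[1]);
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("_1.frq", "freqs".getBytes());
        files.put("_1.prx", "positions".getBytes());
        Map<String, int[]> dir = new LinkedHashMap<>();
        byte[] packed = pack(files, dir);
        System.out.println(new String(read(packed, dir, "_1.prx")));
    }
}
```

The file-handle savings come entirely from the single packed stream: readers open one handle and seek, exactly the trade Lucene makes with .cfs.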
Re: List of removed stop words?
Hi, In case you are using StandardAnalyzer, there is a stop word list: I have used StandardAnalyzer.STOP_WORDS, which is a String[]. Cheers, Jian On 10/31/05, Rob Young [EMAIL PROTECTED] wrote: Hi, Is there an easy way to list the stop words that were removed from a string? I'm using the standard analyzer on users' search strings and I would like to let them know when stop words have been removed (a la Google). Any ideas? Cheers, Rob
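A minimal sketch of the suggestion above, in pure Java: build a set from the analyzer's stop list and report which query tokens fall in it. The small array here merely stands in for StandardAnalyzer.STOP_WORDS; the real report would also have to tokenize the same way the analyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Report which stop words a query lost, Google-style. The stop list below
// is a stand-in for StandardAnalyzer.STOP_WORDS.
public class StopWordReport {
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    // Returns the stop words found in the query, in order of appearance.
    static List<String> removedStopWords(String query) {
        List<String> removed = new ArrayList<>();
        for (String token : query.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(token)) removed.add(token);
        }
        return removed;
    }

    public static void main(String[] args) {
        System.out.println(removedStopWords("the history of jazz"));
    }
}
```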
Re: trying to boost a phrase higher than its individual words
Hi, It seems what you want to achieve could be implemented using the cover density algorithm. I am not sure whether any existing query class in the Lucene distribution does this already, but in case not, this is what I am thinking about: make a custom query class, called CoverDensityQuery, modeled after PhraseQuery. CoverDensityQuery could accept two arguments in its constructor, the terms and the number of terms that must match. For example, to search for classical music, you would first construct a CoverDensityQuery like: new CoverDensityQuery(new String[]{"classical", "music"}, 2); This should return all documents that contain both "classical" and "music". The ranking would be based on covers, where each cover is a span with the two terms at its ends: the shorter the cover, the higher the rank, and the more covers, the higher the rank. If not enough documents are returned, then do another query like: new CoverDensityQuery(new String[]{"classical", "music"}, 1); This should return documents containing either "classical" or "music", but not both. The detailed algorithm would be constructed similarly to PhraseQuery. I will write such a query class in the future, as a proof of concept for the cover density algorithm. Cheers, Jian On 10/27/05, Andy Lee [EMAIL PROTECTED] wrote: I have a situation where I want to search for the individual words in a phrase as well as the phrase itself. For example, if the user enters ["classical music"] (with quotes) I want to find documents that contain "classical music" (the phrase) *and* the individual words "classical" and "music". Of course, I could just search for the individual words and the phrase would be found as a consequence. But I want documents containing the phrase to appear first in the search results, since the phrase is the user's primary interest. I've constructed the following query, using boost values... [+(content:"classical music"^5.0 content:classical^0.1 content:music^0.1)] ...but the boost values don't seem to affect the order of the search results.
Am I misunderstanding the purpose or proper usage of boosts, and if so, can someone explain (at least roughly) how to achieve the desired result? --Andy
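The cover-density ranking suggested in the reply can be sketched in pure Java. This is a deliberate simplification, not Clarke's full algorithm: for each occurrence of one term it takes the nearest occurrence of the other as the cover and rewards short covers, which is enough to show why a document with the exact phrase (adjacent positions) outranks one with the words scattered.

```java
// Toy cover-density scorer over token positions. posA and posB are the
// sorted positions of the two query terms within a single document.
public class CoverDensity {
    static double score(int[] posA, int[] posB) {
        double s = 0.0;
        for (int a : posA) {
            // Nearest-occurrence distance stands in for the minimal cover.
            int nearest = Integer.MAX_VALUE;
            for (int b : posB) nearest = Math.min(nearest, Math.abs(a - b));
            s += 1.0 / (1 + nearest);  // shorter cover => bigger contribution
        }
        return s;
    }

    public static void main(String[] args) {
        double phraseDoc = score(new int[]{7}, new int[]{8});   // adjacent: "classical music"
        double scattered = score(new int[]{7}, new int[]{40});  // words far apart
        System.out.println(phraseDoc > scattered);
    }
}
```

Unlike a flat boost on the phrase clause, this makes proximity itself the ranking signal, which is the behavior Andy is after.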
Re: java on 64 bits
Hi, Also, I think you may try increasing the indexInterval. It is set to 128 by default, but making it larger will make the .tii files smaller. Since the .tii files are loaded into memory as a whole, your memory usage might be lower. However, this change might affect your search speed, so be careful about the value you set; not too high, though. Just my thoughts; hope this helps. Jian On 10/21/05, Aigner, Thomas [EMAIL PROTECTED] wrote: I have seen quite a few posts on using the 1.9 dev version for production. How stable is it? Is it really ready for production? I would like to use it... but I never put beta packages in production... but then again... I'm always dealing with Microsoft :) Tom -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: Friday, October 21, 2005 9:28 AM To: java-user@lucene.apache.org Subject: Re: java on 64 bits 1) make sure the failure was due to an OutOfMemory exception and not something else. 2) if you have enough memory, increase the max JVM heap size (-Xmx) 3) if you don't need more than 1.5G or so of heap, use the 32-bit JVM instead (depending on architecture, it can actually be a little faster because more references fit in the CPU cache). 4) see how many indexed fields you have and whether you can consolidate any of them 4.5) if you don't have too many indexed fields, and have enough spare file descriptors, try using the non-compound file format instead. 5) run with the latest version of Lucene (the 1.9 dev version), which may have better memory usage during optimizes and segment merges. 6) if/when optional norms http://issues.apache.org/jira/browse/LUCENE-448 makes it into Lucene, you can apply it to any indexed fields for which you don't need index-time boosting or length normalization. As for getting rid of your current intermediate files, I'd rebuild from scratch just to ensure things are OK.
-Yonik Now hiring -- http://tinyurl.com/7m67g On 10/21/05, Roxana Angheluta [EMAIL PROTECTED] wrote: Thank you, Yonik, it seems this is the case. What can we do about it? Would running the program with java -d32 be a solution? Thanks again, roxana One possibility: if Lucene runs out of memory while adding or optimizing, it can leave unused files behind that increase the size of the index. A 64-bit JVM will require more memory than a 32-bit one, due to the size of all references being doubled. If you are using the compound file format (the default - check for .cfs files), then it's easy to check whether you have this problem by seeing whether there are any *.f* files in the index directory. These are intermediate files and shouldn't exist for long in a compound-file index. -Yonik Now hiring -- http://tinyurl.com/7m67g
Re: Large queries
Hi, Trond, It should be no problem for Lucene to handle 6 million documents. For your query, it seems you want to do a disjunctive (OR'ed) query over multiple terms, 10 terms or 1 terms for example. In the worst case, you can very easily write your own query class to handle this, utilizing the TermDocs iterator. Say you want documents having one of 10 docIDs: you have 10 TermDocs, each corresponding to a term. Then you can do a multi-way (in this case, 10-way) merge of these 10 TermDocs and generate a final list of doc ids. I suggest you look at PhraseQuery and PhraseScorer to see how they do the conjunctive merge to find the docs that contain all the terms; in your case, instead of an intersection, you are doing a union of all the term docs, right? Maybe there is already some query class in the Lucene package that does this, but the method I described should help just in case. Cheers, Jian On 10/16/05, Trond Aksel Myklebust [EMAIL PROTECTED] wrote: How does Lucene handle very large queries? I have 6 million documents, each with a docID field. There is a total of 2 distinct docIDs, so many documents share the same docID, which consists of a filename (only the name, not the path). Sometimes I must get all documents that have one of 10 docIDs, and sometimes I need all documents that have one of 1 docIDs. Is there any other way than doing a query: docID:(file1 file2 file3 file4..) ? Trond A Myklebust
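The multi-way union merge described above can be sketched in pure Java. Each sorted int array stands in for one term's TermDocs iterator; a priority queue merges them into one sorted, de-duplicated doc-id list, which is exactly what a hand-rolled "one of N docIDs" query would produce.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// K-way union merge of sorted postings lists, the disjunctive counterpart
// of PhraseScorer's conjunctive merge.
public class UnionMerge {
    // Each entry in the heap is {docId, listIndex, positionInList}.
    static List<Integer> union(int[][] postings) {
        PriorityQueue<int[]> pq = new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
        for (int i = 0; i < postings.length; i++)
            if (postings[i].length > 0) pq.add(new int[]{postings[i][0], i, 0});
        List<Integer> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            // Skip duplicates: the same doc may appear in several lists.
            if (out.isEmpty() || out.get(out.size() - 1) != e[0]) out.add(e[0]);
            int next = e[2] + 1;
            if (next < postings[e[1]].length)
                pq.add(new int[]{postings[e[1]][next], e[1], next});
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] postings = {{1, 4, 7}, {2, 4, 9}};
        System.out.println(union(postings));
    }
}
```

With a heap, the merge costs O(total postings x log N lists) regardless of how many docIDs the disjunction names.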
Re: maximum number of documents
Hi, Koji, I think you are right; the max number of documents should be Integer.MAX_VALUE. Some more points below: 1) I double-checked the Lucene documentation. The file formats document says that SegSize is a UInt32. I don't think this is accurate: a UInt32 goes up to around 4 billion, but Integer.MAX_VALUE is half of that, around 2 billion. Java has no notion of an unsigned integer, and since Lucene uses an int to store doc ids, the max you can get is therefore 2 billion. Maybe the documentation could mention this in more detail? Specifically, the actual max document id, 2147483647, could be mentioned. 2) I think in theory, if you index 8 billion docs, you can use 4 indexes, and when you do the search, just search all 4 indexes and combine the result sets. 3) Looking at the Lucene source code, it seems not that difficult to change the doc id to use long instead. It occurs to me that OutputStream's writeVInt and writeVLong use exactly the same code, so there should be no performance penalty in switching to long. 4) However, if you have 8 billion docs to index, just changing the doc id to long is probably not enough. You may also need to adjust other parameters, such as the indexInterval (for the term info index). Because the term info index (.tii) is loaded into memory in its entirety, instead of leaving the interval at 128, you may have to change it to 256 or bigger to avoid out-of-memory issues. Cheers, Jian On 10/12/05, Koji Sekiguchi [EMAIL PROTECTED] wrote: Hello, Is the maximum number of documents in an index Integer.MAX_VALUE? (approx 2 billion) If so, if I want 8 billion docs indexed, like Google, can I do it by having four indices, theoretically? Koji
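Point 3 rests on how Lucene's variable-length integers work: seven data bits per byte with the high bit as a continuation flag, low-order bits first. The same loop serves int and long, so small values cost the same on disk either way. A sketch of that encoding in pure Java:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Lucene-style VInt/VLong encoding: 7 payload bits per byte, high bit set
// on every byte except the last, least-significant bits written first.
public class VIntSketch {
    static byte[] writeVLong(long v) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {       // more than 7 bits remain
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);               // final byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The largest possible doc id, 2147483647, needs only 5 bytes.
        System.out.println(writeVLong(Integer.MAX_VALUE).length + " bytes");
    }
}
```

Because the loop terminates as soon as the remaining bits fit, a doc id that fits in an int encodes identically whether the field is declared int or long.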
Re: Storing HashMap as an UnIndexed Field
Well, you can certainly serialize it into a byte stream and encode that with Base64. Jian On 9/20/05, Mordo, Aviran (EXP N-NANNATEK) [EMAIL PROTECTED] wrote: I can't think of a way you can use serialization, since Lucene only works with strings. -Original Message- From: Tricia Williams [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 20, 2005 3:30 PM To: java-user@lucene.apache.org Subject: RE: Storing HashMap as an UnIndexed Field Do you think there is any way I could use the serialization already built into the HashMap data structure? On Tue, 20 Sep 2005, Mordo, Aviran (EXP N-NANNATEK) wrote: You can store the values as a comma-separated string (which you'll then need to parse manually back into a HashMap) -Original Message- From: Tricia Williams [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 20, 2005 3:14 PM To: java-user@lucene.apache.org Subject: Storing HashMap as an UnIndexed Field Hi, I'd like to store a HashMap of some extra data to be used when a given document is retrieved as a Hit for a query. Adding an UnIndexed Field to an index takes only Strings as parameters. Does anyone have any suggestions on how I might convert the HashMap to a String that can be efficiently recomposed into the desired HashMap on the other end? Thanks, Tricia
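The serialize-then-Base64 round trip suggested above looks like this in pure Java. Note the codec: java.util.Base64 (Java 8+) is used here for self-containment; in 2005 a third-party codec such as Apache Commons would have filled that role.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Base64;
import java.util.HashMap;

// Round-trips a HashMap through Java serialization and Base64, giving a
// plain String suitable for an unindexed, stored field.
public class MapFieldCodec {
    static String encode(HashMap<String, String> map) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(map);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    @SuppressWarnings("unchecked")
    static HashMap<String, String> decode(String field) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(field);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return (HashMap<String, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> m = new HashMap<>();
        m.put("author", "Tricia");
        System.out.println(decode(encode(m)).get("author"));
    }
}
```

Base64 matters because serialized bytes are arbitrary binary, while a stored field must survive as text.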
storing inverted document as a field
Hi, I am playing with the Lucene source code and have this somewhat stupid question, so please bear with me ;-) Basically, I want to implement a custom ranking algorithm. That is, iterating through the documents that contain all the search keywords, and for each document, retrieve its inverted document and rank it based on the inverted document as a whole. With this in mind, I want to store the inverted document for each document as a field. My question is: is this kind of data structure fast enough for searching, compared to the current Lucene approach where the proximity data is stored in the .prx files? I know Lucene has (sloppy) phrase queries and span queries, but I am trying to become more familiar with Lucene by implementing a custom query. Thanks in advance for any suggestions or enlightenment! Jian
Re: Small problem in searching
Hi, I think Lucene transforms a prefix query into sub-queries: searching for a prefix results in a search for all terms that begin with that prefix. For a suffix match, I think you need to do more work than relying on Lucene's query parser. You can iterate over the terms and do an endsWith() call, and if there is a match, perform a normal Lucene search for that term. So effectively you do the same thing as a prefix match: conceptually, loop over all available terms in your dictionary and find all the terms to be used in the actual search. This might be slow. To speed up the performance, you can store all the available terms in memory, and looping through all unique terms is then a breeze. This is what Google used for their prototype search engine way back in 1998 (I guess :-). Cheers, Jian On 9/15/05, tirupathi reddy [EMAIL PROTECTED] wrote: Hi guys, I have a problem while searching using Lucene. Say I have something like "tirupathireddy" or "venkatreddy" in the index. When I search for the string "reddy", I have to get those things (i.e. tirupathireddy and venkatreddy). I have read in Lucene's query syntax that * cannot be used at the start of the search string. So how can I achieve that? I am in very much need of that, so please help me out. With Regards, TirupatiReddy Manyam. Tirupati Reddy Manyam 24-06-08, Sundugaullee-24, 79110 Freiburg GERMANY. Phone: 00497618811257 cell: 004917624649007
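The suffix-scan idea above is a few lines of pure Java. The String array here stands in for the term dictionary you would walk with IndexReader.terms(); each matching term would then feed an ordinary TermQuery.

```java
import java.util.ArrayList;
import java.util.List;

// Suffix matching by brute-force scan of the term dictionary, the
// counterpart of expanding a trailing-wildcard prefix query.
public class SuffixMatch {
    static List<String> termsEndingWith(String[] dictionary, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String term : dictionary)
            if (term.endsWith(suffix)) hits.add(term);
        return hits;
    }

    public static void main(String[] args) {
        String[] dict = {"tirupathireddy", "venkatreddy", "freiburg"};
        System.out.println(termsEndingWith(dict, "reddy"));
    }
}
```

A common refinement is to index each term reversed as well, turning the suffix scan into a cheap prefix lookup; the linear scan above is the simplest thing that works.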
Re: Lucene does NOT use UTF-8.
Hi, It seems to me that, in theory, the Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that Modified UTF-8 is used? Cheers, Jian On 8/26/05, Marvin Humphrey [EMAIL PROTECTED] wrote: Greets, [crossposted to java-user@lucene.apache.org and [EMAIL PROTECTED]] I've delved into the matter of Lucene and UTF-8 a little further, and I am discouraged by what I believe I've uncovered. Lucene should not be advertising that it uses standard UTF-8 -- or even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8. The two distinguishing characteristics of Modified UTF-8 are the treatment of codepoints above the BMP (which are written as surrogate pairs), and the encoding of null bytes as the two-byte sequence 0xC0 0x80 rather than a single 0x00 byte. Both of these became illegal as of Unicode 3.1 (IIRC), because they are not shortest-form, and non-shortest-form UTF-8 presents a security risk. The documentation should really state that Lucene stores strings in a Java-only adulteration of UTF-8, unsuitable for interchange. Since Perl uses true shortest-form UTF-8 as its native encoding, Plucene would have to jump through two efficiency-killing hoops in order to write files that would not choke Lucene: instead of writing out its true, legal UTF-8 directly, it would be necessary to first translate to UTF-16, then duplicate the Lucene encoding algorithm from OutputStream. In theory. Below you will find a simple Perl script which illustrates what happens when Perl encounters malformed UTF-8. Run it (you need Perl 5.8 or higher) and you will see why, even if I thought it was a good idea to emulate the Java hack for encoding Modified UTF-8, trying to make it work in practice would be a nightmare. If Plucene were to write legal UTF-8 strings to its index files, Java Lucene would misbehave and possibly blow up any time a string contained either a 4-byte character or a null byte.
On the flip side, Perl will spew warnings like crazy and possibly blow up whenever it encounters a Lucene-encoded null or surrogate pair. The potential blowups are due to the fact that Lucene and Plucene will not agree on how many characters a string contains, resulting in overruns or underruns. I am hoping that the answer to this will be a fix to the encoding mechanism in Lucene so that it really does use legal UTF-8. The most efficient way to go about this has not yet presented itself. Marvin Humphrey Rectangular Research http://www.rectangular.com/

#!/usr/bin/perl
use strict;
use warnings;

# illegal_null.plx -- Perl complains about non-shortest-form null.
my $data = "foo\xC0\x80\n";
open(my $virtual_filehandle, "+<:utf8", \$data);
print $virtual_filehandle $data;
Re: read past EOF
Hi, It seems this problem only happens when the index files get really large. Could it be that Java has trouble handling very large files on a Windows machine (I guess there is a max file size on Windows)? In Lucene, I think there is a maxDoc kind of parameter you can use to specify that, when the index gets really large and contains more than that many documents, it will not try to merge the index files into one. Could this be used to stop the index files from growing forever? Cheers, Jian On 8/27/05, Ouyang, Hui [EMAIL PROTECTED] wrote: Hi, I had lots of "docs out of order" issues when the index was optimized. I made the changes based on the suggestion in this link http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23650 and it seems that issue is solved. But some indexes get "read past EOF" when I do optimization. The index is over 2G and there are some documents deleted from the index. It is based on Lucene 1.4.3 on Windows. Does anyone know how to avoid this issue? Thx. Regards, hui merging segments _1ny5 (38708 docs) _1ot0 (1000 docs) _1t2m (4810 docs) java.io.IOException: read past EOF at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:218) at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61) at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356) at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:323) at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:429) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:94) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:510) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:370)
Re: Books about Lucene?
Hi, Erik, Some time ago I played with the Lucene 1.2 source code and made some modifications to it, trying to add my own ranking algorithm. I am not sure whether, license-wise, it is permissible to modify the earlier source code, and whether it is allowed to put the modified version, or a description of what I have done, on the wiki? Thanks for your reply. Jian On 8/26/05, Erik Hatcher [EMAIL PROTECTED] wrote: I appreciate the vote of confidence on this, but I am not afraid to admit that I do not consider myself an expert on the deep innards of Lucene. I understand the concepts, and a bit of the internals, but I certainly do not live up to the hype you just bestowed upon me. *blush* Regarding JDK 1.2 - I came to Java at 1.3 and have never used a JDK earlier than that. All the apps I build now are currently on JDK 1.5 (err... 5.0). I do not currently know what would be involved in running Lucene on a 1.2 VM. The first question to ask is whether an earlier version of Lucene is sufficient for the needs of those constrained to JDK 1.2. If not, then we move forward to defining what needs to be changed - a simple compilation of the trunk source code with a 1.2 VM would give away most of the details. As with open source in general, it is about scratching our own itches. If you're using Lucene (or need to use Lucene) in a 1.2 VM, that is your itch to scratch, and I would happily support your efforts in some way in documenting this (either on the wiki or embedded in Lucene's own built-in documentation) or in providing an alternative version of Lucene that is suitable for 1.2 (perhaps by having alternative code in a separate directory within our code repository). If you create such documentation, perhaps you'd be willing to donate it with full attribution to the 2nd edition of LIA. But please don't wait for me to do it, as it really is not something I need personally for any project - all my projects are on JDK 1.5 currently. Erik
Re: Serialized Java Objects
Hi, I don't think it does so by default. But you can certainly serialize the Java object and use Base64 to encode it into a text string; then you can store it as a field. Cheers, Jian On 8/25/05, Kevin L. Cobb [EMAIL PROTECTED] wrote: I just had a thought this morning. Does Lucene have the ability to store serialized Java objects for return during a search? I was thinking that this would be a nifty way to package up all of the return values for a search. Of course, I wouldn't expect the serialized objects to be searchable. Thanks, -Kevin
Re: Integrate Lucene with Derby
Hi, I am also interested in that. I haven't used Derby before, but it seems to be the Java database of choice, as it is open source and a full relational database. I plan to learn the basics of Derby and then think about integrating it with Lucene. Maybe we should post our progress on the integration, and the various integration schemes, in this thread or somewhere else? Thanks, Jian On 8/13/05, Mag Gam [EMAIL PROTECTED] wrote: Are there any documents or plans to integrate Lucene with Apache Derby (the database)?
Re: Integrate Lucene with Derby
I just downloaded a copy of the Derby binary and successfully ran the simple example Java program. It seems Derby is extremely easy to use as an embedded Java database engine. This gave me some confidence that I could integrate Lucene with Derby, and possibly the Jetty server, to make a completely Java-based solution for a hobby search project. I will post more about this integration as I go along. Cheers, Jian www.jhsystems.net On 8/13/05, Mag Gam [EMAIL PROTECTED] wrote: yes. I have been looking for solutions for a while now. I am not too good with Java, but I am learning it... I have asked the kind people of derby-users, and they say there is no solution for this yet. I guess we can ask the people on the -developer list
Re: DOM or XML representation of a query?
Well, good practice, I think, is to decouple the backend from the frontend as much as possible. You might have different versions of Java running on each end, and there might also be code-compatibility issues between versions. Jian On 8/10/05, Andrew Boyd [EMAIL PROTECTED] wrote: Query is Serializable; why not use that? -Original Message- From: Roy Klein [EMAIL PROTECTED] Sent: Aug 10, 2005 10:08 AM To: java-user@lucene.apache.org Subject: DOM or XML representation of a query? Hi, The front-end guys working on my application need a way to pass me complex queries. I was thinking that it'd be pretty straightforward to hand them a package which helps them create a DOM object that describes a query (i.e. nested Booleans combined with phrases and keyword searches, sort by field, etc.). I did a few searches in the archive of this list but didn't find any examples; however, I suspect it's a common requirement amongst members of this list. Can anybody point me at an example of the above? Thanks! Roy Andrew Boyd Software Architect Sun Certified J2EE Architect BB Technical Services Inc. 205.422.2557
Re: Too many open files error using tomcat and lucene
Hi, Dan, I think the problem you mentioned is one that has been discussed many times on this mailing list. The bottom line is that you'd better use the compound file format to store your indexes. I am not sure Lucene 1.3 has that available, but if possible, can you upgrade to Lucene 1.4.3? Cheers, Jian On 7/20/05, Dan Pelton [EMAIL PROTECTED] wrote: We are getting the following error in our tomcat error log: /dsk1/db/lucene/journals/_clr.f7 (Too many open files) java.io.FileNotFoundException: /dsk1/db/lucene/journals/_clr.f7 (Too many open files) at java.io.RandomAccessFile.open(Native Method) We are using the following: lucene-1.3-final, SunOS thor 5.8 Generic_117350-21 sun4u sparc SUNW,Ultra-250, tomcat 4.1.34, Java 1.4.2. Does anyone have any idea how to resolve this? Is it an OS, Java, or Tomcat problem? thanks, Dan P.
Re: Lucene index integrity during a system crash
Hi, Otis, Thanks for your email. As this is very important for using Lucene in our production system, I looked at the code to try to understand it. Here is my observation on why the index won't be corrupted during a system crash. In the IndexWriter.java mergeSegments(...) method, there are two lines: segmentInfos.write(directory); // commit before deleting deleteSegments(segmentsToDelete); // delete unused segments The segmentInfos.write(...) call writes the new segments file as segments.new; once the write is complete, it renames segments.new to segments. I guess the rename operation is atomic, as guaranteed by the operating system; otherwise, the segments file would be left in an inconsistent state during a system crash. It also appears to me that the segments file is the single point of switching from the old set of index segments to the new ones. In case of a system failure, the old segments file will be used anyway, so there is no corruption. Is this understanding correct and thorough? Thanks a lot, Jian On 7/16/05, Otis Gospodnetic [EMAIL PROTECTED] wrote: The only corruption that I've seen mentioned on this list so far was corruption of the segments file, and even that, people have been able to edit manually with a hex editor. Otis --- jian chen [EMAIL PROTECTED] wrote: Hi, I know Lucene does not have transaction support at this stage. However, I want to know what will happen if there is an operating system crash during the indexing process; will the Lucene index get corrupted? Thanks, Jian
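The write-then-rename commit pattern described above can be demonstrated with the standard library. Files.move with ATOMIC_MOVE makes the OS rename guarantee explicit: readers holding the old file keep working, and a crash before the rename leaves the old "segments" untouched. (The file names and contents are illustrative, not Lucene's real on-disk data.)

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Commit by writing "segments.new" and atomically renaming it over
// "segments", mirroring IndexWriter's segmentInfos.write(...) behavior.
public class AtomicCommit {
    static void commit(Path dir, String contents) throws IOException {
        Path tmp = dir.resolve("segments.new");
        Path live = dir.resolve("segments");
        Files.write(tmp, contents.getBytes(StandardCharsets.UTF_8));
        // Either the old file or the new one is visible, never a mix.
        Files.move(tmp, live, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index");
        commit(dir, "generation 1");
        commit(dir, "generation 2");
        System.out.println(new String(
            Files.readAllBytes(dir.resolve("segments")), StandardCharsets.UTF_8));
    }
}
```

This single-rename commit point is exactly why junk segment files left by a crash are harmless: nothing references them until a segments file naming them is successfully renamed into place.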
Re: Lucene index integrity during a system crash
Thanks, Otis and Nikhil, for your confirmation. I am more confident about the Lucene index integrity now. Nikhil, I don't see the reason for the corrupted .fdx file. Could it be caused by multi-threaded access to the index? Otis, I don't remember asking about locking the other day; I think that must have been another guy. Thanks all, Jian On 7/16/05, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi Jian, Yes, I think what you describe is correct. You may end up with some junk index segments in the index directory, but as long as they are not recorded in the segments file, they are irrelevant. Otis P.S. Did you ask about locking in Lucene the other day?
Otis --- jian chen [EMAIL PROTECTED] wrote: Hi, I know Lucene does not have transaction support at this stage. However, I want to know what will happen if there is an operating system crash during the indexing process, will the Lucene index got corrupted? Thanks, Jian - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
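The atomic-rename commit described in this thread can be sketched in a few lines. Note this uses the modern java.nio.file API rather than the File.renameTo call Lucene actually used at the time, so treat it as an illustration of the idea, not Lucene's code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicCommit {

    // Write the new segments list to segments.new, then atomically
    // rename it over the live segments file. A crash before the move
    // leaves the old segments file intact; a crash after it leaves the
    // new one. Readers never observe a half-written commit point.
    static void commitSegments(Path indexDir, byte[] newSegments) throws IOException {
        Path tmp = indexDir.resolve("segments.new");
        Path live = indexDir.resolve("segments");
        Files.write(tmp, newSegments);            // may be torn by a crash
        Files.move(tmp, live,                     // the actual commit point
                StandardCopyOption.ATOMIC_MOVE,
                StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("idx");
        commitSegments(dir, "seg_1 seg_2".getBytes());
        System.out.println(new String(Files.readAllBytes(dir.resolve("segments"))));
    }
}
```

The stale segment files left behind by a crash are exactly the "junk index segments" Otis mentions: unreferenced by the committed segments file, hence harmless.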
Re: non-lexical comparisons
Yeah, RDBMS makes sense. In this case, would it be better to simply store those in a relational database and just use Lucene to do the indexing for the text? Cheers, Jian On 7/7/05, Leos Literak [EMAIL PROTECTED] wrote: I know the answer, but just for curiosity: have you guys ever thought about non-lexical comparison support? For example, I started to index the number of replies in a discussion, so I can find questions without an answer, with one reply, two comments, etc. But I cannot simply express that I want to find questions with more than five comments (there are ways using regexps, but I don't consider them simple). Probably such a feature belongs in an RDBMS rather than in a fulltext library. I am just interested in your opinion. (I expect that my users will raise the question of why they cannot use such a condition, so I ask in advance.) Leos - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Retrieval model used by Lucene
Well, I guess Lucene's Span query uses the cover-density-based model (a proximity model). However, it is within the framework of TF*IDF as well. Jian On 7/4/05, Dave Kor [EMAIL PROTECTED] wrote: Quoting [EMAIL PROTECTED]: Hi everybody, which kind of retrieval model is Lucene using? Is it a simple vector model, an extended boolean model, or another model? A reliable source with information about it would be fine, because every source I found is telling something different. :) Lucene uses the standard vector space model, basically TF*IDF. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
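The TF*IDF weighting mentioned above is easy to state in code in its textbook form. Note this is the generic vector-space weight, not Lucene's exact Similarity formula (Lucene uses variants such as sqrt(tf), a smoothed idf, and length norms, but the shape is the same):

```java
public class TfIdf {

    // Textbook TF*IDF weight: raw term frequency scaled by the
    // inverse document frequency log(N / df). A term occurring in
    // every document gets weight 0; a rare term is boosted.
    static double weight(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        System.out.println(weight(3, 1000, 1000)); // term in every doc: 0.0
        System.out.println(weight(3, 10, 1000));   // rare term: large weight
    }
}
```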
Re: No.of Files in Directory
Hi, My second suggestion is basically to store the user documents (Word docs) directly in the Lucene index. 1) If you are using Lucene 1.4.3, you can do something like this: // suppose the word docs are now in a byte array byte[] wordDoc = getUploadedWordDoc(); // add the byte array to the lucene index Document doc = new Document(); doc.add(Field.UnIndexed("originalDoc", getBase64(wordDoc))); The getBase64 method basically transforms the bytes into ASCII text, as follows: String getBase64(byte[] wordDoc) { byte[] chars = Base64.encodeBase64(wordDoc); String encodedStr = new String(chars, "US-ASCII"); return encodedStr; } You can get Base64.java from http://jakarta.apache.org/commons/codec/apidocs/org/apache/commons/codec/binary/Base64.html 2) Correct me if I am wrong, but I think the latest Lucene dev base has the capability to directly add binary content to the Lucene index. Looking at http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/document/Field.java?view=markup It has: /** * Create a stored field with binary value. Optionally the value may be compressed. * * @param name The name of the field * @param value The binary value * @param store How the value should be stored (compressed or not) */ public Field(String name, byte[] value, Store store) { ... So, I guess if you use the latest Lucene dev base, you can do: byte[] wordDoc = getUploadedWordDoc(); Document doc = new Document(); doc.add(new Field("originalDoc", wordDoc, Field.Store.YES)); I think the Lucene index is pretty good in terms of storing millions of small documents. However, there are two concerns that you might address: 1) There is no transaction support for index manipulation. I am not sure what happens if the machine gets shut down while the program is storing the original word document. Will the index be corrupted? 2) Since the Lucene index is basically files in a physical directory, the index file size could eventually hit a hard limit, and then you have to have another way to get around it.
(Split up the index into two indexes, or you could configure Lucene's IndexWriter.DEFAULT_MAX_MERGE_DOCS?) For example, I think some versions of windoze (e.g., those using the FAT file system) have a file size limit of 2GB. Let me know if this helps. Cheers, Jian On 6/29/05, bib_lucene bib [EMAIL PROTECTED] wrote: Thanks Jian I need to retrieve the original document sometimes. I did not quite understand your second suggestion. Can you please help me understand better? A pointer to some web resource will also help. jian chen [EMAIL PROTECTED] wrote: Hi, Depending on the operating system, there might be a hard limit on the number of files in one directory (windoze versions). Even with operating systems that don't have a hard limit (linux), it is still better not to put too many files in one directory. Typically, the file system won't be very efficient in terms of file retrieval if there are more than a couple of thousand files in one directory. There are some ways to tackle this issue. 1) Use a hash function to distribute the files to different sub-directories based on the file name. For example, use the MD5 algorithm or the CRC algorithm in Java to hash the file name to a number, and use this number to construct the directory. For example, if the number you hashed is 123456, then you can make 123 a sub-dir name and 456 the sub-sub dir name, and so forth. I think the SQUID web proxy server uses this approach for its file caching. 2) Why not use Lucene's indexing algorithm and store the binary files within the lucene index?! I love the indexing algorithm in that you don't need to manage the free space like in a typical file system, because the merge process will take care of reclaiming the free space automatically. I hope these two suggestions help. Jian On 6/29/05, bib_lucene bib wrote: Hi All In my webapp I have people uploading their documents. My server is windows/tomcat. I am thinking there will be a limit on the number of files in a directory.
Typically application users will upload 3-5 page word docs. 1. How does one design the system such that there will not be any problem as the users keep uploading their files, even if a million files are reached? 2. Is there a sample application that does this? 3. Should I have lucene update the index after each upload, or should I do it once a day? Thanks Bib - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
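The Base64 detour suggested in this thread can be sketched without commons-codec; java.util.Base64 (available in later JDKs) is used here purely to keep the example self-contained and dependency-free:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BinaryFieldCodec {

    // Encode raw bytes (e.g. an uploaded Word doc) as ASCII text so
    // they can be kept in a stored, unindexed Lucene field, and decode
    // them back on retrieval. The round trip is lossless; the price is
    // roughly a 4/3 size expansion.
    static String encode(byte[] raw) {
        return new String(Base64.getEncoder().encode(raw), StandardCharsets.US_ASCII);
    }

    static byte[] decode(String stored) {
        return Base64.getDecoder().decode(stored);
    }

    public static void main(String[] args) {
        byte[] doc = {0x00, (byte) 0xFF, 0x42};   // arbitrary binary content
        String field = encode(doc);               // safe to store as text
        System.out.println(field.length());       // 4 chars for 3 bytes
        System.out.println(decode(field).length); // round-trips to 3 bytes
    }
}
```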
question regarding the commit.lock
Hi, I am looking at and trying to understand more about Lucene's reader/writer synchronization. Does anyone know when the commit.lock is released? I could not find it anywhere in the source code. I did see that the write.lock is released in IndexWriter.close(). Thanks, - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Design question [too many fields?]
Hi, Naimdjon, I have some suggestions as well, along the lines of Mark Harwood's. As an example, suppose for each hotel room there is a description, and you want the user to do free text search on the description field. You could do the following: 1) store hotel room reservation info as rows in a relational database: create table reservation ( id int, room_no int, reservation_start_date timestamp, reservation_end_date timestamp, primary key (id) ) 2) store the description for each hotel room in the Lucene index with two fields, i.e., room_no and description 3) provide the user with free text search on the room description as well as availability info, like the following: --do a full text search on description using the Lucene index --get the room numbers from the search result documents --using these room numbers, look up the reservation table to check that the user-specified start date and end date are not already reserved --the top several rooms that rank high in the free text search results and are also not reserved will be returned to the user How does this sound? Jian On 6/29/05, Erik Hatcher [EMAIL PROTECTED] wrote: I second Mark's suggestion over the alternative I posted. My alternative was merely to invert the field structure originally described, but using a Filter for the volatile information is wiser. Erik On Jun 29, 2005, at 9:58 AM, mark harwood wrote: Presumably there is also a free-text element to the search or you wouldn't be using Lucene. Multiple fields is not the way to go. A single Lucene field could contain multiple terms (the available dates), but I still don't think that's the best solution. The availability info is likely to be pretty volatile and you always want up-to-date info, so I would prefer to hit a database for this. If you keep a DB primary key to Lucene doc id look-up cached in memory, you can quickly construct a Lucene filter from the database results and therefore only show Lucene results for available rooms.
Cheers Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
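The three-step flow above (Lucene search first, then a reservation check) can be sketched with the database replaced by an in-memory map. In a real system the overlap test would be a SQL range query against the reservation table; the class and method names here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AvailabilityFilter {

    // rankedRooms: room numbers in Lucene relevance order.
    // reservations: room_no -> array of [start, end) intervals
    // (dates as epoch days for brevity). Returns only rooms whose
    // reservations do not overlap the requested interval, preserving
    // the relevance order.
    static List<Integer> availableRooms(List<Integer> rankedRooms,
                                        Map<Integer, long[][]> reservations,
                                        long start, long end) {
        List<Integer> out = new ArrayList<>();
        for (int room : rankedRooms) {
            boolean free = true;
            for (long[] r : reservations.getOrDefault(room, new long[0][])) {
                if (r[0] < end && start < r[1]) { // standard interval overlap
                    free = false;
                    break;
                }
            }
            if (free) out.add(room);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, long[][]> res = new HashMap<>();
        res.put(101, new long[][] {{5, 10}});              // room 101 booked days 5-10
        List<Integer> ranked = Arrays.asList(101, 205, 310); // from the text search
        System.out.println(availableRooms(ranked, res, 7, 9)); // [205, 310]
    }
}
```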
Re: Strategy for making short documents not bubble to the top?
Hi, I would use a pure span or cover-density-based ranking algorithm, which does not take document length into consideration (tweaking whatever is currently in the standard Lucene distribution?). For example, when searching for the keywords beautiful house, span/cover ranking will give a long document and a short document the same ranking as long as they have the same number of spans/covers (for example, beautiful xx house is one cover) and, within each span/cover, the edit distance between the keywords is the same. Just my 2 cents, Cheers, Jian On 29 Jun 2005 20:30:49 -, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi, Short documents bubble to the top of the results because the field length is short. Does anyone have a good strategy for working around this? Will doing something like log(document length) flatten out my results while still making them meaningful? I'm going to try some different approaches but any advice is appreciated. Thanks. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: No.of Files in Directory
Hi, Depending on the operating system, there might be a hard limit on the number of files in one directory (windoze versions). Even with operating systems that don't have a hard limit (linux), it is still better not to put too many files in one directory. Typically, the file system won't be very efficient in terms of file retrieval if there are more than a couple of thousand files in one directory. There are some ways to tackle this issue. 1) Use a hash function to distribute the files to different sub-directories based on the file name. For example, use the MD5 algorithm or the CRC algorithm in Java to hash the file name to a number, and use this number to construct the directory. For example, if the number you hashed is 123456, then you can make 123 a sub-dir name and 456 the sub-sub dir name, and so forth. I think the SQUID web proxy server uses this approach for its file caching. 2) Why not use Lucene's indexing algorithm and store the binary files within the lucene index?! I love the indexing algorithm in that you don't need to manage the free space like in a typical file system, because the merge process will take care of reclaiming the free space automatically. I hope these two suggestions help. Jian On 6/29/05, bib_lucene bib [EMAIL PROTECTED] wrote: Hi All In my webapp I have people uploading their documents. My server is windows/tomcat. I am thinking there will be a limit on the number of files in a directory. Typically application users will upload 3-5 page word docs. 1. How does one design the system such that there will not be any problem as the users keep uploading their files, even if a million files are reached? 2. Is there a sample application that does this? 3. Should I have lucene update the index after each upload, or should I do it once a day? Thanks Bib - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
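Suggestion 1), hashing file names into nested sub-directories, might look like the following sketch; the two-level 256x256 fan-out taken from the leading hash bytes is an arbitrary choice for illustration, not something SQUID or any particular system prescribes:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ShardedPath {

    // Map a file name to a two-level sub-directory by MD5-hashing the
    // name and peeling off the first two bytes as hex directory names,
    // e.g. "report.doc" -> "a3/7f/report.doc". The same name always
    // lands in the same place, and files spread evenly across
    // 256 * 256 directories.
    static String shardPath(String fileName) throws NoSuchAlgorithmException {
        byte[] h = MessageDigest.getInstance("MD5").digest(fileName.getBytes());
        return String.format("%02x/%02x/%s", h[0] & 0xff, h[1] & 0xff, fileName);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(shardPath("report.doc"));
        System.out.println(shardPath("budget.xls"));
    }
}
```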
Re: Lock File exceptions
Hi, Recently I looked at the locking mechanism of Lucene. If I am correct, the process of grabbing the lock file will time out by default in 10 seconds. When it times out, it will throw the IOException you are seeing. The Lucene locking mechanism is not limited to threads in the same JVM. It uses lock files so that other processes (even a perl program) can also be synchronized in terms of accessing the index. The current Lucene locking implementation uses a polling mechanism, i.e., it constantly checks whether the lock file can be obtained. It would be better if a wait/notify mechanism were used rather than polling. If you don't care about access from other JVMs or processes, maybe you can use the Java 1.5 reader/writer lock mechanism for synchronizing between multiple readers and one writer? Cheers, Jian On 6/27/05, Yousef Ourabi [EMAIL PROTECTED] wrote: Hello: I get this lock-file exception on both Windows and Linux; my app is running inside tomcat 5.5.9, jvm 1.5.03...has anyone seen this before? If I delete the LOCK file it works, but obviously I shouldn't do that...Just wondering what's up? IOException caught here: Lock obtain timed out: Lock@/usr/local/java/jakarta-tomcat-5.5.9/temp/lucene-4f978fb745a946b4dbce87bf411caa25-write.lock Thanks in advance for any help. -Yousef - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
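The suggested Java 1.5 alternative for in-process synchronization is java.util.concurrent.locks.ReentrantReadWriteLock. A minimal sketch follows; the class and its fields are hypothetical, and unlike Lucene's lock files this cannot coordinate a second process:

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class InProcessIndexLock {

    // Many searcher threads may hold the read lock concurrently; a
    // writer thread excludes them (and other writers) only while it
    // holds the write lock. Threads block on the lock instead of
    // polling a lock file.
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private int committedDocs = 0;   // stand-in for shared index state

    int search() {
        lock.readLock().lock();
        try {
            return committedDocs;
        } finally {
            lock.readLock().unlock();
        }
    }

    void addDocument() {
        lock.writeLock().lock();
        try {
            committedDocs++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        InProcessIndexLock idx = new InProcessIndexLock();
        idx.addDocument();
        idx.addDocument();
        System.out.println(idx.search()); // 2
    }
}
```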
Fwd: when is the commit.lock released?
Hi, I haven't heard anything back. Probably this email got lost on the way. Anyway, could anyone enlighten me on this? Thanks, Jian -- Forwarded message -- From: jian chen [EMAIL PROTECTED] Date: Jun 26, 2005 12:59 PM Subject: when is the commit.lock released? To: java-user@lucene.apache.org Hi, I am looking at and trying to understand more about Lucene's reader/writer synchronization. Does anyone know when the commit.lock is released? I could not find it anywhere in the source code. I did see that the write.lock is released in IndexWriter.close(). Thanks, Jian - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
when is the commit.lock released?
Hi, I am looking at and trying to understand more about Lucene's reader/writer synchronization. Does anyone know when the commit.lock is released? I could not find it anywhere in the source code. I did see that the write.lock is released in IndexWriter.close(). Thanks, Jian - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Span query performance issue
Hi, I think Span query in general has to do more work than a simple Phrase query. Phrase query, in its simplest form, just tries to find all terms that are adjacent to each other. Meanwhile, the terms in a Span query do not necessarily have to be adjacent; there can be other words in between. Therefore, I think it is expected that Span query is slower than Phrase query. That said, Span query is way more powerful than Phrase query. Jian On 25 Jun 2005 00:00:18 -, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi, I'm comparing SpanNearQuery to PhraseQuery results and noticing about an 8x difference on Linux. Is a SpanNearQuery doing 8x as much work? I'm considering diving into the code if the results sounds unusual to people. But if its really doing that much more work, I won't spend time optimizing something that can't get much faster. Thanks. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
document ids in cached in Hits and index merge
Hi, I have a stupid question regarding the transient nature of the document ids. As I understand it, documents obtain new doc ids during an index merge. Suppose you do a search and get the Hits object, and while you iterate through the documents by id, an index merge happens. How do the merge and the newly created ids not mess up the retrieval of the Hits documents? Could anyone please enlighten me on this synchronization issue? Thanks a lot, Jian - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Updating Documents:
Hi, You may look at this website: http://www.zilverline.org Cheers, Jian On 6/21/05, Markus Atteneder [EMAIL PROTECTED] wrote: I am looking for a search engine for our intranet, and so I am dealing with Lucene. I have read the FAQ and some postings, gotten some first experience with it, and now I have some questions. 1. Is Lucene a suitable search engine for an intranet search? I've experimented with poi and pdfbox for indexing Word/Excel/PDF files. 2. Files change frequently, so the indexing should run at least daily. Is there an out-of-the-box way to delete changed files from the index and re-add them? I've read that documents can only be deleted if you know the ID of the document in the index, and that the ID could change after an optimization of the index. Is there a best practice for that? I think a full re-indexing every day is not a good solution because of the data volume. 3. Does anyone know a project based on Lucene that offers a complete solution for an intranet search? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Indexing multiple languages
Hi, Interesting topic. I have thought about this as well. I want to index Chinese text mixed with English, i.e., I want to treat the English text inside Chinese text as English tokens rather than Chinese text tokens. Right now I think maybe I have to write a special analyzer that takes the text input and detects whether each character is an ASCII char; if it is, assemble the ASCII chars together into one token; if not, make the character a Chinese word token. So, bottom line is, just one analyzer for all the text, with the if/else logic inside the analyzer. I would like to hear more thoughts about this! Thanks, Jian On 5/31/05, Tansley, Robert [EMAIL PROTECTED] wrote: Hi all, DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin Core standard) and extracted full-text content of documents stored in it. Now that the system is being used globally, it needs to support multi-language indexing. I've looked through the mailing list archives etc. and it seems it's easy to plug in analyzers for different languages. What if we're trying to index multiple languages in the same site? Is it best to have: 1/ one index for all languages 2/ one index for all languages, with an extra language field so searches can be constrained to a particular language 3/ separate indices for each language? I don't fully understand the consequences in terms of performance for 1/, but I can see that false hits could turn up where one word appears in different languages (stemming could increase the chances of this). Also, some languages' analyzers are quite dramatically different (e.g. the Chinese one, which just treats every character as a separate token/word). On the other hand, if people are searching for proper nouns in metadata (e.g. DSpace) it may be advantageous to search all languages at once. I'm also not sure of the storage and performance consequences of 2/. Approach 3/ seems like it might be the most complex from an implementation/code point of view.
Does anyone have any thoughts or recommendations on this? Many thanks, Robert Tansley / Digital Media Systems Programme / HP Labs http://www.hpl.hp.com/personal/Robert_Tansley/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
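The if/else analyzer idea from this thread can be sketched as a plain tokenizing method: group runs of ASCII letters/digits into lowercased word tokens and emit every other non-whitespace character as its own token. A real Lucene Analyzer would wrap this logic in a Tokenizer subclass; this standalone version just shows the splitting rule:

```java
import java.util.ArrayList;
import java.util.List;

public class MixedScriptTokenizer {

    // English words and numbers become single lowercased tokens;
    // everything else (e.g. each Chinese character) becomes a token of
    // its own - the one-character-per-token treatment the thread
    // describes for CJK text. Punctuation also falls into the
    // per-character branch in this simplified sketch.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder ascii = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 128 && Character.isLetterOrDigit(c)) {
                ascii.append(Character.toLowerCase(c));
            } else {
                if (ascii.length() > 0) {          // close the English run
                    tokens.add(ascii.toString());
                    ascii.setLength(0);
                }
                if (!Character.isWhitespace(c)) {  // one token per CJK char
                    tokens.add(String.valueOf(c));
                }
            }
        }
        if (ascii.length() > 0) tokens.add(ascii.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // "Lucene" stays one token; each Chinese character is its own.
        System.out.println(tokenize("Lucene\u7d22\u5f15 demo"));
    }
}
```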
Re: Indexing multiple languages
Hi, Erik, Thanks for your info. No, I haven't tried it yet. I will give it a try and maybe put a Chinese/English text search demo online. Currently I use Lucene as the indexing engine for a Velocity mailing list search. I have a demo at www.jhsystems.net. It is yet another mailing list search for Velocity, but I combined date search as well as full text search together. I only used Lucene for indexing the textual content, and combined database search with Lucene search in returning the results. The other interesting thought I have is: maybe it is possible to use Lucene's merge-segments mechanism to write a simple Java-based file system, which, of course, would not require a constant compact operation. The file system could be based on one file only, where segments are just parts of the big file. It might be really efficient in terms of adding/deleting objects all the time. Lastly, any comments are welcome on the www.jhsystems.net Velocity search. Thanks, Jian www.jhsystems.net On 5/31/05, Erik Hatcher [EMAIL PROTECTED] wrote: Jian - have you tried Lucene's StandardAnalyzer with Chinese? It will keep English as-is (removing stop words, lowercasing, and such) and separate CJK characters into separate tokens also. Erik On May 31, 2005, at 5:49 PM, jian chen wrote: Hi, Interesting topic. I have thought about this as well. I want to index Chinese text mixed with English, i.e., I want to treat the English text inside Chinese text as English tokens rather than Chinese text tokens. Right now I think maybe I have to write a special analyzer that takes the text input and detects whether each character is an ASCII char; if it is, assemble the ASCII chars together into one token; if not, make the character a Chinese word token. So, bottom line is, just one analyzer for all the text, with the if/else logic inside the analyzer. I would like to hear more thoughts about this!
Thanks, Jian On 5/31/05, Tansley, Robert [EMAIL PROTECTED] wrote: Hi all, DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin Core standard) and extracted full-text content of documents stored in it. Now that the system is being used globally, it needs to support multi-language indexing. I've looked through the mailing list archives etc. and it seems it's easy to plug in analyzers for different languages. What if we're trying to index multiple languages in the same site? Is it best to have: 1/ one index for all languages 2/ one index for all languages, with an extra language field so searches can be constrained to a particular language 3/ separate indices for each language? I don't fully understand the consequences in terms of performance for 1/, but I can see that false hits could turn up where one word appears in different languages (stemming could increase the chances of this). Also, some languages' analyzers are quite dramatically different (e.g. the Chinese one, which just treats every character as a separate token/word). On the other hand, if people are searching for proper nouns in metadata (e.g. DSpace) it may be advantageous to search all languages at once. I'm also not sure of the storage and performance consequences of 2/. Approach 3/ seems like it might be the most complex from an implementation/code point of view. Does anyone have any thoughts or recommendations on this? Many thanks, Robert Tansley / Digital Media Systems Programme / HP Labs http://www.hpl.hp.com/personal/Robert_Tansley/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]