Hi Durga, I have moved this discussion to the java-user list, since the java-dev list is devoted to development of the Java Lucene library, and not to questions about its capabilities. My answers are inline below.
[EMAIL PROTECTED] wrote:
> 1) What are the various languages supported by Lucene? Looks
> like it's able to handle only English. We are trying to see if it works
> with Japanese / Chinese and other characters

Lucene operates on Unicode text, and Unicode covers most languages.
If you have a way to convert your documents from their original encoding
into Unicode, then Lucene can index them.

The short answer is that Lucene can handle most languages. However,
precision and/or recall may suffer for those languages for which Lucene
does not provide any specific resources. Read on for the long answer.

In Lucene parlance, "analysis" is the process whereby raw text is
converted to index terms, a.k.a. tokens. There are three areas of
analysis where Lucene does not fully support all languages:
tokenization, stopword lists, and stemming. Most of the "analyzers"
(processing pipelines) in Lucene include these three components.

1. Tokenization: Lucene's StandardTokenizer[1] handles languages that
   use spaces between words; however, since the Chinese and Japanese
   orthographies do not employ word-separating characters,
   StandardTokenizer produces single-character tokens for these
   languages, rather than single-word tokens. See the Lucene Sandbox
   for CJKTokenizer[2], which instead produces overlapping bigram
   tokens. Notably, however, there are no word-segmenting analyzers
   for Chinese or Japanese in Lucene. Contributions are welcome!

2. Stopword lists: Lucene's StandardAnalyzer employs the English
   stopword list in StopAnalyzer[3] to remove stopwords. There are
   several analyzers in the Lucene Sandbox that include
   language-specific stopword lists.[4]

3. Stemming: In the Lucene Sandbox, there are pre-compiled Snowball
   stemmers for several Western languages[5].

> 2) After Lucene indexes a given data set, how does Lucene handle
> incremental / dynamic change in the data. In other words, our data keeps
> changing; how does Lucene handle this changing data.
> Does it re-index every
> new file entering this data set? Or does it index the data in
> increments?

Lucene can be used to index incrementally. It has no document update
functionality, however; one must first delete and then re-add a
modified document. See my answer to your question #3, below, for more
info.

> 3) How does Lucene handle deleted files from a particular data set?
> What we are concerned is that, does Lucene automatically figure out
> if a particular file is deleted from the data set? and it immediately
> removes the index to that particular file?

Lucene can handle document deletion, but it does not watch your data
set: your application must tell Lucene which documents to delete.
Also, Lucene's index readers are only aware of the state of an index
at the point at which they were opened -- in order to see the changes
introduced by document deletion (and addition), one must close and
re-open the index reader. This process can be less than "immediate".

> 4) Please consider the following scenario. When Lucene is
> given the following files to index.
>
> a) Files under /xyz/abc (say x.txt, y.txt, a.txt, b.txt, c.txt
> etc.)
> b) Files under /def/ghi (say none.txt, dude.txt, hello.txt etc.)
> So after Lucene finished indexing these files under these two
> directories. And a subsequent search for say a "key word" in hello.txt
> is made
> What does Lucene return; does it return i.e. the fully qualified
> location of this file? /def/ghi/hello.txt

Lucene returns an ordered list of matching documents. Each Lucene
document is composed of a user-specified set of "fields". If you wish
to remember the name of the file from which a document was
constructed, you can store the filename in a field, and then retrieve
this field's contents for a document returned by a search.

> 5) How does Lucene index a particular set of files. I.e.
> *based* on key words? Based on sentences? Based on what criterion?

You decide :). The analyzer you choose determines how text is broken
into index terms; see the discussion of analysis above, under your
first question.

> 6) Is Lucene multi-threaded?
> For example if Lucene is indexing a
> set of files in a given data set, and for example if there is a huge
> file (2 GB file). Does Lucene index this file in parts (i.e. in parallel,
> i.e. in multi-threaded fashion)? or does it index this file
> sequentially

The Lucene API is not multi-threaded, but it can be used in a
multi-threaded application. Unless otherwise noted in the API
documentation[6], Lucene methods should be thread-safe.

> 7) Also if a data set has multiple files, does Lucene process each
> file separately in a different thread? or does it do it sequentially

Again, Lucene can be used in a multi-threaded application, but it is
not itself multi-threaded. Populating a single index from multiple
threads is a standard, supported use of the Lucene API.

> 8) Does Lucene index only text files? We have few data bases; is
> it possible for us to index the data in these data bases?

Extracting analyzable text from original sources is not part of
Lucene's functionality. See the FAQ[7] for some information on
extracting text from different file types, and also on indexing
databases[8].

> 9) Are there any performance benchmarks for Lucene

Yes: <http://lucene.apache.org/java/docs/benchmarks.html>. Also,
search the java-user and java-dev lists.
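
To make the "overlapping bigram tokens" from my answer to your first
question a little more concrete, here is a rough sketch in plain Java.
Please treat it as an illustration only: BigramSketch and bigrams() are
names I made up, this is not the CJKTokenizer source, and it ignores
things a real tokenizer has to deal with (mixed CJK/non-CJK text, token
offsets, and so on). Emitting a lone character as its own token is
likewise just a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a plain-Java model of overlapping bigram tokenization.
// Hypothetical class; it does not use or reproduce any Lucene code.
final class BigramSketch {

    // Returns every pair of adjacent characters (Unicode code points) in
    // the input. A single-character input is returned as one token; an
    // empty input yields no tokens.
    static List<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        if (cps.length == 1) {
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            // Build a token from characters i and i+1.
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A three-character run yields two overlapping tokens.
        System.out.println(bigrams("日本語")); // prints [日本, 本語]
    }
}
```

The point of the overlap is that any two-character word in the text
shows up as a token regardless of where it starts, which is why bigrams
can stand in for real word segmentation, at the cost of a larger index
and some false matches.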

Steve

[1] StandardTokenizer and StandardAnalyzer:
    <http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/>
[2] CJKTokenizer and CJKAnalyzer:
    <http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/>
[3] StopAnalyzer:
    <http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java>
[4] Sandbox Analyzers:
    <http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/>
[5] Sandbox Snowball stemmers:
    <http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/snowball/src/java/net/sf/snowball/ext/>
[6] Lucene trunk API docs:
    <http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/index.html>
[7] Lucene FAQ:
    <http://wiki.apache.org/lucene-java/LuceneFAQ>
[8] How to index a database from the Lucene FAQ:
    <http://wiki.apache.org/lucene-java/LuceneFAQ#head-109358021acbfc89456e446740dc2bbf9049950f>

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp