Re: Questions Lucene

Steven Rowe Tue, 11 Sep 2007 07:46:46 -0700

Hi Durga,

I have moved this discussion to the java-user list, since the java-dev
list is devoted to development of the Java Lucene library, and not to
questions about its capabilities.  My answers are inline below.

[EMAIL PROTECTED] wrote:
>        1) What are the various languages supported by Lucene.? Looks
> like its able to handle only English . We are trying to see if it works
> with Japanese / Chinese and other characters

Lucene works internally with Unicode text, which covers most languages.
If you have a way to convert your documents' encoding into Unicode, then
Lucene can index them.  The short answer is that Lucene can handle most
languages.  However, precision and/or recall may suffer for those
languages for which Lucene does not provide any specific resources.
Read on for the long answer.

In Lucene parlance, "analysis" is the process whereby raw text is
converted to index terms, a.k.a. tokens.  There are three areas of
analysis where Lucene does not fully support all languages:
tokenization, stopword lists, and stemming.  Most of the "analyzers"
(processing pipelines) in Lucene include these three components.

1. Tokenization: Lucene's StandardTokenizer[1] handles languages that
use spaces between words; however, since the Chinese and Japanese
orthographies do not employ word-separating characters,
StandardTokenizer produces single-character tokens for these languages,
rather than single-word tokens.  See the Lucene Sandbox for
CJKTokenizer[2], which instead produces overlapping bigram tokens.
Notably, however, there are no word-segmenting analyzers for Chinese or
Japanese in Lucene.  Contributions are welcome!

2. Stopword lists: Lucene's StandardAnalyzer employs the English
stopword list in StopAnalyzer[3] to remove stopwords.  There are several
analyzers in the Lucene Sandbox that include language-specific stopword
lists.[4]

3. Stemming: In the Lucene Sandbox, there are pre-compiled Snowball
stemmers for several Western languages[5].
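
To make the tokenization and stopword points concrete, here is a rough
sketch in plain Java.  It does not use Lucene's classes: the class name,
method names, and the tiny stopword list are all made up for
illustration, and the character handling assumes text without
surrogate pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: plain Java, not Lucene's tokenizer or analyzer classes.
class AnalysisSketch {
    // Hypothetical, deliberately tiny stopword list.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Space-delimited languages: split on whitespace, lowercase, drop stopwords.
    static List<String> wordTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOPWORDS.contains(term)) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    // One token per character (assumes no surrogate pairs in the input).
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Overlapping two-character tokens: "abc" yields "ab" and "bc".
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }
}
```

For a three-character string such as 日本語, unigrams() yields three
single-character tokens, while bigrams() yields 日本 and 本語.  Neither
is true word segmentation.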

>        2) After Lucene indexes a given data set, how does Lucene handle
> incremental / dymanic change in the data. In other words, our data keeps
> changing ; how does Lucene handle this changing data. Does it re-index every
> new file entering this data set ?. Or Does it do it index the data in
> increments ?

Lucene can be used to index incrementally.  It has no document update
functionality, however; one must first delete and then re-add a modified
document.  See my answer to your question #3, below, for more info.
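
The delete-then-re-add pattern can be sketched with a toy in-memory
"index" in plain Java.  This is not the Lucene API; the class, the Doc
type, and the idea of keying documents by an id string are assumptions
made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "update = delete + add"; not the Lucene API.
class UpdateSketch {
    // Hypothetical document: an identifying key (e.g. a file path) plus text.
    static final class Doc {
        final String id;
        final String text;
        Doc(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    void add(Doc doc) {
        docs.add(doc);
    }

    // Removes every document with the given id; returns how many were removed.
    int deleteById(String id) {
        int before = docs.size();
        docs.removeIf(d -> d.id.equals(id));
        return before - docs.size();
    }

    // An "update" is expressed as a delete followed by an add.
    void update(Doc doc) {
        deleteById(doc.id);
        add(doc);
    }

    int size() {
        return docs.size();
    }

    // Returns the text of the first document with this id, or null if absent.
    String textOf(String id) {
        for (Doc d : docs) {
            if (d.id.equals(id)) {
                return d.text;
            }
        }
        return null;
    }
}
```

Only the changed document is touched; the rest of the index stays as it
was, which is what makes incremental indexing possible.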

>       3) How does Lucene handle deleted files from a particular data set
> ?. What we are concerned is that, does Lucene automatically figure out
> if a particular file is deleted from the data set ?. and it immediately
> removes the index to that particular file ?

Lucene can handle document deletion, but it does not watch your data
set: your application must tell Lucene which documents to delete.  Also,
Lucene's index readers are only aware of the state of an index at the
point at which they were opened.  To see the changes introduced by
document deletion (and addition), one must close and re-open the index
reader, so those changes may not be visible immediately.
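
The point-in-time behaviour of readers can be modelled roughly as
follows.  This is a toy in plain Java, not Lucene code: the Index and
Reader classes are hypothetical, and copying the document list is just
the simplest way to imitate a snapshot, not a description of how Lucene
implements it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of point-in-time readers; not the Lucene API.
class SnapshotSketch {
    static final class Index {
        private final List<String> docs = new ArrayList<>();

        synchronized void add(String doc) {
            docs.add(doc);
        }

        synchronized boolean delete(String doc) {
            return docs.remove(doc);
        }

        // A reader sees the index as it was when the reader was opened.
        synchronized Reader openReader() {
            return new Reader(new ArrayList<>(docs));
        }
    }

    static final class Reader {
        private final List<String> snapshot;

        Reader(List<String> snapshot) {
            this.snapshot = Collections.unmodifiableList(snapshot);
        }

        boolean contains(String doc) {
            return snapshot.contains(doc);
        }

        int numDocs() {
            return snapshot.size();
        }
    }
}
```

In this model, a reader opened before a delete still reports the
deleted document; only a reader opened afterwards reflects the change.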

>       4) Please consider the following Scenario. When Lucene is
> given the following files to Index.
> 
>          a) Files under /xyz/abc ( Say x.txt, y.txt, a.txt, b.txt, c.txt
> etc.. )
>          b) Files under /def/ghi ( Say none.txt, dude.txt, hello.txt etc.. )
>            So after Lucene finished indexing these file under these two
> directories. And a subsequent search for say a "key word" in hello.txt
> is made
>          What does Lucene return; does it return i.e the fully qualified
> location of this file ? /def/ghi/hello.txt

Lucene returns an ordered list of matching documents.  Each Lucene
document consists of a user-specified set of fields.  If you wish to
remember the name of the file from which a document was constructed, you
can store the filename in a field, and then retrieve this field's
contents for a document returned by a search.
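
As a rough sketch of that idea in plain Java (not the Lucene API): each
document below is just a map of field names to values, and the field
names "path" and "contents", the doc() helper, and the substring-based
search() are all assumptions for illustration.  Unlike Lucene, this toy
search does no scoring and returns hits in insertion order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of documents with a stored "path" field; not the Lucene API.
class StoredFieldSketch {
    // Builds a document with two hypothetical fields: "path" and "contents".
    static Map<String, String> doc(String path, String contents) {
        Map<String, String> fields = new HashMap<>();
        fields.put("path", path);
        fields.put("contents", contents);
        return fields;
    }

    // Returns documents whose "contents" field contains the keyword,
    // ignoring case.  No ranking: hits come back in insertion order.
    static List<Map<String, String>> search(List<Map<String, String>> index,
                                            String keyword) {
        List<Map<String, String>> hits = new ArrayList<>();
        String needle = keyword.toLowerCase(Locale.ROOT);
        for (Map<String, String> d : index) {
            String contents = d.getOrDefault("contents", "");
            if (contents.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(d);
            }
        }
        return hits;
    }
}
```

The caller reads the "path" field of each hit to get back
/def/ghi/hello.txt; the search result itself knows nothing about files
beyond what was stored in the document.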

>            5) How does Lucene index a particular set of files. I.e
> *based* on key words ?. Based on sentences ? Based on what criterion ?

You decide :).  The analyzer you choose determines how text is converted
into index terms; see the discussion of analysis under your first
question, above.

>       6) is Lucene multi-threaded ?. For example if Lucene is indexing a
> set of files in a given data set, and for example if there is a Huge
> file ( 2 GB file ). Does Lucene index this file in parts (i.e parallely
>            i.e in multi-threaded fashion ? or does it index this file
> sequentially

The Lucene API is not multi-threaded, but it can be used in a
multi-threaded application.  Unless otherwise noted in the API
documentation[6], Lucene methods should be thread-safe.

>      7) Also if a data set has multiple files, does Lucene process each
> file seperately in a different thread ? or does it do it sequentially

Again, Lucene can be used in a multi-threaded application, but it is not
itself multi-threaded.  Populating a single index from multiple threads
is a standard, supported use of the Lucene API.
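
The application-side threading might look roughly like this.  It is a
sketch in plain Java, not Lucene code: the concurrent queue stands in
for a shared, thread-safe index, and the class name, method name, and
30-second timeout are assumptions for the example.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model: several application threads feed one shared index.
class ThreadedIndexingSketch {
    // "Indexes" each file name on a worker thread; returns the number indexed.
    static int indexAll(List<String> files, int threads) {
        // Stand-in for a single shared, thread-safe index.
        Queue<String> sharedIndex = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String file : files) {
            // The application, not the index, decides how work is split up.
            pool.execute(() -> sharedIndex.add(file));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return sharedIndex.size();
    }
}
```

Here each file is one unit of work; nothing in this sketch splits a
single large file across threads.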

>      8) Does lucene index only text files ?. We have few data bases is
> it possible for us to Index the data in these data bases ?

Extracting analyzable text from original sources is not part of Lucene's
functionality.  See the FAQ[7] for some information on extracting text
from different file types, and also on indexing databases[8].

>      9) Are there any performance Bench Marks for Lucene

Yes: <http://lucene.apache.org/java/docs/benchmarks.html>.  Also, search
the java-user and java-dev lists.

Steve

[1] StandardTokenizer and StandardAnalyzer:
<http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/>

[2] CJKTokenizer and CJKAnalyzer:
<http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/>

[3] StopAnalyzer:
<http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/StopAnalyzer.java>

[4] Sandbox Analyzers:
<http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/>

[5] Sandbox Snowball stemmers:
<http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/snowball/src/java/net/sf/snowball/ext/>

[6] Lucene trunk API docs:
<http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/index.html>

[7] Lucene FAQ: <http://wiki.apache.org/lucene-java/LuceneFAQ>

[8] How to index a database from the Lucene FAQ:
<http://wiki.apache.org/lucene-java/LuceneFAQ#head-109358021acbfc89456e446740dc2bbf9049950f>

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

