Re: Questions Lucene

Erik Hatcher Mon, 10 Sep 2007 17:45:44 -0700


On Sep 10, 2007, at 7:56 PM, [EMAIL PROTECTED] wrote:

1) What are the various languages supported by Lucene? It looks like it is able to handle only English. We are trying to see if it works with Japanese, Chinese, and other characters. Can someone answer?

Lucene internally uses UTF-8 (the Java modified version), so you won't have any encoding issues. Everything is just text inside the index, so there is no problem with Chinese, Japanese, or any other language I've encountered. There are certainly language-specific considerations, though, such as stemming, stop word removal, and, for non-whitespace-separated languages such as Chinese, whether to do anything special to tokenize on "words", use n-gramming, or just do simple character tokenization.
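
As a rough illustration of the n-gram option mentioned above, here is a minimal sketch in plain Java. It does not use Lucene's API; the class BigramTokenizer and its bigrams method are hypothetical names made up for this example. It only shows the idea of turning unsegmented text into overlapping two-character terms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Lucene: splits text into overlapping
// two-character terms (bigrams), skipping whitespace.
class BigramTokenizer {
    static List<String> bigrams(String text) {
        // Work on code points so characters outside the BMP stay intact.
        int[] cps = text.codePoints()
                        .filter(cp -> !Character.isWhitespace(cp))
                        .toArray();
        List<String> terms = new ArrayList<>();
        if (cps.length == 1) {
            // A single character cannot form a bigram; keep it as its own term.
            terms.add(new String(cps, 0, 1));
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            terms.add(new String(cps, i, 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Three kanji (U+65E5 U+672C U+8A9E), written as escapes so the
        // source file stays pure ASCII; this yields two overlapping bigrams.
        System.out.println(bigrams("\u65e5\u672c\u8a9e"));
    }
}
```

A query would be run through the same splitting, so a two-character query term matches any document containing that character pair. Whether bigrams, single characters, or a real word segmenter works best depends on the language and the data.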

2) After Lucene indexes a given data set, how does Lucene handle incremental / dynamic change in the data? In other words, our data keeps changing; how does Lucene handle this changing data? Does it re-index every new file entering this data set, or does it index the data in increments?

There is really no such thing as an "update" operation, so the application is responsible for effecting that with a delete and re-add on a per-document basis.
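
The delete and re-add pattern can be sketched roughly as follows. This is plain Java with a toy in-memory stand-in, not Lucene's API: ToyIndex and all of its methods are hypothetical, and the only assumption carried over is that each document has a unique id the application can delete by.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an index, NOT Lucene's API. Documents are appended, and
// removal is by a unique id chosen by the application.
class ToyIndex {
    // Each entry is {id, text}.
    private final List<String[]> docs = new ArrayList<>();

    void add(String id, String text) {
        docs.add(new String[] {id, text});
    }

    // Removes every document with the given id; returns how many were removed.
    int deleteById(String id) {
        int before = docs.size();
        docs.removeIf(d -> d[0].equals(id));
        return before - docs.size();
    }

    // The application-level "update": delete any old copy, then re-add.
    void update(String id, String newText) {
        deleteById(id);
        add(id, newText);
    }

    List<String> textsFor(String id) {
        List<String> out = new ArrayList<>();
        for (String[] d : docs) {
            if (d[0].equals(id)) {
                out.add(d[1]);
            }
        }
        return out;
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("doc1", "old text");
        index.update("doc1", "new text");
        System.out.println(index.size() + " " + index.textsFor("doc1"));
    }
}
```

The point of the sketch is only that a changed document is handled one document at a time, keyed by an id, rather than by re-indexing the whole data set.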

3) How does Lucene handle files deleted from a particular data set? What we are concerned about is: does Lucene automatically figure out that a particular file has been deleted from the data set, and immediately remove the index entry for that file?

4) Please consider the following scenario. Lucene is given the following files to index:
a) Files under /xyz/abc (say x.txt, y.txt, a.txt, b.txt, c.txt, etc.)
b) Files under /def/ghi (say none.txt, dude.txt, hello.txt, etc.)
After Lucene has finished indexing the files under these two directories, a search is made for, say, a keyword in hello.txt. What does Lucene return? Does it return the fully qualified location of this file, i.e. /def/ghi/hello.txt?

Lucene is about text, not files per se. It is your application that will map that kind of logic on top of Lucene. Lucene itself knows nothing of the files you want to index, delete, or search; you will build that mapping yourself. Your application will be responsible for keeping the data and the index in sync.
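
One way to picture that mapping, as a sketch only: the application records the file path alongside the text it hands over, and removes the entry itself when the file goes away. The code below is plain Java with a toy inverted index; PathIndex and its methods are hypothetical and are not Lucene classes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index, NOT Lucene's API: maps each term to the set of file
// paths whose text contained it. The path is data the application supplies.
class PathIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // The application reads the file and passes in both its path and its text.
    void addFile(String path, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(path);
            }
        }
    }

    // The application calls this when it notices the file was deleted.
    void removeFile(String path) {
        for (Set<String> paths : postings.values()) {
            paths.remove(path);
        }
    }

    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                                     Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        PathIndex index = new PathIndex();
        index.addFile("/def/ghi/hello.txt", "hello world keyword");
        index.addFile("/xyz/abc/x.txt", "something else");
        System.out.println(index.search("keyword"));
    }
}
```

A search hit carries a path only because the application put it there, and nothing disappears from the index until the application calls removeFile.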

5) How does Lucene index a particular set of files, i.e. based on keywords? Based on sentences? Based on what criterion?

Again, it doesn't deal with "files"... your application deals with that; Lucene is handed text. As for how it makes words in text searchable, read up on Lucene Analyzers. They break the text into searchable terms.
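
To give a feel for what "breaking text into searchable terms" means, here is a small plain-Java sketch of the kind of processing an analyzer performs: lowercasing, splitting on non-letter/non-digit characters, and dropping stop words. It is not Lucene's Analyzer API; SimpleAnalyzerSketch, its analyze method, and the tiny stop word list are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of analysis, NOT Lucene's Analyzer class.
class SimpleAnalyzerSketch {
    // A deliberately tiny stop word list, for illustration only.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Lowercase, then split on runs of anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [quick, brown, fox, here]
        System.out.println(analyze("The Quick Brown Fox is HERE"));
    }
}
```

The same analysis is normally applied to queries as well, so that a search for "Fox" and indexed text containing "fox" end up as the same term.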

6) Is Lucene multi-threaded? For example, if Lucene is indexing a set of files in a given data set and there is a huge file (a 2 GB file), does Lucene index this file in parts (i.e. in parallel, in multi-threaded fashion), or does it index this file sequentially?

Lucene itself isn't multi-threaded, but most operations are thread-safe, so you can parallelize your application to index multiple documents simultaneously, for example. You may be able to parallelize the parsing of those huge files, but you'd need to bring that together into a single Document instance to hand to Lucene's IndexWriter.
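
The application-side parallelism described above might look something like the sketch below: a fixed thread pool where each task adds one document to a single shared writer. This is plain Java, not Lucene code; SharedWriter is a hypothetical thread-safe stand-in for the shared writer, and ParallelIndexingSketch and indexAll are names invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelIndexingSketch {
    // Stand-in for one shared, thread-safe writer; NOT Lucene's IndexWriter.
    static class SharedWriter {
        private final List<String> docs =
            Collections.synchronizedList(new ArrayList<String>());

        void addDocument(String text) {
            docs.add(text);
        }

        int count() {
            return docs.size();
        }
    }

    // Adds every text as its own document using a pool of worker threads and
    // returns how many documents the shared writer received.
    static int indexAll(List<String> texts, int threads) {
        SharedWriter writer = new SharedWriter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String text : texts) {
            pool.execute(() -> writer.addDocument(text));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return writer.count();
    }

    public static void main(String[] args) {
        System.out.println(indexAll(Arrays.asList("doc a", "doc b", "doc c", "doc d"), 2));
    }
}
```

Each task handles one whole document, which matches the constraint that a huge file still has to arrive at the writer as a single document, however its parsing was split up beforehand.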

7) Also, if a data set has multiple files, does Lucene process each file separately in a different thread, or does it do it sequentially?


Again, this is up to your application entirely.

8) Does Lucene index only text files? We have a few databases; is it possible for us to index the data in these databases?

See above :) All Lucene cares about is text. How you get text to it matters not to Lucene.

9) Are there any performance benchmarks for Lucene?

There is a benchmarker framework built into the trunk codebase suitable for making your own. There's some stuff here: http://lucene.apache.org/java/docs/benchmarks.html and some good stuff linked from http://wiki.apache.org/lucene-java/BasicsOfPerformance that should get you started.


        Erik


