Re: Looking for more information about Lucene

Adrien Grand Tue, 22 May 2018 23:54:22 -0700

Hi Alexandre,

I don't have time for a call, but to give you some pointers, Lucene does
the following that may be related to natural language processing:
 - Word segmentation via the `Tokenizer` class. This is rather simple for
Western languages (including French, see `StandardTokenizer`), but less so
for e.g. Japanese or Korean, which we also support.
 - We have a couple of stemmers implemented as `TokenFilter`s, including one
for French, see the `org.apache.lucene.analysis.fr` package.
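
To make the two points above more concrete, here is a minimal sketch of
running French text through tokenization and stemming. It assumes a Lucene
7.x-style setup with lucene-core and the common analyzers module on the
classpath; the class name `FrenchAnalysisSketch`, the field name "body" and
the sample sentence are made up for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, not part of Lucene.
public class FrenchAnalysisSketch {

    // Returns the terms that FrenchAnalyzer produces for the given text.
    static List<String> analyze(String text) throws IOException {
        List<String> terms = new ArrayList<>();
        // FrenchAnalyzer bundles a tokenizer with French-specific token
        // filters, including stemming. "body" is an arbitrary field name.
        try (Analyzer analyzer = new FrenchAnalyzer();
             TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                  // required before incrementToken()
            while (stream.incrementToken()) {
                terms.add(term.toString());  // copy the current term text
            }
            stream.end();                    // required once the stream is consumed
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(analyze("Les chevaux mangent de l'herbe"));
    }
}
```

The exact terms printed depend on the Lucene version and on the stemmer
that `FrenchAnalyzer` is configured with, so treat the output as something
to inspect rather than a fixed result.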


More answers inline below:


On Tue, 22 May 2018 at 17:33, BABAUD Alexandre <
alexandre.bab...@soprasteria.com> wrote:

> · What exactly are the types of files the software is able to deal
> with?
>

Lucene doesn't deal with file types directly: you need to be able to pass it
a string or a stream of characters. If you have a text file, this is easy. If
you have PDF files, you will need a third-party library such as Apache Tika
to extract the content.
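
As a rough illustration of the "string or stream of characters" part, the
sketch below only uses the JDK: it turns either an in-memory string or a
UTF-8 text file into a `java.io.Reader`, which is the kind of input that
Lucene's analysis API accepts (for instance `Analyzer.tokenStream(String,
Reader)`). The class and method names are hypothetical, and UTF-8 is an
assumption about your files.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, JDK only.
public class TextInputSketch {

    // Wraps an in-memory string as a character stream.
    static Reader fromString(String text) {
        return new StringReader(text);
    }

    // Opens a plain text file as a character stream (assumes UTF-8).
    static Reader fromTextFile(Path path) throws IOException {
        return Files.newBufferedReader(path, StandardCharsets.UTF_8);
    }

    // Drains a reader into a string and closes it; only here to show
    // that both sources yield the same characters.
    static String readAll(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sample", ".txt");
        Files.write(file, "Bonjour Lucene".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(fromTextFile(file)));
        System.out.println(readAll(fromString("Bonjour Lucene")));
        Files.delete(file);
    }
}
```

Binary formats such as PDF do not fit this pattern directly, which is why
a text extraction step has to come first.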


> · What about data storage? Is it stored in-house? (I am very
> concerned about data privacy)
>
Not really relevant: Lucene is a library that runs inside your own
application, so it's up to you to decide where you store your data.
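
For instance, here is a minimal sketch of writing an index to a local
directory that the caller picks, assuming Lucene 7.x or later with
lucene-core on the classpath. The class name `LocalIndexSketch`, the field
name "body" and the temporary directory are placeholders for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical helper class, not part of Lucene.
public class LocalIndexSketch {

    // Indexes one document under indexPath and returns the number of
    // documents found when the index is reopened.
    static int indexOneDocument(Path indexPath, String text) throws IOException {
        // The index files are written under indexPath, a location chosen
        // by the caller.
        try (Directory dir = FSDirectory.open(indexPath)) {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                Document doc = new Document();
                doc.add(new TextField("body", text, Field.Store.YES));
                writer.addDocument(doc);
            } // closing the writer commits the pending changes by default
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return reader.numDocs();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path indexPath = Files.createTempDirectory("lucene-index-sketch");
        System.out.println(indexOneDocument(indexPath, "Bonjour Lucene"));
    }
}
```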

> · Is it easily customizable?
>
Since Lucene is a library, I would say the answer is yes.
