Hi Alexandre,

I don't have time for a call, but here are some pointers. Lucene does the following that may be related to natural language processing:

 - Word segmentation via the `Tokenizer` class. This is rather simple for Western languages (including French, see `StandardTokenizer`), but less so for languages such as Japanese or Korean, which we also support.
 - We have a couple of stemmers implemented as `TokenFilter`s, including for French; see the `org.apache.lucene.analysis.fr` package.

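
To make the two stages above concrete, here is a minimal sketch that uses only the JDK, not Lucene itself. It mirrors the idea of segmenting text into words and then passing each token through a filter. The class `AnalysisSketch` and the method `toyStem` are hypothetical names, and the "stemmer" is a deliberately naive rule (lowercase, drop a trailing "s"), not the algorithm used by the filters in `org.apache.lucene.analysis.fr`.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; none of this is Lucene API.
class AnalysisSketch {

    // Stage 1: word segmentation, the role a Tokenizer plays in Lucene.
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        int end = words.next();
        while (end != BreakIterator.DONE) {
            String candidate = text.substring(start, end);
            // Skip segments made only of whitespace or punctuation.
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(candidate);
            }
            start = end;
            end = words.next();
        }
        return tokens;
    }

    // Stage 2: a per-token filter, the role a TokenFilter plays in Lucene.
    // Toy rule (an assumption for illustration): lowercase the token and
    // strip one trailing "s" from tokens longer than three characters.
    static String toyStem(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        if (lower.length() > 3 && lower.endsWith("s")) {
            return lower.substring(0, lower.length() - 1);
        }
        return lower;
    }

    // Chain the two stages: segment first, then filter each token.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text, Locale.FRENCH)) {
            out.add(toyStem(token));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Les chevaux mangent des pommes"));
    }
}
```

In a real application you would rely on Lucene's own analysis classes rather than a hand-written rule like this; the sketch only shows how the two stages fit together.
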
More answers inline below:

On Tue, May 22, 2018 at 17:33, BABAUD Alexandre <alexandre.bab...@soprasteria.com> wrote:

> · What exactly are the type of files the software is able to deal with?

Lucene doesn't deal with file types directly: you need to be able to pass it a string or a stream of characters. If you have a text file, this is easy. If you have PDF files, you will need a third-party library such as Tika to extract the content.

> · What about data storage? Is it stock in-house? (I am very concerned about data privacy)

Not really relevant: it's up to you to decide where you store your data.

> · Is it easily customizable?

Being a library, I guess the answer is yes.
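
Coming back to the point about passing a string or a stream of characters: for a plain text file, both forms can be obtained with the JDK alone. The sketch below is only an illustration; the class name `TextSource` is hypothetical, and UTF-8 is an assumption about how your files are encoded.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Lucene.
class TextSource {

    // Whole file as a single String. Fine for small files.
    // Assumes the file is UTF-8 encoded.
    static String asString(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Stream of characters. Avoids loading the whole file into memory.
    // The caller is responsible for closing the returned Reader.
    static Reader asReader(Path file) throws IOException {
        return Files.newBufferedReader(file, StandardCharsets.UTF_8);
    }
}
```

Either result is the kind of input you would then hand over for indexing. For formats such as PDF, the text has to be extracted first, for example with Tika, before this step applies.
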