Indeed, from time to time I have to read the lsearch2 code to understand
what was done before Cirrus was deployed.
Concerning Russian, I think we do: apparently lsearchd used a simple wrapper
around the Lucene Russian stemmer [1]. If there is other custom code, or
if you are aware of any regressions, I'd appreciate some links so we can
track them. I remember having seen some code (JS gadgets?) that does some
custom Russian stemming...

Concerning Hebrew, I hope we can find a good analyzer. According to the
comments in the code, the Hebrew analyzer that was tested appeared to be
unstable and was disabled. I hope that things are in better shape now; this
is the whole purpose of this new goal: to allocate some "official" bandwidth
to fixing and improving language analyzers.

One of the problems we will have to address is the maintainability of all
these language analyzers. We decided to start with Polish because one of
the Polish analyzers is supported by Elastic itself. This is a guarantee
for us that the code will always be up to date. There are many analyzers,
but too frequently the code is either unmaintained or too custom to be
properly integrated into our stack.

[1]
https://github.com/wikimedia/operations-debs-lucene-search-2/blob/master/src/org/wikimedia/lsearch/analyzers/RussianStemFilter.java#L20

On Wed, Jan 4, 2017 at 9:28 PM, Federico Leva (Nemo) <[email protected]>
wrote:

> Did we ever look into whether we managed to address all that the custom
> Lucene code used to do, especially for Russian?
> https://wikitech.wikimedia.org/wiki/Search/2013#Search_details_.28Java.29
>
> While we're at it, perhaps Hebrew's tokenization can be improved:
> https://phabricator.wikimedia.org/T154348#2912086
>
> Starting with Polish makes sense, however.
>
> Nemo
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
