Bruno P. Kinoshita commented on JENA-1488:
>The DefinedFilter solution sounds like the best from my perspective too.
Agreed! Just re-read [~code-ferret]'s previous comments, then jumped to have a
look at TextIndexLuceneAssembler/TextIndexLucene. And I believe I'm
understanding more what he meant. Will wait for his PR to review/test and see
how the SelectiveFoldingFilter would fit in the solution (I believe it will work
like a charm!).
>I'd still prefer the SelectiveFoldingFilter to live in the Jena codebase (for
>reasons of convenience stated above).
Agreed. IIUC, with the DefinedFilter/DefinedTokenizer approach, we will be able
to use the SelectiveFoldingFilter from my PR, or any other filter/tokenizer
combination from Lucene :D
> SelectiveFoldingFilter for jena-text
> Key: JENA-1488
> URL: https://issues.apache.org/jira/browse/JENA-1488
> Project: Apache Jena
> Issue Type: Improvement
> Components: Text
> Affects Versions: Jena 3.6.0
> Reporter: Osma Suominen
> Assignee: Bruno P. Kinoshita
> Priority: Major
> Currently there's some support for accent folding in jena-text, because
> Lucene provides an ASCIIFoldingFilter. When this filter is enabled, a search
> for "deja vu" will match the literal "déjà vu" in the data.
> But we can't use it here at the National Library of Finland (for Finto.fi /
> Skosmos), because it folds too much! In the Finnish alphabet, in addition to
> the Latin a-z (which are in ASCII) we use the letters åäö and these should
> not be folded to ASCII. So we need a Lucene analyzer that can be configured
> with an exclude list, something like
> new SelectiveFoldingFilter(String excludeChars)
> and that can also be configured via the Jena assembler, just like the other
> analyzers supported by jena-text.
> This was also briefly discussed on the skosmos-users mailing list:
> Apparently Norwegians have the same problem...
> I've discussed this with [~kinow] and he has some initial code to implement
> this feature, so I think we can turn this into a PR fairly soon.
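
The exclude-list behaviour requested in the description could look roughly like
the plain-Java sketch below. It is not the filter from the PR and does not use
Lucene. The class name SelectiveFoldingSketch and the method fold are
hypothetical, and Unicode decomposition (java.text.Normalizer) stands in for
the mapping table of ASCIIFoldingFilter, so letters without a canonical
decomposition (for example ø or æ) are left unchanged here.

```java
import java.text.Normalizer;

// Hypothetical illustration only: not the Jena/Lucene SelectiveFoldingFilter.
class SelectiveFoldingSketch {

    // Strips diacritics from every character except those in excludeChars.
    static String fold(String input, String excludeChars) {
        // Compose first so that "a" + combining diaeresis is compared as "ä".
        String text = Normalizer.normalize(input, Normalizer.Form.NFC);
        String keep = Normalizer.normalize(excludeChars, Normalizer.Form.NFC);

        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            if (keep.indexOf(cp) >= 0) {
                // Excluded character: copy it through unfolded.
                out.appendCodePoint(cp);
                continue;
            }

            // Decompose into base letter + combining marks, then drop the marks.
            String decomposed = Normalizer.normalize(
                    new String(Character.toChars(cp)), Normalizer.Form.NFD);
            decomposed.codePoints()
                    .filter(c -> Character.getType(c) != Character.NON_SPACING_MARK)
                    .forEach(out::appendCodePoint);
        }
        return out.toString();
    }
}
```

Under these assumptions, fold("déjà vu", "åäö") gives "deja vu", while
fold("pöytä", "åäö") leaves the word untouched. Matching is case-sensitive, so
an exclude list meant to protect Å, Ä and Ö as well would have to list them.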