[ https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392865#comment-16392865 ]
Bruno P. Kinoshita commented on JENA-1488:
------------------------------------------

Updated my current branch, removing the assembler changes and the analyzer. It now holds a single file, [/org/apache/jena/query/text/filter/SelectiveFoldingFilter.java|https://github.com/apache/jena/compare/apache:46e2f56...kinow:d90ffa0]

I have not yet added tests, squashed the commits, or removed the main method, as the code may still need some further massaging.

The output of the main method is now:

{noformat}
TERM = Senora
TERM = Siobhan
TERM = look
TERM = at
TERM = that
TERM = façade
{noformat}

So _façade_ keeps the cedilla, as it was white-listed. If the letter 'ñ' were added to the white-list, the first term would be _Señora_ instead. After checking the white-list, the code delegates the remaining characters to a method from the existing ASCIIFoldingFilter.

Now I just need to find a way to wire it up with the Jena text analyzers. I liked [~code-ferret], though I am not entirely sure where or how to update the ConfigurableAnalyzer. I tried using it, and noticed I couldn't pass the white-list when creating an analyzer/filter.

> SelectiveFoldingFilter for jena-text
> ------------------------------------
>
>                 Key: JENA-1488
>                 URL: https://issues.apache.org/jira/browse/JENA-1488
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Text
>    Affects Versions: Jena 3.6.0
>            Reporter: Osma Suominen
>            Assignee: Bruno P. Kinoshita
>            Priority: Major
>
> Currently there's some support for accent folding in jena-text, because
> Lucene provides an ASCIIFoldingFilter. When this filter is enabled, a search
> for "deja vu" will match the literal "déjà vu" in the data.
> But we can't use it here at the National Library of Finland (for Finto.fi /
> Skosmos), because it folds too much! In the Finnish alphabet, in addition to
> the Latin a-z (which are in ASCII) we use the letters åäö and these should
> not be folded to ASCII. So we need a Lucene analyzer that can be configured
> with an exclude list, something like
>
> new SelectiveFoldingFilter(String excludeChars)
>
> and that can also be configured via the Jena assembler just like other
> analyzers supported by jena-text.
>
> This was also briefly discussed on the skosmos-users mailing list:
> [https://groups.google.com/d/msg/skosmos-users/x3zR_uRBQT0/Q90-O_iDAQAJ]
> Apparently Norwegians have the same problem...
> I've discussed this with [~kinow] and he has some initial code to implement
> this feature, so I think we can turn this into a PR fairly soon.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
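
For readers following along, the white-list behaviour described in the comment can be illustrated with a minimal, JDK-only sketch. This is not the code from the branch and it does not use Lucene: the class name SelectiveFoldingSketch, the method fold, and the sample input string are all hypothetical, and java.text.Normalizer stands in for the ASCIIFoldingFilter method that the real filter delegates to, so characters without a canonical decomposition (for example 'ø') would be handled differently here.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not the SelectiveFoldingFilter from the branch.
final class SelectiveFoldingSketch {

    /**
     * Strips diacritics from every character of input, except for the
     * characters listed in whitelist, which are copied through unchanged.
     */
    static String fold(String input, String whitelist) {
        // Compose first so that "n" + combining tilde is seen as one character
        // and can be compared against a precomposed white-list entry.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        composed.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (whitelist.contains(ch)) {
                // White-listed: keep the accented character as it is.
                out.append(ch);
            } else {
                // Assumption: NFD decomposition plus removal of combining marks
                // is used here in place of Lucene's ASCII folding tables.
                String decomposed = Normalizer.normalize(ch, Normalizer.Form.NFD);
                out.append(decomposed.replaceAll("\\p{M}", ""));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample input, written with Unicode escapes:
        // n-tilde (\u00f1), a-acute (\u00e1), c-cedilla (\u00e7).
        String text = "Se\u00f1ora Siobh\u00e1n look at that fa\u00e7ade";
        String whitelist = "\u00e7"; // only the cedilla is white-listed
        for (String term : text.split(" ")) {
            System.out.println("TERM = " + fold(term, whitelist));
        }
    }
}
```

With only 'ç' white-listed, this sketch should print the same six TERM lines as in the comment. Passing "\u00f1" in the white-list as well would leave the first term as Señora, and a white-list of "åäö" would cover the Finnish case from the issue description.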