[jira] [Created] (JENA-1488) SelectiveFoldingFilter for jena-text

Osma Suominen (JIRA) Tue, 13 Feb 2018 05:39:46 -0800

Osma Suominen created JENA-1488:
-----------------------------------

             Summary: SelectiveFoldingFilter for jena-text
                 Key: JENA-1488
                 URL: https://issues.apache.org/jira/browse/JENA-1488
             Project: Apache Jena
          Issue Type: Improvement
          Components: Text
    Affects Versions: Jena 3.6.0
            Reporter: Osma Suominen



Currently there's some support for accent folding in jena-text, because Lucene 
provides an ASCIIFoldingFilter. When this filter is enabled, a search for "deja 
vu" will match the literal "déjà vu" in the data.

But we can't use it here at the National Library of Finland (for Finto.fi / 
Skosmos), because it folds too much! In the Finnish alphabet, in addition to 
the Latin a-z (which are in ASCII) we use the letters åäö and these should not 
be folded to ASCII. So we need a Lucene analyzer that can be configured with an 
exclude list, something like 
 
new SelectiveFoldingFilter(String excludeChars) 
 
and that can also be configured via the Jena assembler just like other 
analyzers supported by jena-text. 
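
To illustrate the intended behaviour, here is a minimal sketch in plain Java. 
It is not Lucene's ASCIIFoldingFilter and not the eventual implementation: the 
class name SelectiveFoldingSketch and its fold method are hypothetical, and it 
works on whole strings rather than on a Lucene TokenStream. It approximates 
folding by stripping combining marks after Unicode decomposition, so letters 
that have no decomposition (for example the Norwegian ø) pass through 
unchanged, whereas Lucene's filter uses a much larger mapping table.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of "fold accents, except for an exclude list".
// Assumption: folding is approximated by removing combining marks (NFD),
// which covers fewer characters than Lucene's ASCIIFoldingFilter.
class SelectiveFoldingSketch {

    // Code points that must never be folded.
    private final Set<Integer> excluded = new HashSet<>();

    SelectiveFoldingSketch(String excludeChars) {
        // Compose first so that "a + combining diaeresis" is stored as one code point.
        Normalizer.normalize(excludeChars, Normalizer.Form.NFC)
                .codePoints()
                .forEach(excluded::add);
    }

    String fold(String input) {
        // Compose the input too, so decomposed text matches the exclude list.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        composed.codePoints().forEach(cp -> {
            if (cp < 0x80 || excluded.contains(cp)) {
                // ASCII and excluded characters are copied unchanged.
                out.appendCodePoint(cp);
            } else {
                // Decompose the character and drop its combining marks.
                String decomposed = Normalizer.normalize(
                        new String(Character.toChars(cp)), Normalizer.Form.NFD);
                out.append(decomposed.replaceAll("\\p{M}+", ""));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Exclude list: a-ring, a-umlaut, o-umlaut in lower and upper case.
        SelectiveFoldingSketch finnish =
                new SelectiveFoldingSketch("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6");
        // The e-acute is folded, the a-umlauts are kept.
        System.out.println(finnish.fold("p\u00e4iv\u00e4 caf\u00e9"));
        // With an empty exclude list everything is folded.
        System.out.println(new SelectiveFoldingSketch("").fold("d\u00e9j\u00e0 vu"));
    }
}
```

With åäö (and their upper-case forms) on the exclude list, "päivä café" 
becomes "päivä cafe", while an empty exclude list folds "déjà vu" to "deja vu" 
as before. A real filter would apply the same per-character decision to each 
token's term text inside the analysis chain.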
 
This was also briefly discussed on the skosmos-users mailing list: 
[https://groups.google.com/d/msg/skosmos-users/x3zR_uRBQT0/Q90-O_iDAQAJ] 
Apparently Norwegians have the same problem...

I've discussed this with [~kinow] and he has some initial code to implement 
this feature, so I think we can turn this into a PR fairly soon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
