[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

Bruno P. Kinoshita (JIRA) Sat, 10 Mar 2018 01:24:06 -0800

    [ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394112#comment-16394112
 ]


Bruno P. Kinoshita commented on JENA-1488:
------------------------------------------

I had a bit more spare time today, so I took a refresher course on Lucene 
analyzers and also read the code and docs for Jena Text.

Right now we have a filter that may work for this issue. To use it from Jena, 
I believe we have the following options, in no particular order:
 * Modify the ConfigurableAnalyzer to support filters with parameters (though I 
think changing the ConfigurableAnalyzer could cause some incompatibility for 
users, and would have to have its own ticket).
 * Add a `setAccessible(true)` to the constructor found via reflection in the 
GenericAnalyzerAssembler, allowing the use of CustomAnalyzer (not quite 
elegant, as we are supposed to use the builder provided by the analyzer, and 
setAccessible may fail in different environments due to security constraints).
 * Create an analyzer that uses the selective folding filter.
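
To make the second option concrete, here is a minimal sketch of what forcing 
access to a non-public constructor looks like in plain Java. It does not use 
the actual GenericAnalyzerAssembler or CustomAnalyzer code: 
`BuilderOnlyAnalyzer` and `instantiate` are hypothetical names invented for 
this illustration, and the single String constructor argument is an assumption.

```java
import java.lang.reflect.Constructor;

final class ReflectiveConstruction {

    /** Hypothetical stand-in for a class that expects callers to use its builder. */
    static final class BuilderOnlyAnalyzer {
        final String name;

        private BuilderOnlyAnalyzer(String name) {
            this.name = name;
        }

        /** The intended way to obtain an instance. */
        static BuilderOnlyAnalyzer build(String name) {
            return new BuilderOnlyAnalyzer(name);
        }
    }

    /** Bypasses the builder by opening up the non-public constructor. */
    static <T> T instantiate(Class<T> type, String arg) {
        try {
            // getDeclaredConstructor also finds non-public constructors.
            Constructor<T> ctor = type.getDeclaredConstructor(String.class);
            // This is the fragile step: it can be refused by a security
            // manager or by module encapsulation, depending on the runtime.
            ctor.setAccessible(true);
            return ctor.newInstance(arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + type.getName(), e);
        }
    }
}
```

The call to `setAccessible(true)` is the part that depends on the environment, 
which is why this option is the least attractive of the three.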

Thoughts? Any other alternatives?
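
For reference, the behaviour the filter has to provide can be sketched without 
Lucene at all. This is not the code of the existing filter and not Lucene's 
ASCIIFoldingFilter: it is a stand-in that uses Unicode decomposition from the 
JDK, so it only folds characters that decompose into a base letter plus 
combining marks. The class name `SelectiveFolding` and the method `fold` are 
assumptions made for this example.

```java
import java.text.Normalizer;

final class SelectiveFolding {

    private SelectiveFolding() {
    }

    /**
     * Strips combining accents from every character of the input except those
     * listed in excludeChars, which are copied through unchanged.
     * excludeChars is expected to hold precomposed (NFC) characters.
     */
    static String fold(String input, String excludeChars) {
        // Compose first, so that "a" + combining diaeresis is seen as one character.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        composed.codePoints().forEach(cp -> {
            if (excludeChars.indexOf(cp) >= 0) {
                // Protected character: keep it as it is.
                out.appendCodePoint(cp);
                return;
            }
            // Decompose this one character and drop its combining marks.
            String single = new String(Character.toChars(cp));
            Normalizer.normalize(single, Normalizer.Form.NFD)
                    .codePoints()
                    .filter(c -> Character.getType(c) != Character.NON_SPACING_MARK)
                    .forEach(out::appendCodePoint);
        });
        return out.toString();
    }
}
```

With an exclude list of "åäö", `SelectiveFolding.fold` turns "déjà vu" into 
"deja vu" but leaves a word such as "hääyö" untouched. In a real Lucene 
TokenFilter the same per-character decision would be applied to the term 
buffer of each token.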

> SelectiveFoldingFilter for jena-text
> ------------------------------------
>
>                 Key: JENA-1488
>                 URL: https://issues.apache.org/jira/browse/JENA-1488
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Text
>    Affects Versions: Jena 3.6.0
>            Reporter: Osma Suominen
>            Assignee: Bruno P. Kinoshita
>            Priority: Major
>
> Currently there's some support for accent folding in jena-text, because 
> Lucene provides an ASCIIFoldingFilter. When this filter is enabled, a search 
> for "deja vu" will match the literal "déjà vu" in the data.
> But we can't use it here at the National Library of Finland (for Finto.fi / 
> Skosmos), because it folds too much! In the Finnish alphabet, in addition to 
> the Latin a-z (which are in ASCII) we use the letters åäö and these should 
> not be folded to ASCII. So we need a Lucene analyzer that can be configured 
> with an exclude list, something like 
>  
> new SelectiveFoldingFilter(String excludeChars) 
>  
> and that can also be configured via the Jena assembler just like other 
> analyzers supported by jena-text. 
>  
> This was also briefly discussed on the skosmos-users mailing list: 
> [https://groups.google.com/d/msg/skosmos-users/x3zR_uRBQT0/Q90-O_iDAQAJ] 
> Apparently Norwegians have the same problem...
> I've discussed this with [~kinow] and he has some initial code to implement 
> this feature, so I think we can turn this into a PR fairly soon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
