[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

Bruno P. Kinoshita (JIRA) Sat, 10 Mar 2018 12:37:50 -0800

    [ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394326#comment-16394326
 ]


Bruno P. Kinoshita commented on JENA-1488:
------------------------------------------

[~code-ferret] your alternative sounds like the best one. Perhaps I should have 
dug deeper into the ConfigurableAnalyzer after your first comment, sorry.

>I'm happy to open a separate ticket on this if there is interest. I've 
>sketched above the essence of the assembler syntax. The implementation will 
>use the same framework as for {{GenericAnalyzerAssembler}} and friends. The 
>{{ConfigurableAnalyzer}} will be modified so that the {{getTokenizer}} and 
>{{getTokenizerFilter}} use a {{Hashtable}}, as in {{Utils.java}}, to retrieve 
>the tokenizers and filters by name.

It does sound like a neat solution. I'm +1 for a separate ticket, and of course 
happy to review/test a pull request/patch.

>What parameter types are needed for the {{SelectiveFoldingFilter}}?

Just a java.util.List<Character>, but if necessary we can use a 
String/CharSequence/etc and build the list of chars out of it. This list is 
used as a white-list of characters that are not folded.
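
To make the parameter question concrete, here is a minimal sketch of how a 
String could be turned into that list and how the white-list might be applied. 
It is not the actual filter code: the class and method names are hypothetical, 
and java.text.Normalizer stands in for Lucene's ASCIIFoldingFilter, so it only 
strips combining marks and assumes precomposed (NFC) input.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the jena-text SelectiveFoldingFilter.
class SelectiveFoldingSketch {

    // Build the white-list from a String/CharSequence, one entry per char.
    static List<Character> toCharList(CharSequence excludeChars) {
        List<Character> list = new ArrayList<>();
        for (int i = 0; i < excludeChars.length(); i++) {
            list.add(excludeChars.charAt(i));
        }
        return list;
    }

    // Fold accented characters to their base letter unless white-listed.
    // Assumption: folding is approximated by NFD decomposition followed by
    // removal of combining marks, which covers fewer cases than Lucene does.
    static String fold(String input, List<Character> whiteList) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (whiteList.contains(c)) {
                out.append(c); // white-listed: keep as-is
            } else {
                String decomposed =
                    Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
                out.append(decomposed.replaceAll("\\p{M}", ""));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // White-list: a-ring, a-umlaut, o-umlaut (written as escapes)
        List<Character> keep = toCharList("\u00e5\u00e4\u00f6");
        // "deja vu" with accents on e and a: both are folded
        System.out.println(fold("d\u00e9j\u00e0 vu", keep));
        // a-umlaut and o-umlaut are kept, e-acute in "cafe" is folded
        System.out.println(fold("s\u00e4\u00e4ti\u00f6 caf\u00e9", keep));
    }
}
```

Keeping the list as List<Character> makes the membership test trivial. A 
String parameter would mostly be a convenience for the assembler configuration.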

Thanks!!!

 

> SelectiveFoldingFilter for jena-text
> ------------------------------------
>
>                 Key: JENA-1488
>                 URL: https://issues.apache.org/jira/browse/JENA-1488
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Text
>    Affects Versions: Jena 3.6.0
>            Reporter: Osma Suominen
>            Assignee: Bruno P. Kinoshita
>            Priority: Major
>
> Currently there's some support for accent folding in jena-text, because 
> Lucene provides an ASCIIFoldingFilter. When this filter is enabled, a search 
> for "deja vu" will match the literal "déjà vu" in the data.
> But we can't use it here at the National Library of Finland (for Finto.fi / 
> Skosmos), because it folds too much! In the Finnish alphabet, in addition to 
> the Latin a-z (which are in ASCII) we use the letters åäö and these should 
> not be folded to ASCII. So we need a Lucene analyzer that can be configured 
> with an exclude list, something like 
>  
> new SelectiveFoldingFilter(String excludeChars) 
>  
> and that can also be configured via the Jena assembler just like other 
> analyzers supported by jena-text. 
>  
> This was also briefly discussed on the skosmos-users mailing list: 
> [https://groups.google.com/d/msg/skosmos-users/x3zR_uRBQT0/Q90-O_iDAQAJ] 
> Apparently Norwegians have the same problem...
> I've discussed this with [~kinow] and he has some initial code to implement 
> this feature, so I think we can turn this into a PR fairly soon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
