[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

Code Ferret (JIRA) Tue, 13 Feb 2018 11:54:24 -0800

    [ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16362953#comment-16362953
 ]


Code Ferret commented on JENA-1488:
-----------------------------------

Perhaps adding a new filter, especially one that has configurable arguments 
such as the {{excludeChars}}, is an opportunity to add extensions for defined 
filters and defined tokenizers. I've looked at {{ConfigurableAnalyzer}} and its 
assembler, and extending them this way should be straightforward.

I would add tokenizer and filter definitions to {{TextIndexLucene}} similar to 
the support for adding analyzers:
{code}
    text:defineFilters (
        [ text:defineFilter <#foo> ; 
          text:filter [ 
            a text:GenericFilter ;
            text:class "fi.finto.FoldingFilter" ;
            text:params (
                [ text:paramName "excludeChars" ;
                  text:paramType text:TypeString ; 
                  text:paramValue "whatevercharstoexclude" ]
                )
            ] ; 
          ]
      )
{code}
{{GenericFilterAssembler}} and {{GenericTokenizerAssembler}} would make use of 
much of the code in {{GenericAnalyzerAssembler}}. The changes to 
{{ConfigurableAnalyzer}} and {{ConfigurableAnalyzerAssembler}} are 
straightforward and mostly involve retaining the resource URI rather than 
extracting the localName.
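
To illustrate the kind of reflective instantiation such a generic assembler 
would rely on, here is a minimal sketch in plain Java. It is not the jena-text 
code: {{ReflectiveFactory}}, {{Param}} and {{instantiate}} are hypothetical 
names, and it assumes each {{text:params}} entry has already been resolved to a 
Java type and value. A real Lucene filter would presumably also need the 
upstream {{TokenStream}} as its first constructor argument, which is not 
modelled here.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of jena-text.
final class ReflectiveFactory {

    /** One constructor argument: its declared type and its value. */
    static final class Param {
        final Class<?> type;
        final Object value;

        Param(Class<?> type, Object value) {
            this.type = type;
            this.value = value;
        }
    }

    /**
     * Loads the named class from the classpath and calls the public
     * constructor whose parameter types match the given params exactly.
     */
    static Object instantiate(String className, List<Param> params)
            throws ReflectiveOperationException {
        Class<?>[] types = new Class<?>[params.size()];
        Object[] values = new Object[params.size()];
        for (int i = 0; i < params.size(); i++) {
            types[i] = params.get(i).type;
            values[i] = params.get(i).value;
        }
        Constructor<?> ctor = Class.forName(className).getConstructor(types);
        return ctor.newInstance(values);
    }

    public static void main(String[] args) throws Exception {
        // StringBuilder stands in for a drop-in class with a String parameter.
        Object built = instantiate("java.lang.StringBuilder",
                Arrays.asList(new Param(String.class, "excludeChars")));
        System.out.println(built);
    }
}
```

With something along these lines, the class named in {{text:class}} only has to 
be on the classpath and expose a constructor matching the declared parameter 
types.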

Such an addition would make it easy to create new tokenizers and filters that 
could be dropped in by just adding the classes onto the jena/fuseki classpath 
and putting the appropriate assembler bits in the configuration.

If there is interest, I should be able to implement this in a PR rather quickly.

> SelectiveFoldingFilter for jena-text
> ------------------------------------
>
>                 Key: JENA-1488
>                 URL: https://issues.apache.org/jira/browse/JENA-1488
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Text
>    Affects Versions: Jena 3.6.0
>            Reporter: Osma Suominen
>            Priority: Major
>
> Currently there's some support for accent folding in jena-text, because 
> Lucene provides an ASCIIFoldingFilter. When this filter is enabled, a search 
> for "deja vu" will match the literal "déjà vu" in the data.
> But we can't use it here at the National Library of Finland (for Finto.fi / 
> Skosmos), because it folds too much! In the Finnish alphabet, in addition to 
> the Latin a-z (which are in ASCII) we use the letters åäö and these should 
> not be folded to ASCII. So we need a Lucene analyzer that can be configured 
> with an exclude list, something like 
>  
> new SelectiveFoldingFilter(String excludeChars) 
>  
> and that can also be configured via the Jena assembler just like other 
> analyzers supported by jena-text. 
>  
> This was also briefly discussed on the skosmos-users mailing list: 
> [https://groups.google.com/d/msg/skosmos-users/x3zR_uRBQT0/Q90-O_iDAQAJ] 
> Apparently Norwegians have the same problem...
> I've discussed this with [~kinow] and he has some initial code to implement 
> this feature, so I think we can turn this into a PR fairly soon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
