[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

ASF GitHub Bot (JIRA) Thu, 12 Apr 2018 05:57:21 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435506#comment-16435506
 ]


ASF GitHub Bot commented on JENA-1488:
--------------------------------------

Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/395
  
    Example configuration used for testing:
    
    ```
    @prefix :        <#> .
    @prefix fuseki:  <http://jena.apache.org/fuseki#> .
    @prefix dc:      <http://purl.org/dc/elements/1.1/> .
    @prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
    @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
    @prefix text:    <http://jena.apache.org/text#> .
    @prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
    
    [] ja:loadClass "org.apache.jena.tdb.TDB" .
    tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
    tdb:GraphTDB    rdfs:subClassOf  ja:Model .
    
    [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
    text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
    text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
    
    [] rdf:type fuseki:Server ;
       fuseki:services (
         <#service_text_tdb>
       ) .
    
    <#service_text_tdb> rdf:type fuseki:Service ;
        rdfs:label                      "TDB/text service" ;
        fuseki:name                     "ds" ;
        fuseki:serviceQuery             "query" ;
        fuseki:serviceQuery             "sparql" ;
        fuseki:serviceUpdate            "update" ;
        fuseki:serviceUpload            "upload" ;
        fuseki:serviceReadGraphStore    "get" ;
        fuseki:serviceReadWriteGraphStore    "data" ;
        fuseki:dataset                  :text_dataset ;
    .
    
    :text_dataset rdf:type     text:TextDataset ;
        text:dataset   <#dataset> ;
        text:index     <#indexLucene> ;
        .
    
    <#dataset> rdf:type      tdb:DatasetTDB ;
        tdb:location "/tmp/db" ;
        tdb:unionDefaultGraph true ; # Optional
        .
    
    <#indexLucene> a text:TextIndexLucene ;
        text:directory <file:/tmp/lucene> ;
        text:entityMap <#entMap> ;
        text:storeValues true ;
        text:defineAnalyzers (
          [ 
            text:defineAnalyzer <#configuredAnalyzer> ;
            text:analyzer [
              a text:ConfigurableAnalyzer ;
              text:tokenizer <#tokenizer> ;
              text:filters ( :selectiveFoldingFilter text:LowerCaseFilter )
            ]
          ]
          [
            text:defineTokenizer <#tokenizer> ;
            text:tokenizer [
              a text:GenericTokenizer ;
              text:class "org.apache.lucene.analysis.core.LowerCaseTokenizer" 
            ]
          ]
          [
            text:defineFilter :selectiveFoldingFilter ;
            text:filter [
              a text:GenericFilter ;
              text:class "org.apache.jena.query.text.filter.SelectiveFoldingFilter" ;
              text:params (
                [ 
                  text:paramName "whitelisted" ;
                  text:paramType text:TypeSet ;
                  text:paramValue ("ç" "ä")
                ]
              )
            ]
          ]
        ) ;
        text:analyzer [
          a text:DefinedAnalyzer ;
          text:useAnalyzer <#configuredAnalyzer> 
        ] ;
        text:queryAnalyzer [ 
          a text:DefinedAnalyzer ;
          text:useAnalyzer <#configuredAnalyzer> 
        ] ;
        text:queryParser text:AnalyzingQueryParser ;
        text:multilingualSupport true ;
     .
    
    <#entMap> a text:EntityMap ;
        text:defaultField     "pref" ;
        text:entityField      "uri" ;
        text:uidField         "uid" ;
        text:langField        "lang" ;
        text:graphField       "graph" ;
        text:map (
             # skos:prefLabel
             [ text:field "pref" ;
               text:predicate skos:prefLabel
             ]
             # skos:altLabel
             [ text:field "alt" ;
               text:predicate skos:altLabel
             ]
             # skos:hiddenLabel
             [ text:field "hidden" ;
               text:predicate skos:hiddenLabel 
             ]
         ) 
     .
    ```


> SelectiveFoldingFilter for jena-text
> ------------------------------------
>
>                 Key: JENA-1488
>                 URL: https://issues.apache.org/jira/browse/JENA-1488
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Text
>    Affects Versions: Jena 3.6.0
>            Reporter: Osma Suominen
>            Assignee: Bruno P. Kinoshita
>            Priority: Major
>
> Currently there's some support for accent folding in jena-text, because 
> Lucene provides an ASCIIFoldingFilter. When this filter is enabled, a search 
> for "deja vu" will match the literal "déjà vu" in the data.
> But we can't use it here at the National Library of Finland (for Finto.fi / 
> Skosmos), because it folds too much! In the Finnish alphabet, in addition to 
> the Latin a-z (which are in ASCII) we use the letters åäö and these should 
> not be folded to ASCII. So we need a Lucene analyzer that can be configured 
> with an exclude list, something like 
>  
> new SelectiveFoldingFilter(String excludeChars) 
>  
> and that can also be configured via the Jena assembler just like other 
> analyzers supported by jena-text. 
>  
> This was also briefly discussed on the skosmos-users mailing list: 
> [https://groups.google.com/d/msg/skosmos-users/x3zR_uRBQT0/Q90-O_iDAQAJ] 
> Apparently Norwegians have the same problem...
> I've discussed this with [~kinow] and he has some initial code to implement 
> this feature, so I think we can turn this into a PR fairly soon.
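
To make the requested behaviour concrete, here is a minimal plain-Java sketch of selective folding. It is not the code from the pull request and does not use Lucene: the class name `SelectiveFoldingSketch`, the method `fold` and the `keep` parameter are hypothetical names chosen for illustration. It strips combining marks via Unicode normalization, which only approximates what ASCIIFoldingFilter does (characters with no canonical decomposition, such as "ø" or "ß", are left unchanged here).

```java
import java.text.Normalizer;
import java.util.Set;

// Illustration only: a stand-in for the idea behind SelectiveFoldingFilter,
// not the jena-text implementation.
final class SelectiveFoldingSketch {

    // Folds accented characters to their base letters, except those in keep.
    static String fold(String input, Set<Character> keep) {
        // Compose first so that "a" + combining diaeresis is seen as one char.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            if (keep.contains(c)) {
                out.append(c); // excluded from folding, copy unchanged
                continue;
            }
            // Decompose the character and drop its combining marks.
            String decomposed =
                Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            out.append(decomposed.replaceAll("\\p{M}+", ""));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Finnish letters that should survive folding: å, ä, ö.
        Set<Character> finnish = Set.of('\u00e5', '\u00e4', '\u00f6');
        // "déjà vu" contains no excluded letters, so it folds to "deja vu".
        System.out.println(fold("d\u00e9j\u00e0 vu", finnish));
        // "pää" consists of excluded letters after the 'p', so it is unchanged.
        System.out.println(fold("p\u00e4\u00e4", finnish));
    }
}
```

With an empty `keep` set the sketch behaves like plain accent folding; passing the letters to preserve plays the same role as the exclude list described above.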



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
