[jira] [Commented] (JENA-1652) jena-text analyzer regression

Code Ferret (JIRA) Sat, 15 Dec 2018 14:05:44 -0800


    [ 
https://issues.apache.org/jira/browse/JENA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722306#comment-16722306
 ]


Code Ferret commented on JENA-1652:
-----------------------------------

The regression was due to Lucene 7.4.0. Issue LUCENE-7355 added:
{code:java}
protected TokenStream normalize(String fieldName, TokenStream in)
{code}
to {{org.apache.lucene.analysis.Analyzer}}.

The new method is called during query parsing; the fix overrides it to apply:
{code:java}
return new LowerCaseFilter(in);
{code}
The default implementation was the identity, which is why the query string was 
not being lowercased.
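
To illustrate the mechanism, here is a minimal self-contained sketch in plain Java. It deliberately does not use Lucene: {{ToyAnalyzer}}, {{FixedToyAnalyzer}} and {{prefixQuery}} are hypothetical stand-ins that only model the two code paths (index-time analysis vs. query-time {{normalize}}), so treat it as a rough picture of the behaviour rather than the actual jena-text or Lucene code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class NormalizeSketch {

    /** Hypothetical stand-in for an analyzer; NOT Lucene's Analyzer class. */
    static class ToyAnalyzer {
        // Index-time path: the whole value becomes one lowercased keyword token.
        String analyze(String value) {
            return value.toLowerCase(Locale.ROOT);
        }

        // Query-time path for wildcard/prefix terms: identity by default,
        // mirroring the default described in the comment above.
        String normalize(String term) {
            return term;
        }
    }

    /** Same toy analyzer, but overriding normalize to lowercase query terms. */
    static class FixedToyAnalyzer extends ToyAnalyzer {
        @Override
        String normalize(String term) {
            return term.toLowerCase(Locale.ROOT);
        }
    }

    /** Toy prefix search: the query is assumed to look like "G*" (trailing wildcard). */
    static List<String> prefixQuery(ToyAnalyzer analyzer, List<String> labels, String query) {
        String rawPrefix = query.substring(0, query.length() - 1); // drop the '*'
        String prefix = analyzer.normalize(rawPrefix);             // query-time path only
        List<String> hits = new ArrayList<>();
        for (String label : labels) {
            if (analyzer.analyze(label).startsWith(prefix)) {      // compare with indexed form
                hits.add(label);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("Guppy");
        System.out.println(prefixQuery(new ToyAnalyzer(), labels, "G*"));      // []
        System.out.println(prefixQuery(new ToyAnalyzer(), labels, "g*"));      // [Guppy]
        System.out.println(prefixQuery(new FixedToyAnalyzer(), labels, "G*")); // [Guppy]
    }
}
```

With the identity {{normalize}}, the indexed token is {{guppy}} but the query prefix stays {{G}}, which matches the symptom in the report; once {{normalize}} lowercases, {{G*}} and {{g*}} behave the same.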

[PR #512|https://github.com/apache/jena/pull/512] fixes the issue. It is 
unfortunate that the [migration 
guide|https://lucene.apache.org/core/7_6_0/MIGRATE.html] didn't mention that 
external Analyzers might need to override {{normalize}}.

I'll close the PR tomorrow if there are no objections.

> jena-text analyzer regression
> -----------------------------
>
>                 Key: JENA-1652
>                 URL: https://issues.apache.org/jira/browse/JENA-1652
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: Text
>    Affects Versions: Jena 3.10.0
>         Environment: Ubuntu 16.04
> java version "1.8.0_191"
> Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
>            Reporter: Osma Suominen
>            Assignee: Osma Suominen
>            Priority: Major
>             Fix For: Jena 3.10.0
>
>
> I noticed that Skosmos unit tests are failing when run with Fuseki 3.10 
> snapshots:
> https://github.com/NatLibFi/Skosmos/issues/828
> Digging a bit deeper, it seems that jena-text is no longer applying the 
> analyzer on query strings as it used to in 3.9.0. The most likely reason for 
> this change seems to be the Lucene upgrade (JENA-1621) which may have 
> affected how analyzers are applied.
> Here is the text analyzer configuration I'm using:
> {noformat}
> <#indexLucene> a text:TextIndexLucene ;
>     ##text:directory <file:/tmp/lucene> ;
>     text:directory "mem" ;
>     text:entityMap <#entMap> ;
>     text:storeValues true ;
>     .
> <#entMap> a text:EntityMap ;
>     text:entityField      "uri" ;
>     text:graphField       "graph" ; ## enable graph-specific indexing
>     text:defaultField     "pref" ; ## Must be defined in the text:map
>     text:uidField         "uid" ;
>     text:langField        "lang" ;
>     text:map (
>          # skos:prefLabel
>          [ text:field "pref" ;
>            text:predicate skos:prefLabel ;
>            text:analyzer [ a text:LowerCaseKeywordAnalyzer ] ]
>          # skos:altLabel
>          [ text:field "alt" ;
>            text:predicate skos:altLabel ;
>            text:analyzer [ a text:LowerCaseKeywordAnalyzer ] ]
>          # skos:hiddenLabel
>          [ text:field "hidden" ;
>            text:predicate skos:hiddenLabel ;
>            text:analyzer [ a text:LowerCaseKeywordAnalyzer ] ]
>          ) .
> {noformat}
> Here is a minimal test file that I load into the default graph:
> {noformat}
> <http://example.org/guppy> <http://www.w3.org/2004/02/skos/core#prefLabel> 
> "Guppy"@en-gb .
> {noformat}
> This is the query I'm using:
> {noformat}
> PREFIX text: <http://jena.apache.org/text#>
> SELECT * {
>   ?s text:query 'G*' .
> }
> {noformat}
> It returns one row (?s=<http://example.org/guppy>) on Fuseki 3.9.0 but 
> nothing with today's 3.10 snapshot.
> If I change the 'G*' to lowercase 'g*' then I get the expected match also 
> with the 3.10 snapshot. So the analyzer (which should lowercase everything 
> and thus the case of the query string should be irrelevant) seems not to be 
> applied for the query string.



