[
https://issues.apache.org/jira/browse/JENA-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991714#comment-14991714
]
ASF GitHub Bot commented on JENA-1062:
--------------------------------------
Github user ajs6f commented on a diff in the pull request:
https://github.com/apache/jena/pull/97#discussion_r44015443
--- Diff:
jena-text/src/test/java/org/apache/jena/query/text/TestDatasetWithConfigurableAnalyzer.java
---
@@ -0,0 +1,63 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.query.text;
+
+import java.util.Arrays ;
+import java.util.HashSet ;
+import java.util.Set ;
+
+import org.apache.jena.atlas.lib.StrUtils ;
+import org.junit.Before ;
+import org.junit.Test ;
+
+/**
+ * This class defines a setup configuration for a dataset that uses an
ASCII folding lowercase keyword analyzer with a Lucene index.
+ */
+public class TestDatasetWithConfigurableAnalyzer extends
TestDatasetWithLowerCaseKeywordAnalyzer {
+ @Override
+ @Before
+ public void before() {
+ init(StrUtils.strjoinNL(
+ "text:ConfigurableAnalyzer ;",
+ "text:tokenizer text:KeywordTokenizer ;",
+ "text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)"
+ ));
+ }
+
+ @Test
+ public void testConfigurableAnalyzerIsCaseAndAccentInsensitive() {
+ final String testName =
"testConfigurableAnalyzerIsCaseAndAccentInsensitive";
+ final String turtle = StrUtils.strjoinNL(
+ TURTLE_PROLOG,
+ "<" + RESOURCE_BASE + testName + ">",
+ " rdfs:label 'Feeling a déjà vu'",
+ "."
+ );
+ String queryString = StrUtils.strjoinNL(
+ QUERY_PROLOG,
+ "SELECT ?s",
+ "WHERE {",
+ " ?s text:query ( rdfs:label '\"feeling ä déja\"*' 10 )
.",
+ "}"
+ );
+ Set<String> expectedURIs = new HashSet<>() ;
--- End diff --
Our shaded Guava offers various `Sets::newHashSet` functions for this kind
of use, which read a pinch better.
> add ConfigurableAnalyzer to jena-text
> -------------------------------------
>
> Key: JENA-1062
> URL: https://issues.apache.org/jira/browse/JENA-1062
> Project: Apache Jena
> Issue Type: New Feature
> Components: Text
> Reporter: Osma Suominen
> Assignee: Osma Suominen
>
> This is an alternative to JENA-1058 (which implemented a very specific Lucene
> Analyzer for jena-text). The idea here, based on a comment by Claude Warren
> on JENA-1058, is to provide a ConfigurableAnalyzer that can be configured
> with a Tokenizer and (optionally) one or more TokenFilters, like this:
> text:analyzer [
> a text:ConfigurableAnalyzer ;
> text:tokenizer text:KeywordTokenizer ;
> text:filters (text:ASCIIFoldingFilter, text:LowerCaseFilter)
> ]
> I have some code ready to implement this and will open a PR shortly.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)