[jira] [Created] (JENA-1523) "VARS requires a list of variables" exception w/spilling and renamed vars

2018-04-12 Thread Shawn Smith (JIRA)
Shawn Smith created JENA-1523:
-

 Summary: "VARS requires a list of variables" exception w/spilling 
and renamed vars
 Key: JENA-1523
 URL: https://issues.apache.org/jira/browse/JENA-1523
 Project: Apache Jena
  Issue Type: Bug
  Components: ARQ
Affects Versions: Jena 3.7.0
Reporter: Shawn Smith


Spilling a {{DistinctDataBag}} or {{SortedDataBag}} when executing SPARQL 
queries that are modified by {{TransformScopeRename}} can result in the 
following:
{noformat}
org.apache.jena.riot.RiotException: [line: 1, col: 7 ] VARS requires a list of variables (found '[SLASH]')
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:147)
    at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
    at org.apache.jena.riot.lang.LangEngine.exceptionDirect(LangEngine.java:143)
    at org.apache.jena.riot.lang.LangEngine.exception(LangEngine.java:137)
    at org.apache.jena.sparql.engine.binding.BindingInputStream.access$1900(BindingInputStream.java:64)
    at org.apache.jena.sparql.engine.binding.BindingInputStream$IteratorTuples.directiveVars(BindingInputStream.java:227)
    at org.apache.jena.sparql.engine.binding.BindingInputStream$IteratorTuples.directives(BindingInputStream.java:140)
    at org.apache.jena.sparql.engine.binding.BindingInputStream$IteratorTuples.<init>(BindingInputStream.java:129)
    at org.apache.jena.sparql.engine.binding.BindingInputStream.<init>(BindingInputStream.java:99)
    at org.apache.jena.sparql.engine.binding.BindingInputStream.<init>(BindingInputStream.java:78)
    at org.apache.jena.sparql.engine.binding.BindingInputStream.<init>(BindingInputStream.java:73)
    at org.apache.jena.riot.system.SerializationFactoryFinder$1.createDeserializer(SerializationFactoryFinder.java:56)
    at org.apache.jena.atlas.data.SortedDataBag.getInputIterator(SortedDataBag.java:190)
    at org.apache.jena.atlas.data.SortedDataBag.iterator(SortedDataBag.java:235)
    at org.apache.jena.atlas.data.SortedDataBag.iterator(SortedDataBag.java:206)
    at org.apache.jena.atlas.data.DistinctDataBag.iterator(DistinctDataBag.java:94)
{noformat}
The problem is that renaming a variable prepends a "/" to its name. As a 
result, the first line of the spill file can look like the following, which 
{{BindingInputStream.directiveVars()}} can't parse:
{noformat}
VARS ?/.1 ?/.0 ?v_2 ?v_21 ?v_1 .{noformat}
Here's a test case that reproduces the exception:
{noformat}
@Test
public void testWithRenamedVars() {
    ExprVar expr = (ExprVar) Rename.renameVars(new ExprVar("1"), Collections.emptySet());

    BindingMap binding = BindingFactory.create();
    binding.add(expr.asVar(), NodeFactory.createLiteral("foo"));

    SortedDataBag<Binding> dataBag = BagFactory.newSortedBag(
            new ThresholdPolicyCount<>(0),
            SerializationFactoryFinder.bindingSerializationFactory(),
            new BindingComparator(new ArrayList<>()));
    try {
        dataBag.add(binding);
        dataBag.flush();

        // Spill file looks like the following:
        // VARS ?/1 .
        // "foo" .

        Binding actual = dataBag.iterator().next();
        assertEquals(binding, actual);
    } finally {
        dataBag.close();
    }
}
{noformat}
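
To make the failing condition easier to spot when inspecting a spill file by hand, here is a minimal sketch that lists the variables in a VARS header line whose names contain a "/". It uses only the JDK and does not call any Jena API; the class name SpillHeaderCheck and the method renamedVars are made up for illustration, and the check is just a whitespace split, not the tokenizer that BindingInputStream actually uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jena: a plain-text check on a spill file header.
final class SpillHeaderCheck {

    /** Returns the "?name" tokens of a VARS header line whose names contain '/'. */
    static List<String> renamedVars(String header) {
        List<String> result = new ArrayList<>();
        // Split on whitespace; the real parser tokenizes differently.
        for (String token : header.trim().split("\\s+")) {
            if (token.startsWith("?") && token.indexOf('/') >= 0) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Header line taken from the report above.
        System.out.println(renamedVars("VARS ?/.1 ?/.0 ?v_2 ?v_21 ?v_1 ."));
        // prints [?/.1, ?/.0]
    }
}
```

An empty result for a header means the spill file contains no renamed variables of this kind.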



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436405#comment-16436405
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on a diff in the pull request:

https://github.com/apache/jena/pull/395#discussion_r181240620
  
--- Diff: 
jena-text/src/test/java/org/apache/jena/query/text/filter/TestSelectiveFoldingFilter.java
 ---
@@ -0,0 +1,135 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.query.text.filter;
+
+import static org.junit.Assert.assertTrue;
+
+import java.io.IOException;
+import java.io.StringReader;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.standard.StandardTokenizer;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.junit.Before;
+import org.junit.Test;
+
+/**
+ * Test {@link SelectiveFoldingFilter}.
+ */
+
+public class TestSelectiveFoldingFilter {
+
+    private StringReader inputText;
+    private CharArraySet whitelisted;
+
+    @Before
+    public void setUp() {
+        inputText = new StringReader("Señora Siobhán, look at that façade");
+    }
+
+    /**
+     * An empty white list means that the default behaviour of the Lucene's ASCIIFoldingFilter applies.
+     * @throws IOException from Lucene API
+     */
+    @Test
+    public void testEmptyWhiteListIsOkay() throws IOException {
+        whitelisted = new CharArraySet(Collections.emptyList(), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "facade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testSingleCharacterWhiteListed() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testCompleteWhiteListed() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ñ", "á", "ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        // here we should have the complete input
+        List<String> expected = Arrays.asList("Señora", "Siobhán", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testCaseMatters() throws IOException {
+        // note the first capital letter
+        whitelisted = new CharArraySet(Arrays.asList("Ñ", "á", "ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhán", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testMismatchWhiteList() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ú", "ć", "ž"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "facade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test(expected = NullPointerException.class)
+    public void testNullWhiteListThrowsError() throws IOException {
+        collectTokens(inputText, null);
+    }
+
+    @Test
+    public void testEmptyInput() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ç"), false);
+        inputText = new StringReader("");
+        List<String> tokens = 

[GitHub] jena pull request #395: JENA-1488: add a selective folding analyzer

2018-04-12 Thread kinow
Github user kinow commented on a diff in the pull request:

https://github.com/apache/jena/pull/395#discussion_r181240620
  
--- Diff: 
jena-text/src/test/java/org/apache/jena/query/text/filter/TestSelectiveFoldingFilter.java
 ---
@@ -0,0 +1,135 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.query.text.filter;
+
+import static org.junit.Assert.assertTrue;
+
+import java.io.IOException;
+import java.io.StringReader;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.standard.StandardTokenizer;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.junit.Before;
+import org.junit.Test;
+
+/**
+ * Test {@link SelectiveFoldingFilter}.
+ */
+
+public class TestSelectiveFoldingFilter {
+
+    private StringReader inputText;
+    private CharArraySet whitelisted;
+
+    @Before
+    public void setUp() {
+        inputText = new StringReader("Señora Siobhán, look at that façade");
+    }
+
+    /**
+     * An empty white list means that the default behaviour of the Lucene's ASCIIFoldingFilter applies.
+     * @throws IOException from Lucene API
+     */
+    @Test
+    public void testEmptyWhiteListIsOkay() throws IOException {
+        whitelisted = new CharArraySet(Collections.emptyList(), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "facade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testSingleCharacterWhiteListed() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testCompleteWhiteListed() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ñ", "á", "ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        // here we should have the complete input
+        List<String> expected = Arrays.asList("Señora", "Siobhán", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testCaseMatters() throws IOException {
+        // note the first capital letter
+        whitelisted = new CharArraySet(Arrays.asList("Ñ", "á", "ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhán", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testMismatchWhiteList() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ú", "ć", "ž"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "facade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test(expected = NullPointerException.class)
+    public void testNullWhiteListThrowsError() throws IOException {
+        collectTokens(inputText, null);
+    }
+
+    @Test
+    public void testEmptyInput() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ç"), false);
+        inputText = new StringReader("");
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Collections.emptyList();
+        assertTrue(tokens.equals(expected));
+    }
+
+    /**
+     * Return the list of CharTermAttribute converted to 

[jira] [Created] (JENA-1522) Unable to consistently retrieve data from large dataset

2018-04-12 Thread Brian Mullen (JIRA)
Brian Mullen created JENA-1522:
--

 Summary: Unable to consistently retrieve data from large dataset
 Key: JENA-1522
 URL: https://issues.apache.org/jira/browse/JENA-1522
 Project: Apache Jena
  Issue Type: Bug
  Components: Fuseki, Jena
Affects Versions: Jena 3.6.0
 Environment: System 1:  Centos 7, Jena 3.6, Unknown Fuseki version.

System 2:  Ubuntu 16.04 running Docker.  Running stain/jena-fuseki from the 
official Docker Hub.

 
Reporter: Brian Mullen


In my 500M+ triple dataset, queries seem to be failing for no clear reason. 
Here's an example.
{code:java}
prefix Products:  
select ?p ?o 
where { 
Products:ABC ?p ?o . 
}
{code}
...results in a list like:
{code:java}
Products:HasComponent Products:DEF 
Products:HasComponent Products:GHI {code}
Now running this query:
{code:java}
prefix Products:  
select ?p 
where { 
Products:ABC ?p Products:DEF . 
} {code}
...has no results. How is this possible?

 

Here's another example.
{code:java}
prefix Products:  
ask where { 
Products:ABC Products:PartNumber ?p . 
filter ( ?p = "ABC" ) 
} {code}
This returns "True"

 
{code:java}
prefix Products:  
ask where { 
?s Products:PartNumber ?p . 
filter ( ?p = "ABC" ) 
} {code}
This returns "False"

 

What other info can I provide?





[jira] [Updated] (JENA-1521) TDB2 backed Datasets cannot be re-opened.

2018-04-12 Thread Greg Albiston (JIRA)

 [ 
https://issues.apache.org/jira/browse/JENA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Albiston updated JENA-1521:

Environment: 
Apache Jena: 3.7.0

Java: 1.8_162

> TDB2 backed Datasets cannot be re-opened.
> -
>
> Key: JENA-1521
> URL: https://issues.apache.org/jira/browse/JENA-1521
> Project: Apache Jena
>  Issue Type: Bug
> Environment: Apache Jena: 3.7.0
> Java: 1.8_162
>Reporter: Greg Albiston
>Priority: Major
>
> If a Dataset is opened with TDB2Factory.connectDataset(), closed and then 
> later re-opened, the re-opened Dataset is reported as closed.
> Opening, closing and re-opening a Dataset with TDBFactory.createDataset() 
> causes no issues.
> Example code to reproduce:
> public void testTDB2OpenClose() {
>     System.out.println("TDB2 Open Close");
>     try {
>         Dataset dataset = TDB2Factory.connectDataset("test_tdb2");
>         dataset.begin(ReadWrite.WRITE);
>         Model defaultModel = dataset.getDefaultModel();
>         defaultModel.add(ResourceFactory.createResource("http://example.org/my#SubjA"),
>                 ResourceFactory.createProperty("http://example.org/my#PropA"),
>                 ResourceFactory.createResource("http://example.org/my#ObjA"));
>         dataset.commit();
>         dataset.end();
>         dataset.close();
>
>         Dataset dataset2 = TDB2Factory.connectDataset("test_tdb2");
>         dataset2.begin(ReadWrite.READ);
>         Model readModel = dataset2.getDefaultModel();
>         Iterator<Statement> statements = readModel.listStatements();
>         while (statements.hasNext()) {
>             Statement statement = statements.next();
>             System.out.println(statement);
>         }
>         dataset2.end();
>         dataset2.close();
>     } catch (Exception ex) {
>         System.out.println("Exception: " + ex.getMessage());
>     }
> }
>
> public void testTDB1OpenClose() {
>     System.out.println("TDB1 Open Close");
>     try {
>         Dataset dataset = TDBFactory.createDataset("test_tdb1");
>         dataset.begin(ReadWrite.WRITE);
>         Model defaultModel = dataset.getDefaultModel();
>         defaultModel.add(ResourceFactory.createResource("http://example.org/my#SubjA"),
>                 ResourceFactory.createProperty("http://example.org/my#PropA"),
>                 ResourceFactory.createResource("http://example.org/my#ObjA"));
>         dataset.commit();
>         dataset.end();
>         dataset.close();
>
>         Dataset dataset2 = TDBFactory.createDataset("test_tdb1");
>         dataset2.begin(ReadWrite.READ);
>         Model readModel = dataset2.getDefaultModel();
>         Iterator<Statement> statements = readModel.listStatements();
>         while (statements.hasNext()) {
>             Statement statement = statements.next();
>             System.out.println(statement);
>         }
>         dataset2.end();
>         dataset2.close();
>     } catch (Exception ex) {
>         System.out.println("Exception: " + ex.getMessage());
>     }
> }
>  





[jira] [Created] (JENA-1521) TDB2 backed Datasets cannot be re-opened.

2018-04-12 Thread Greg Albiston (JIRA)
Greg Albiston created JENA-1521:
---

 Summary: TDB2 backed Datasets cannot be re-opened.
 Key: JENA-1521
 URL: https://issues.apache.org/jira/browse/JENA-1521
 Project: Apache Jena
  Issue Type: Bug
Reporter: Greg Albiston


If a Dataset is opened with TDB2Factory.connectDataset(), closed and then 
later re-opened, the re-opened Dataset is reported as closed.

Opening, closing and re-opening a Dataset with TDBFactory.createDataset() 
causes no issues.

Example code to reproduce:

public void testTDB2OpenClose() {
    System.out.println("TDB2 Open Close");
    try {
        Dataset dataset = TDB2Factory.connectDataset("test_tdb2");
        dataset.begin(ReadWrite.WRITE);
        Model defaultModel = dataset.getDefaultModel();
        defaultModel.add(ResourceFactory.createResource("http://example.org/my#SubjA"),
                ResourceFactory.createProperty("http://example.org/my#PropA"),
                ResourceFactory.createResource("http://example.org/my#ObjA"));
        dataset.commit();
        dataset.end();
        dataset.close();

        Dataset dataset2 = TDB2Factory.connectDataset("test_tdb2");
        dataset2.begin(ReadWrite.READ);
        Model readModel = dataset2.getDefaultModel();
        Iterator<Statement> statements = readModel.listStatements();
        while (statements.hasNext()) {
            Statement statement = statements.next();
            System.out.println(statement);
        }
        dataset2.end();
        dataset2.close();
    } catch (Exception ex) {
        System.out.println("Exception: " + ex.getMessage());
    }
}

public void testTDB1OpenClose() {
    System.out.println("TDB1 Open Close");
    try {
        Dataset dataset = TDBFactory.createDataset("test_tdb1");
        dataset.begin(ReadWrite.WRITE);
        Model defaultModel = dataset.getDefaultModel();
        defaultModel.add(ResourceFactory.createResource("http://example.org/my#SubjA"),
                ResourceFactory.createProperty("http://example.org/my#PropA"),
                ResourceFactory.createResource("http://example.org/my#ObjA"));
        dataset.commit();
        dataset.end();
        dataset.close();

        Dataset dataset2 = TDBFactory.createDataset("test_tdb1");
        dataset2.begin(ReadWrite.READ);
        Model readModel = dataset2.getDefaultModel();
        Iterator<Statement> statements = readModel.listStatements();
        while (statements.hasNext()) {
            Statement statement = statements.next();
            System.out.println(statement);
        }
        dataset2.end();
        dataset2.close();
    } catch (Exception ex) {
        System.out.println("Exception: " + ex.getMessage());
    }
}

 





[GitHub] jena pull request #396: JENA-1520: tdb2.tdbstats: cmd and fix for rdf:type

2018-04-12 Thread afs
GitHub user afs opened a pull request:

https://github.com/apache/jena/pull/396

JENA-1520: tdb2.tdbstats: cmd and fix for rdf:type



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/afs/jena tdb2-tdbstats

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/jena/pull/396.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #396


commit 55516f717caba7d41c4b78ca298f4c030cedcddf
Author: Andy Seaborne 
Date:   2018-04-12T16:18:29Z

JENA-1520: tdb2.tdbstats: cmd and fix for rdf:type




---


[jira] [Commented] (JENA-1520) tdb2.tdbstats

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435883#comment-16435883
 ] 

ASF GitHub Bot commented on JENA-1520:
--

GitHub user afs opened a pull request:

https://github.com/apache/jena/pull/396

JENA-1520: tdb2.tdbstats: cmd and fix for rdf:type



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/afs/jena tdb2-tdbstats

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/jena/pull/396.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #396


commit 55516f717caba7d41c4b78ca298f4c030cedcddf
Author: Andy Seaborne 
Date:   2018-04-12T16:18:29Z

JENA-1520: tdb2.tdbstats: cmd and fix for rdf:type




> tdb2.tdbstats
> -
>
> Key: JENA-1520
> URL: https://issues.apache.org/jira/browse/JENA-1520
> Project: Apache Jena
>  Issue Type: Bug
>  Components: TDB2
>Affects Versions: Jena 3.7.0
>Reporter: Andy Seaborne
>Assignee: Andy Seaborne
>Priority: Minor
> Fix For: Jena 3.8.0
>
>
> {{tdb2.tdbstats}} works mostly but breaks when trying to find the 
> {{rdf:type}} node.
>  





[jira] [Created] (JENA-1520) tdb2.tdbstats

2018-04-12 Thread Andy Seaborne (JIRA)
Andy Seaborne created JENA-1520:
---

 Summary: tdb2.tdbstats
 Key: JENA-1520
 URL: https://issues.apache.org/jira/browse/JENA-1520
 Project: Apache Jena
  Issue Type: Bug
  Components: TDB2
Affects Versions: Jena 3.7.0
Reporter: Andy Seaborne
Assignee: Andy Seaborne
 Fix For: Jena 3.8.0


{{tdb2.tdbstats}} works mostly but breaks when trying to find the {{rdf:type}} 
node.

 





[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435558#comment-16435558
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user rvesse commented on a diff in the pull request:

https://github.com/apache/jena/pull/395#discussion_r181083359
  
--- Diff: 
jena-text/src/test/java/org/apache/jena/query/text/filter/TestSelectiveFoldingFilter.java
 ---
@@ -0,0 +1,135 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.query.text.filter;
+
+import static org.junit.Assert.assertTrue;
+
+import java.io.IOException;
+import java.io.StringReader;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.standard.StandardTokenizer;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.junit.Before;
+import org.junit.Test;
+
+/**
+ * Test {@link SelectiveFoldingFilter}.
+ */
+
+public class TestSelectiveFoldingFilter {
+
+    private StringReader inputText;
+    private CharArraySet whitelisted;
+
+    @Before
+    public void setUp() {
+        inputText = new StringReader("Señora Siobhán, look at that façade");
+    }
+
+    /**
+     * An empty white list means that the default behaviour of the Lucene's ASCIIFoldingFilter applies.
+     * @throws IOException from Lucene API
+     */
+    @Test
+    public void testEmptyWhiteListIsOkay() throws IOException {
+        whitelisted = new CharArraySet(Collections.emptyList(), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "facade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testSingleCharacterWhiteListed() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testCompleteWhiteListed() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ñ", "á", "ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        // here we should have the complete input
+        List<String> expected = Arrays.asList("Señora", "Siobhán", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testCaseMatters() throws IOException {
+        // note the first capital letter
+        whitelisted = new CharArraySet(Arrays.asList("Ñ", "á", "ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhán", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testMismatchWhiteList() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ú", "ć", "ž"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "facade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test(expected = NullPointerException.class)
+    public void testNullWhiteListThrowsError() throws IOException {
+        collectTokens(inputText, null);
+    }
+
+    @Test
+    public void testEmptyInput() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ç"), false);
+        inputText = new StringReader("");
+        List<String> tokens = 

[GitHub] jena pull request #395: JENA-1488: add a selective folding analyzer

2018-04-12 Thread rvesse
Github user rvesse commented on a diff in the pull request:

https://github.com/apache/jena/pull/395#discussion_r181083359
  
--- Diff: 
jena-text/src/test/java/org/apache/jena/query/text/filter/TestSelectiveFoldingFilter.java
 ---
@@ -0,0 +1,135 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.query.text.filter;
+
+import static org.junit.Assert.assertTrue;
+
+import java.io.IOException;
+import java.io.StringReader;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.standard.StandardTokenizer;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.junit.Before;
+import org.junit.Test;
+
+/**
+ * Test {@link SelectiveFoldingFilter}.
+ */
+
+public class TestSelectiveFoldingFilter {
+
+    private StringReader inputText;
+    private CharArraySet whitelisted;
+
+    @Before
+    public void setUp() {
+        inputText = new StringReader("Señora Siobhán, look at that façade");
+    }
+
+    /**
+     * An empty white list means that the default behaviour of the Lucene's ASCIIFoldingFilter applies.
+     * @throws IOException from Lucene API
+     */
+    @Test
+    public void testEmptyWhiteListIsOkay() throws IOException {
+        whitelisted = new CharArraySet(Collections.emptyList(), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "facade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testSingleCharacterWhiteListed() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testCompleteWhiteListed() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ñ", "á", "ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        // here we should have the complete input
+        List<String> expected = Arrays.asList("Señora", "Siobhán", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testCaseMatters() throws IOException {
+        // note the first capital letter
+        whitelisted = new CharArraySet(Arrays.asList("Ñ", "á", "ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhán", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testMismatchWhiteList() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ú", "ć", "ž"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "facade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test(expected = NullPointerException.class)
+    public void testNullWhiteListThrowsError() throws IOException {
+        collectTokens(inputText, null);
+    }
+
+    @Test
+    public void testEmptyInput() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ç"), false);
+        inputText = new StringReader("");
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Collections.emptyList();
+        assertTrue(tokens.equals(expected));
+    }
+
+    /**
+     * Return the list of CharTermAttribute converted 

[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435515#comment-16435515
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  
Used `luke` to inspect the Lucene index that was created, and everything checked out. 
I struggled a bit with the queries, but that was my mistake. 


> SelectiveFoldingFilter for jena-text
> 
>
> Key: JENA-1488
> URL: https://issues.apache.org/jira/browse/JENA-1488
> Project: Apache Jena
>  Issue Type: Improvement
>  Components: Text
>Affects Versions: Jena 3.6.0
>Reporter: Osma Suominen
>Assignee: Bruno P. Kinoshita
>Priority: Major
>
> Currently there's some support for accent folding in jena-text, because 
> Lucene provides an ASCIIFoldingFilter. When this filter is enabled, a search 
> for "deja vu" will match the literal "déjà vu" in the data.
> But we can't use it here at the National Library of Finland (for Finto.fi / 
> Skosmos), because it folds too much! In the Finnish alphabet, in addition to 
> the Latin a-z (which are in ASCII) we use the letters åäö and these should 
> not be folded to ASCII. So we need a Lucene analyzer that can be configured 
> with an exclude list, something like 
>  
> new SelectiveFoldingFilter(String excludeChars) 
>  
> and that can also be configured via the Jena assembler just like other 
> analyzers supported by jena-text. 
>  
> This was also briefly discussed on the skosmos-users mailing list: 
> [https://groups.google.com/d/msg/skosmos-users/x3zR_uRBQT0/Q90-O_iDAQAJ] 
> Apparently Norwegians have the same problem...
> I've discussed this with [~kinow] and he has some initial code to implement 
> this feature, so I think we can turn this into a PR fairly soon.
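
The behaviour requested above (fold accents, but leave an exclude list untouched) can be illustrated with a small JDK-only sketch. This is not the code from the pull request and not Lucene's ASCIIFoldingFilter: it only strips combining marks via java.text.Normalizer, so characters that ASCIIFoldingFilter maps by table (for example "ø" or "ß") are left unchanged here. The class name SelectiveFoldingSketch and the method fold are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical illustration only; the real filter would be a Lucene TokenFilter.
final class SelectiveFoldingSketch {

    /** Strips diacritics from every character except those listed in excludeChars. */
    static String fold(String input, String excludeChars) {
        // Compose first so each accented letter is one char that can be matched.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            if (excludeChars.indexOf(c) >= 0) {
                out.append(c); // excluded character: keep as-is
            } else {
                // Decompose and drop the combining marks (Unicode category M).
                String decomposed = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
                out.append(decomposed.replaceAll("\\p{M}", ""));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "deja vu": nothing is excluded, so both accents are folded.
        System.out.println(fold("d\u00e9j\u00e0 vu", ""));
        // The Finnish letters are excluded, so only the e-acute is folded.
        System.out.println(fold("H\u00e4meenlinna caf\u00e9", "\u00e5\u00e4\u00f6"));
    }
}
```

As in the test class under review, matching is case-sensitive in this sketch: excluding an upper-case letter does not protect its lower-case form.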





[GitHub] jena issue #395: JENA-1488: add a selective folding analyzer

2018-04-12 Thread kinow
Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  
Used `luke` to inspect the Lucene index that was created, and everything checked out. 
I struggled a bit with the queries, but that was my mistake. 


---


[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435512#comment-16435512
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  
As you can see, because the configuration white-lists only `ä`, 
the `ö` is folded to ASCII by the filter.



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435509#comment-16435509
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  

![screenshot_2018-04-13_00-51-46](https://user-images.githubusercontent.com/304786/38678650-c0c8cfbe-3eb5-11e8-83f7-72ac846cf661.png)

![screenshot_2018-04-13_00-52-12](https://user-images.githubusercontent.com/304786/38678651-c0fd40aa-3eb5-11e8-95f9-30f36e523089.png)




[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435508#comment-16435508
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  
Then I started Fuseki in Eclipse (FusekiCmd, with `--config /.../fuseki.ttl`). 
After loading the data file into the /ds/ endpoint, everything worked as expected. 
I loaded a modified `books.ttl` and got the following:



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435506#comment-16435506
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  
Example configuration used for testing:

```
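@prefix :       <#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix dc:   .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:    <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:   <http://jena.apache.org/text#> .
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .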

[] ja:loadClass "org.apache.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset  rdfs:subClassOf   ja:RDFDataset .
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

[] rdf:type fuseki:Server ;
   fuseki:services (
 <#service_text_tdb>
   ) .

<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label                        "TDB/text service" ;
    fuseki:name                       "ds" ;
    fuseki:serviceQuery               "query" ;
    fuseki:serviceQuery               "sparql" ;
    fuseki:serviceUpdate              "update" ;
    fuseki:serviceUpload              "upload" ;
    fuseki:serviceReadGraphStore      "get" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:dataset                    :text_dataset ;
    .

:text_dataset rdf:type text:TextDataset ;
    text:dataset <#dataset> ;
    text:index   <#indexLucene> ;
    .

<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "/tmp/db" ;
    tdb:unionDefaultGraph true ; # Optional
    .

<#indexLucene> a text:TextIndexLucene ;
    text:directory  ;
    text:entityMap <#entMap> ;
    text:storeValues true ;
    text:defineAnalyzers (
        [
            text:defineAnalyzer <#configuredAnalyzer> ;
            text:analyzer [
                a text:ConfigurableAnalyzer ;
                text:tokenizer <#tokenizer> ;
                text:filters ( :selectiveFoldingFilter text:LowerCaseFilter )
            ]
        ]
        [
            text:defineTokenizer <#tokenizer> ;
            text:tokenizer [
                a text:GenericTokenizer ;
                text:class "org.apache.lucene.analysis.core.LowerCaseTokenizer"
            ]
        ]
        [
            text:defineFilter :selectiveFoldingFilter ;
            text:filter [
                a text:GenericFilter ;
                text:class "org.apache.jena.query.text.filter.SelectiveFoldingFilter" ;
                text:params (
                    [
                        text:paramName "whitelisted" ;
                        text:paramType text:TypeSet ;
                        text:paramValue ("ç" "ä")
                    ]
                )
            ]
        ]
    ) ;
    text:analyzer [
        a text:DefinedAnalyzer ;
        text:useAnalyzer <#configuredAnalyzer>
    ] ;
    text:queryAnalyzer [
        a text:DefinedAnalyzer ;
        text:useAnalyzer <#configuredAnalyzer>
    ] ;
    text:queryParser text:AnalyzingQueryParser ;
    text:multilingualSupport true ;
    .

<#entMap> a text:EntityMap ;
    text:defaultField "pref" ;
    text:entityField  "uri" ;
    text:uidField     "uid" ;
    text:langField    "lang" ;
    text:graphField   "graph" ;
    text:map (
        # skos:prefLabel
        [ text:field "pref" ;
          text:predicate skos:prefLabel
        ]
        # skos:altLabel
        [ text:field "alt" ;
          text:predicate skos:altLabel
        ]
        # skos:hiddenLabel
        [ text:field "hidden" ;
          text:predicate skos:hiddenLabel
        ]
    )
    .
```


[GitHub] jena pull request #395: JENA-1488: add a selective folding analyzer

2018-04-12 Thread kinow
GitHub user kinow opened a pull request:

https://github.com/apache/jena/pull/395

JENA-1488: add a selective folding analyzer

This PR adds a selective folding analyzer, as explained in JENA-1488.

It takes a list of characters, used as a white list. Everything that is not 
in the white list gets passed through the existing ASCIIFoldingFilter.
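
To illustrate the white-list idea, here is a minimal, self-contained sketch. It is not the code from this PR: the class name `SelectiveFoldingSketch` and the method `fold` are hypothetical, and folding is approximated with `java.text.Normalizer` (NFD decomposition, then dropping combining marks) rather than Lucene's ASCIIFoldingFilter, which covers many more characters. It also assumes the input uses precomposed (NFC) characters.

```java
import java.text.Normalizer;
import java.util.Set;

// Hypothetical helper for illustration only; not the class added by the PR.
final class SelectiveFoldingSketch {

    private SelectiveFoldingSketch() {}

    // Folds every character except ASCII and the white-listed ones.
    // Folding is approximated by NFD decomposition followed by removal of
    // combining marks; Lucene's ASCIIFoldingFilter handles more cases.
    static String fold(String input, Set<Character> whitelist) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80 || whitelist.contains(c)) {
                // ASCII and white-listed characters pass through unchanged.
                out.append(c);
                continue;
            }
            String decomposed =
                Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            for (int j = 0; j < decomposed.length(); j++) {
                char d = decomposed.charAt(j);
                // Drop the combining accents, keep the base letter.
                if (Character.getType(d) != Character.NON_SPACING_MARK) {
                    out.append(d);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // White list: c-cedilla (U+00E7) and a-umlaut (U+00E4).
        Set<Character> whitelist = Set.of('\u00e7', '\u00e4');
        // e-acute and a-grave are not white-listed, so they are folded.
        System.out.println(fold("d\u00e9j\u00e0 vu", whitelist)); // deja vu
        // a-umlaut is kept, o-umlaut (U+00F6) is folded to plain o.
        System.out.println(fold("\u00e4\u00f6", whitelist));
    }
}
```

The sketch works on whole strings for brevity; a real Lucene filter would apply the same per-character decision inside a token stream.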

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kinow/jena selective-folding-analyzer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/jena/pull/395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #395


commit de1bd22a58f76bbac41d16cb7111ed85b98279cd
Author: Bruno P. Kinoshita 
Date:   2018-04-09T09:38:14Z

JENA-1488: add a selective folding analyzer




---

