[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-25 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452058#comment-16452058
 ] 

ASF subversion and git services commented on JENA-1488:
---

Commit f3c0abacee7a32057b9b977a0a3a7a443e869560 in jena's branch 
refs/heads/master from [~andy.seaborne]
[ https://git-wip-us.apache.org/repos/asf?p=jena.git;h=f3c0aba ]

JENA-1488: Remove warnings, use try-resource


> SelectiveFoldingFilter for jena-text
> 
>
> Key: JENA-1488
> URL: https://issues.apache.org/jira/browse/JENA-1488
> Project: Apache Jena
>  Issue Type: Improvement
>  Components: Text
>Affects Versions: Jena 3.6.0
>Reporter: Osma Suominen
>Assignee: Bruno P. Kinoshita
>Priority: Major
> Fix For: Jena 3.8.0
>
>
> Currently there's some support for accent folding in jena-text, because 
> Lucene provides an ASCIIFoldingFilter. When this filter is enabled, a search 
> for "deja vu" will match the literal "déjà vu" in the data.
> But we can't use it here at the National Library of Finland (for Finto.fi / 
> Skosmos), because it folds too much! In the Finnish alphabet, in addition to 
> the Latin a-z (which are in ASCII) we use the letters åäö and these should 
> not be folded to ASCII. So we need a Lucene analyzer that can be configured 
> with an exclude list, something like 
>  
> new SelectiveFoldingFilter(String excludeChars) 
>  
> and that can also be configured via the Jena assembler just like other 
> analyzers supported by jena-text. 
>  
> This was also briefly discussed on the skosmos-users mailing list: 
> [https://groups.google.com/d/msg/skosmos-users/x3zR_uRBQT0/Q90-O_iDAQAJ] 
> Apparently Norwegians have the same problem...
> I've discussed this with [~kinow] and he has some initial code to implement 
> this feature, so I think we can turn this into a PR fairly soon.
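
The behaviour requested above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code merged for JENA-1488 and not Lucene's ASCIIFoldingFilter: the class name `SelectiveFoldingSketch` and the method `fold` are hypothetical, and the folding is approximated with `java.text.Normalizer` (decompose, then drop combining marks), which covers fewer characters than Lucene's folding table (for example, it leaves "ø" and "ß" untouched).

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: fold accented characters except those on an exclude list.
public class SelectiveFoldingSketch {

    /**
     * Strips combining marks from every character of input,
     * except for characters that appear in excludeChars.
     */
    public static String fold(String input, String excludeChars) {
        // Collect the code points that must be left untouched.
        Set<Integer> keep = new HashSet<>();
        excludeChars.codePoints().forEach(keep::add);

        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (keep.contains(cp)) {
                // Excluded character: copy it through unchanged.
                out.appendCodePoint(cp);
            } else {
                // Decompose (NFD) and drop the non-spacing marks, keeping the base letter.
                String single = new String(Character.toChars(cp));
                Normalizer.normalize(single, Normalizer.Form.NFD)
                        .codePoints()
                        .filter(c -> Character.getType(c) != Character.NON_SPACING_MARK)
                        .forEach(out::appendCodePoint);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // With an empty exclude list everything decomposable is folded.
        System.out.println(fold("d\u00e9j\u00e0 vu", ""));
        // With the Finnish letters excluded, they survive while other accents are folded.
        System.out.println(fold("Se\u00f1ora h\u00e4\u00e4y\u00f6", "\u00e5\u00e4\u00f6"));
    }
}
```

The sketch assumes the input is in precomposed (NFC) form, so that an excluded letter such as "ä" arrives as a single code point and can be matched against the exclude list.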



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-23 Thread Andy Seaborne (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448358#comment-16448358
 ] 

Andy Seaborne commented on JENA-1488:
-

Can we resolve this now then? 

It really helps if JIRA issues are "resolved" as soon as the work is done - they 
can be reopened if new information/feedback comes along, but if just left open 
they tend to stay open past their time.

Looking for "Resolved" helps prepare for a release. "Close" at release then 
resets for the next cycle.




[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-23 Thread Bruno P. Kinoshita (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447748#comment-16447748
 ] 

Bruno P. Kinoshita commented on JENA-1488:
--

>Maybe the new filter could also be mentioned under "ConfigurableAnalyzer", 
>specifically under "The available TokenFilter implementations are:", where 
>currently ASCIIFoldingFilter is mentioned.

Not kidding, I thought about adding it there too! I didn't add it because I 
got confused with the GenericAnalyzer. I thought it wouldn't work because it 
didn't have an empty constructor (more specifically, I thought it had something 
like this [part of the 
code|https://github.com/apache/jena/blob/a56f7710d88e368824042863e1ebef9afc7fd5f3/jena-text/src/main/java/org/apache/jena/query/text/assembler/GenericAnalyzerAssembler.java#L199]).

My mistake. Added a mention to the new filter under ConfigurableAnalyzer.

>I'm starting to think that the jena-text docs really should be split up into 
>several pages, they are getting hard to navigate. Code Ferret did excellent 
>work cleaning them up some time ago, but now the page is even longer. But 
>further reorganizing the docs is outside the scope of this issue.

+1 ! And maybe it now deserves one or more entries under Tutorials as 
well :-)



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-23 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447744#comment-16447744
 ] 

ASF subversion and git services commented on JENA-1488:
---

Commit 1829835 from [~kinow] in branch 'site/trunk'
[ https://svn.apache.org/r1829835 ]

JENA-1488: include SelectiveFoldingFilter to the list of available filters for 
ConfigurableAnalyzer



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-23 Thread Osma Suominen (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447710#comment-16447710
 ] 

Osma Suominen commented on JENA-1488:
-

[~kinow] Ah right! I didn't realize you had already added that part to the 
staging docs. Looks like a good start!

Maybe the new filter could also be mentioned under "ConfigurableAnalyzer", 
specifically under "The available TokenFilter implementations are:", where 
currently ASCIIFoldingFilter is mentioned.

I'm starting to think that the jena-text docs really should be split up into 
several pages, they are getting hard to navigate. [~code-ferret] did excellent 
work cleaning them up some time ago, but now the page is even longer. But 
further reorganizing the docs is outside the scope of this issue.



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-23 Thread Bruno P. Kinoshita (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447695#comment-16447695
 ] 

Bruno P. Kinoshita commented on JENA-1488:
--

[~osma] thanks! Sorry, I forgot to mention it here, but I committed something to 
the Subversion repository before closing the issue, under the "Defined Analyzers" 
section. I thought about where it would be best to put it... I tried a few different 
places, but thought adding a short version to the defined analyzers could be a 
good beginning?

What do you think?

http://jena.staging.apache.org/documentation/query/text-query.html#defined-analyzers



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-23 Thread Osma Suominen (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447684#comment-16447684
 ] 

Osma Suominen commented on JENA-1488:
-

[~kinow] Great work! But I think this needs to be documented too. I think the 
best place would be the [jena-text 
docs|https://jena.apache.org/documentation/query/text-query.html]. Can you do 
that as well or should I try?

Reopening the issue, since the normal convention is that issues are closed when 
both the code and docs are ready.



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447037#comment-16447037
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow closed the pull request at:

https://github.com/apache/jena/pull/395




[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447036#comment-16447036
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  
Merged in a56f7710d88e368824042863e1ebef9afc7fd5f3, closing (forgot to use 
the magic words in the commit to close this PR automatically)




[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-21 Thread Bruno P. Kinoshita (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447035#comment-16447035
 ] 

Bruno P. Kinoshita commented on JENA-1488:
--

Rebased against master, then switched to master and merged, squashing to a single 
commit (easier in case it needs reverting/updates/etc). Ran 
`mvn clean test install -Pdev` locally, all looking good, pushing to master. 
Change merged! :D



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447034#comment-16447034
 ] 

ASF subversion and git services commented on JENA-1488:
---

Commit a56f7710d88e368824042863e1ebef9afc7fd5f3 in jena's branch 
refs/heads/master from [~brunodepau...@yahoo.com.br]
[ https://git-wip-us.apache.org/repos/asf?p=jena.git;h=a56f771 ]

JENA-1488: add a selective folding analyzer




[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-21 Thread Bruno P. Kinoshita (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447032#comment-16447032
 ] 

Bruno P. Kinoshita commented on JENA-1488:
--

Docs added in the SVN commit above. The docs name Jena 3.7.1 as the release 
with this functionality, but I used Jena 3.8.0 in this ticket, as there's no 
3.7.1 planned yet.

If there is a 3.7.1, I will leave the docs as is and check whether this 
ticket was moved from Jena 3.8.0 to Jena 3.7.1. Otherwise, if the next version 
is 3.8.0, I will update the docs to reflect it.



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447031#comment-16447031
 ] 

ASF subversion and git services commented on JENA-1488:
---

Commit 1829756 from [~kinow] in branch 'site/trunk'
[ https://svn.apache.org/r1829756 ]

JENA-1488: add selective folding filter documentation



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438679#comment-16438679
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  
Thanks for testing it @osma ! If there are no objections by next weekend 
I will merge it. With 3.7.0 out, we probably have some more time for testing, 
and then to update the documentation. In the meantime, I will jump back to JENA-632 
:D




[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437350#comment-16437350
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user osma commented on the issue:

https://github.com/apache/jena/pull/395
  
I tested this locally and it seems to work as it should. Great work!




[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436405#comment-16436405
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on a diff in the pull request:

https://github.com/apache/jena/pull/395#discussion_r181240620
  
--- Diff: 
jena-text/src/test/java/org/apache/jena/query/text/filter/TestSelectiveFoldingFilter.java
 ---
@@ -0,0 +1,135 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.query.text.filter;
+
+import static org.junit.Assert.assertTrue;
+
+import java.io.IOException;
+import java.io.StringReader;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.standard.StandardTokenizer;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.junit.Before;
+import org.junit.Test;
+
+/**
+ * Test {@link SelectiveFoldingFilter}.
+ */
+
+public class TestSelectiveFoldingFilter {
+
+    private StringReader inputText;
+    private CharArraySet whitelisted;
+
+    @Before
+    public void setUp() {
+        inputText = new StringReader("Señora Siobhán, look at that façade");
+    }
+
+    /**
+     * An empty white list means that the default behaviour of the Lucene's ASCIIFoldingFilter applies.
+     * @throws IOException from Lucene API
+     */
+    @Test
+    public void testEmptyWhiteListIsOkay() throws IOException {
+        whitelisted = new CharArraySet(Collections.emptyList(), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "facade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testSingleCharacterWhiteListed() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testCompleteWhiteListed() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ñ", "á", "ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        // here we should have the complete input
+        List<String> expected = Arrays.asList("Señora", "Siobhán", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testCaseMatters() throws IOException {
+        // note the first capital letter
+        whitelisted = new CharArraySet(Arrays.asList("Ñ", "á", "ç"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhán", "look", "at", "that", "façade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test
+    public void testMismatchWhiteList() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ú", "ć", "ž"), false);
+        List<String> tokens = collectTokens(inputText, whitelisted);
+        List<String> expected = Arrays.asList("Senora", "Siobhan", "look", "at", "that", "facade");
+        assertTrue(tokens.equals(expected));
+    }
+
+    @Test(expected = NullPointerException.class)
+    public void testNullWhiteListThrowsError() throws IOException {
+        collectTokens(inputText, null);
+    }
+
+    @Test
+    public void testEmptyInput() throws IOException {
+        whitelisted = new CharArraySet(Arrays.asList("ç"), false);
+        inputText = new StringReader("");
+        List<String> tokens = 

[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435558#comment-16435558
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user rvesse commented on a diff in the pull request:

https://github.com/apache/jena/pull/395#discussion_r181083359
  
--- Diff: 
jena-text/src/test/java/org/apache/jena/query/text/filter/TestSelectiveFoldingFilter.java
 ---

[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435515#comment-16435515
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  
Used `luke` to inspect the Lucene index that was created, and everything checked out. 
I struggled a bit with the queries, but that was my mistake. 




[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435512#comment-16435512
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  
As can be seen, because the configuration white-lists `ä` but not `ö`, 
the `ö` is folded by the filter.




[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435509#comment-16435509
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  

![screenshot_2018-04-13_00-51-46](https://user-images.githubusercontent.com/304786/38678650-c0c8cfbe-3eb5-11e8-83f7-72ac846cf661.png)

![screenshot_2018-04-13_00-52-12](https://user-images.githubusercontent.com/304786/38678651-c0fd40aa-3eb5-11e8-95f9-30f36e523089.png)





[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435508#comment-16435508
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  
Then I started Fuseki in Eclipse (FusekiCmd, with --config /.../fuseki.ttl). 
After loading the data file into the /ds/ endpoint, everything works as expected. I 
loaded a modified `books.ttl` and got the following:




[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435506#comment-16435506
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/395
  
Example configuration used for testing:

```
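@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .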

[] ja:loadClass "org.apache.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset  rdfs:subClassOf   ja:RDFDataset .
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

[] rdf:type fuseki:Server ;
   fuseki:services (
 <#service_text_tdb>
   ) .

<#service_text_tdb> rdf:type fuseki:Service ;
rdfs:label  "TDB/text service" ;
fuseki:name "ds" ;
fuseki:serviceQuery "query" ;
fuseki:serviceQuery "sparql" ;
fuseki:serviceUpdate "update" ;
fuseki:serviceUpload "upload" ;
fuseki:serviceReadGraphStore "get" ;
fuseki:serviceReadWriteGraphStore "data" ;
fuseki:dataset  :text_dataset ;
.

:text_dataset rdf:type text:TextDataset ;
text:dataset   <#dataset> ;
text:index <#indexLucene> ;
.

<#dataset> rdf:type  tdb:DatasetTDB ;
tdb:location "/tmp/db" ;
tdb:unionDefaultGraph true ; # Optional
.

<#indexLucene> a text:TextIndexLucene ;
text:directory  ;
text:entityMap <#entMap> ;
text:storeValues true ;
text:defineAnalyzers (
  [ 
text:defineAnalyzer <#configuredAnalyzer> ;
text:analyzer [
  a text:ConfigurableAnalyzer ;
  text:tokenizer <#tokenizer> ;
  text:filters ( :selectiveFoldingFilter text:LowerCaseFilter )
]
  ]
  [
text:defineTokenizer <#tokenizer> ;
text:tokenizer [
  a text:GenericTokenizer ;
  text:class "org.apache.lucene.analysis.core.LowerCaseTokenizer" 
]
  ]
  [
text:defineFilter :selectiveFoldingFilter ;
text:filter [
  a text:GenericFilter ;
  text:class 
"org.apache.jena.query.text.filter.SelectiveFoldingFilter" ;
  text:params (
[ 
  text:paramName "whitelisted" ;
  text:paramType text:TypeSet ;
  text:paramValue ("ç" "ä")
]
  )
]
  ]
) ;
text:analyzer [
  a text:DefinedAnalyzer ;
  text:useAnalyzer <#configuredAnalyzer> 
] ;
text:queryAnalyzer [ 
  a text:DefinedAnalyzer ;
  text:useAnalyzer <#configuredAnalyzer> 
] ;
text:queryParser text:AnalyzingQueryParser ;
text:multilingualSupport true ;
 .

<#entMap> a text:EntityMap ;
text:defaultField "pref" ;
text:entityField  "uri" ;
text:uidField "uid" ;
text:langField"lang" ;
text:graphField   "graph" ;
text:map (
 # skos:prefLabel
 [ text:field "pref" ;
   text:predicate skos:prefLabel
 ]
 # skos:altLabel
 [ text:field "alt" ;
   text:predicate skos:altLabel
 ]
 # skos:hiddenLabel
 [ text:field "hidden" ;
   text:predicate skos:hiddenLabel 
 ]
 ) 
 .
```



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-04-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435501#comment-16435501
 ] 

ASF GitHub Bot commented on JENA-1488:
--

GitHub user kinow opened a pull request:

https://github.com/apache/jena/pull/395

JENA-1488: add a selective folding analyzer

This PR adds a selective folding analyzer, as explained in JENA-1488.

It takes a list of characters, used as a white list. Everything that is not 
in the white list gets passed through the existing ASCIIFoldingFilter.
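
The white-list behaviour described above can be sketched without Lucene. This is only an illustration and not the code in this PR: the class `SelectiveFoldSketch` and its `fold` method are hypothetical names, and `java.text.Normalizer` is used as a stand-in for Lucene's ASCIIFoldingFilter (it only strips combining marks, so it covers fewer characters than the real filter). The escapes `\u00f1`, `\u00e1`, `\u00e7`, `\u00e4` and `\u00f6` denote ñ, á, ç, ä and ö.

```java
import java.text.Normalizer;
import java.util.Set;

// Illustration only: hypothetical class, not part of jena-text.
class SelectiveFoldSketch {

    // Folds every character that is not in the white list.
    // Accent stripping via Normalizer is an assumed stand-in for ASCIIFoldingFilter.
    // Assumes the input contains only characters from the Basic Multilingual Plane.
    static String fold(String input, Set<Character> whitelisted) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (whitelisted.contains(c)) {
                out.append(c); // white-listed: keep the character untouched
            } else {
                // Decompose into base letter plus combining marks, then drop the marks.
                String decomposed = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
                out.append(decomposed.replaceAll("\\p{M}", ""));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Only c-cedilla is white-listed, so n-tilde and a-acute are folded.
        System.out.println(fold("Se\u00f1ora Siobh\u00e1n, look at that fa\u00e7ade", Set.of('\u00e7')));
        // Only a-umlaut is white-listed, so o-umlaut is folded.
        System.out.println(fold("p\u00e4iv\u00e4 y\u00f6", Set.of('\u00e4')));
    }
}
```

The match is case-sensitive in this sketch: white-listing an upper-case letter does not protect its lower-case form.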

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kinow/jena selective-folding-analyzer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/jena/pull/395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #395


commit de1bd22a58f76bbac41d16cb7111ed85b98279cd
Author: Bruno P. Kinoshita 
Date:   2018-04-09T09:38:14Z

JENA-1488: add a selective folding analyzer






[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-03-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403924#comment-16403924
 ] 

ASF GitHub Bot commented on JENA-1488:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/385
  
+1 @rvesse comments. And +1 for the great PR @xristy !!! Tested with the 
WIP `SelectiveFoldingFilter` for JENA-1488, and it worked like a charm! 
:tada: :tada: :tada: 

First created a normal JenaText configuration, and everything worked with 
no issues. Then replaced my `analyzer` by a `DefinedAnalyzer`... had a few 
hiccups forgetting to create a tokenizer, and had to update the filter to use a 
`CharArraySet` (much better that way I think).

But in the end got the query working as expected. Confirmed with Luke that 
the contents were indexed correctly with the new filter... did a quick search 
in Luke, then used Fuseki, and everything worked as expected.

Once this one is merged, we'll be ready for another PR for JENA-1488. In 
the meantime, I will start working on documentation and tests !

Thanks!
Bruno




[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-03-13 Thread Bruno P. Kinoshita (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396750#comment-16396750
 ] 

Bruno P. Kinoshita commented on JENA-1488:
--

>The DefinedFilter solution sounds like the best from my perspective too.

Agreed! Just re-read [~code-ferret]'s previous comments, then jumped to have a 
look at TextIndexLuceneAssembler/TextIndexLucene. And I believe I now 
understand better what he meant. Will wait for his PR to review/test and see 
how the SelectiveFoldingFilter would fit in the solution (I believe it will work 
like a charm!).

>I'd still prefer the SelectiveFoldingFilter to live in the Jena codebase (for 
>reasons of convenience stated above).

Agreed. IIUC, with the DefinedFilter/DefinedTokenizer approach, we will be able 
to use the SelectiveFoldingFilter from my PR, or any other filter/tokenizer 
combination from Lucene :D

 



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-03-12 Thread Osma Suominen (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395391#comment-16395391
 ] 

Osma Suominen commented on JENA-1488:
-

The DefinedFilter solution sounds like the best from my perspective too. I'd 
still prefer the SelectiveFoldingFilter to live in the Jena codebase (for 
reasons of convenience stated above).



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-03-10 Thread Bruno P. Kinoshita (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394326#comment-16394326
 ] 

Bruno P. Kinoshita commented on JENA-1488:
--

[~code-ferret] your alternative sounds like the best one. Perhaps I should have 
dug deeper into the ConfigurableAnalyzer after your first comment, sorry.

>I'm happy to open a separate ticket on this if there is interest. I've 
>sketched above the essence of the assembler syntax. The implementation will 
>use the same framework as for {{GenericAnalyzerAssembler}} and friends. The 
>{{ConfigurableAnalyzer}} will be modified so that the {{getTokenizer}} and 
>{{getTokenizerFilter}} use a {{Hashtable}}, as in {{Utils.java}}, to retrieve 
>the tokenizers and filters by name.

It does sound like a neat solution. I'm +1 for a separate ticket, and of course 
happy to review/test a pull request/patch.

>What parameter types are needed for the {{SelectiveFoldingFilter}}?

Just a java.util.List, but if necessary we can use a 
String/CharSequence/etc and build the list of chars out of it. This list is 
used as a white-list of characters that are not folded.
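
Building the white list from a String could look roughly like the sketch below. The class `WhitelistFromString` and the method `toCharSet` are hypothetical names used only for illustration; they are not part of Jena or Lucene. The escapes `\u00e5`, `\u00e4` and `\u00f6` denote å, ä and ö.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustration only: hypothetical helper, not part of jena-text.
class WhitelistFromString {

    // Collects each char of the given sequence into a set; duplicates collapse.
    static Set<Character> toCharSet(CharSequence chars) {
        Set<Character> set = new LinkedHashSet<>();
        for (int i = 0; i < chars.length(); i++) {
            set.add(chars.charAt(i));
        }
        return set;
    }

    public static void main(String[] args) {
        // a-ring, a-umlaut and o-umlaut become a three-element white list.
        Set<Character> whitelisted = toCharSet("\u00e5\u00e4\u00f6");
        System.out.println(whitelisted.size());
    }
}
```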

Thanks!!!

 



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-03-10 Thread Code Ferret (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394281#comment-16394281
 ] 

Code Ferret commented on JENA-1488:
---

[Bruno P. 
Kinoshita|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=kinow], 
as mentioned earlier, adding {{DefinedFilter}} and {{DefinedTokenizer}} that 
work seamlessly without any backwards compatibility issues with the 
{{ConfigurableAnalyzer}} is quite straightforward. The 
{{SelectiveFoldingFilter}} can be easily added to Jena as a _built-in_ filter 
that can be configured as needed.

I'm happy to open a separate ticket on this if there is interest. I've sketched 
above the essence of the assembler syntax. The implementation will use the same 
framework as for {{GenericAnalyzerAssembler}} and friends. The 
{{ConfigurableAnalyzer}} will be modified so that the {{getTokenizer}} and 
{{getTokenizerFilter}} use a {{Hashtable}}, as in {{Utils.java}}, to retrieve 
the tokenizers and filters by name.
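
To make the lookup-by-name idea concrete, here is a minimal sketch. It is not the {{ConfigurableAnalyzer}} code: the {{FilterRegistry}} class is hypothetical, it uses a {{HashMap}} instead of a {{Hashtable}}, and a filter is simplified to a String-to-String function where a real Lucene filter wraps a {{TokenStream}}.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a name -> filter lookup. A "filter" here is just a
// String -> String function; real Lucene token filters wrap a TokenStream.
final class FilterRegistry {
    private final Map<String, UnaryOperator<String>> filters = new HashMap<>();

    // Registers a filter under the name used in the configuration.
    void define(String name, UnaryOperator<String> filter) {
        filters.put(name, filter);
    }

    // Retrieves a filter by name, failing loudly on unknown names.
    UnaryOperator<String> get(String name) {
        UnaryOperator<String> filter = filters.get(name);
        if (filter == null) {
            throw new IllegalArgumentException("Unknown filter: " + name);
        }
        return filter;
    }

    public static void main(String[] args) {
        FilterRegistry registry = new FilterRegistry();
        registry.define("LowerCaseFilter", s -> s.toLowerCase(Locale.ROOT));
        System.out.println(registry.get("LowerCaseFilter").apply("Jena")); // jena
    }
}
```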

What parameter types are needed for the {{SelectiveFoldingFilter}}?



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-03-10 Thread Bruno P. Kinoshita (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394112#comment-16394112
 ] 

Bruno P. Kinoshita commented on JENA-1488:
--

I had a bit more spare time today, so I took a refresher course on Lucene 
analyzers and also read the code and docs for Jena Text.

Right now we have a filter that may work for this issue. In order to use it 
from Jena, I believe we have the following options, in no particular order:
 * Modify the ConfigurableAnalyzer to support filters with parameters (though I 
think changing the ConfigurableAnalyzer could cause some incompatibility for 
users, and would have to have its own ticket).
 * Add a `setAccessible(true)` to the constructor found via reflection in the 
GenericAnalyzerAssembler, allowing the use of CustomAnalyzer (not quite 
elegant, as we are supposed to use the builder provided by the analyzer, and 
setAccessible may fail in different environments due to security constraints).
 * Create an analyzer that uses the selective folding filter.
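
For reference, the second option boils down to something like the sketch below. It does not touch {{GenericAnalyzerAssembler}} or {{CustomAnalyzer}}; the {{Hidden}} class is a hypothetical stand-in for any class with a non-public constructor. The catch block is where the failure mentioned above would show up in a restricted environment.

```java
import java.lang.reflect.Constructor;

final class ReflectiveAccess {
    // Hypothetical stand-in for a class whose constructor is not public.
    static final class Hidden {
        final String label;
        private Hidden(String label) { this.label = label; }
    }

    static Hidden create(String label) {
        try {
            // getDeclaredConstructor also finds non-public constructors.
            Constructor<Hidden> ctor = Hidden.class.getDeclaredConstructor(String.class);
            // May throw SecurityException or InaccessibleObjectException,
            // depending on the security manager or module settings.
            ctor.setAccessible(true);
            return ctor.newInstance(label);
        } catch (ReflectiveOperationException | RuntimeException e) {
            throw new IllegalStateException("Cannot instantiate via reflection", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(create("custom").label); // custom
    }
}
```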

Thoughts? Any other alternatives?



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-03-09 Thread Bruno P. Kinoshita (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392865#comment-16392865
 ] 

Bruno P. Kinoshita commented on JENA-1488:
--

Updated my current branch, removing the assembler changes and the analyzer. Now 
it actually holds one single file, 
[/org/apache/jena/query/text/filter/SelectiveFoldingFilter.java|https://github.com/apache/jena/compare/apache:46e2f56...kinow:d90ffa0]

I have not yet added tests, squashed commits, or removed the main method, as 
the code may still need some further massaging.

The output of the main method now would be:
{noformat}
TERM = Senora
TERM = Siobhan
TERM = look
TERM = at
TERM = that
TERM = façade
{noformat}
So _façade_ keeps the cedilla, as it was white-listed. If the letter 'ñ' were 
added to the white-list, then the first term found would actually be _Señora_. 
After checking the white-list, the code delegates the folding to a method from 
the existing ASCIIFoldingFilter.
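
The white-list check can be sketched without Lucene as follows. This is not the code in the branch: the class name is hypothetical, and {{java.text.Normalizer}} is swapped in for the ASCIIFoldingFilter method, so it only strips combining marks and does not fold letters such as 'ø' or 'ß' the way Lucene does.

```java
import java.text.Normalizer;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: keeps white-listed characters and folds the rest.
final class SelectiveFoldSketch {
    static String fold(String term, String whiteList) {
        Set<Integer> keep = whiteList.codePoints().boxed().collect(Collectors.toSet());
        StringBuilder out = new StringBuilder(term.length());
        term.codePoints().forEach(cp -> {
            if (keep.contains(cp)) {
                // White-listed: copy the character through unchanged.
                out.appendCodePoint(cp);
            } else {
                // Stand-in for ASCIIFoldingFilter: decompose, then drop the
                // combining marks (accents, cedillas, tildes).
                String decomposed = Normalizer.normalize(
                        new String(Character.toChars(cp)), Normalizer.Form.NFD);
                out.append(decomposed.replaceAll("\\p{M}", ""));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String cedilla = "\u00e7"; // c with cedilla
        System.out.println(fold("Se\u00f1ora", cedilla)); // Senora
        System.out.println(fold("fa\u00e7ade", cedilla)); // cedilla is kept
    }
}
```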

Now I just need to find a way to rig it up together with the Jena text 
analyzers. I liked [~code-ferret]'s proposal, though I am not entirely sure 
where/how to update the ConfigurableAnalyzer. I tried using it, and noticed I 
couldn't pass the white-list when creating an analyzer/filter.



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-02-15 Thread Code Ferret (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366120#comment-16366120
 ] 

Code Ferret commented on JENA-1488:
---

No problem. An extension feature can be useful sometimes even to provide access 
to built-in components that have arguments not accounted for or that are 
present in Lucene but not poked through to the assembler.



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-02-15 Thread Osma Suominen (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365922#comment-16365922
 ] 

Osma Suominen commented on JENA-1488:
-

[~code-ferret] I wouldn't oppose adding such a facility, but in the case of 
this particular issue/feature, I would prefer adding yet another extra filter 
to the jena-text codebase instead of making it a separate module that has to be 
maintained somewhere, built and added to the classpath every time Fuseki is 
deployed on a server. Of course my reasons are very selfish, but I would prefer 
avoiding the hassle with a separate module and just use a built-in feature.



[jira] [Commented] (JENA-1488) SelectiveFoldingFilter for jena-text

2018-02-13 Thread Code Ferret (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362953#comment-16362953
 ] 

Code Ferret commented on JENA-1488:
---

Perhaps adding a new filter, especially one that has configurable arguments 
such as the {{excludeChars}}, is an opportunity to add extensions for defined 
filters and defined tokenizers. I've looked at {{ConfigurableAnalyzer}} and its 
assembler and it should be straightforward.

I would add tokenizer and filter definitions to {{TextIndexLucene}} similar to 
the support for adding analyzers:
{code:java}
text:defineFilters (
    [ text:defineFilter <#foo> ;
      text:filter [
          a text:GenericFilter ;
          text:class "fi.finto.FoldingFilter" ;
          text:params (
              [ text:paramName "excludeChars" ;
                text:paramType text:TypeString ;
                text:paramValue "whatevercharstoexclude" ]
          )
      ] ;
    ]
)
{code}
{{GenericFilterAssembler}} and {{GenericTokenizerAssembler}} would make use of 
much of the code in {{GenericAnalyzerAssembler}}. The changes to 
{{ConfigurableAnalyzer}} and {{ConfigurableAnalyzerAssembler}} are 
straightforward and mostly involve retaining the resource URI rather than 
extracting the localName.

Such an addition would make it easy to create new tokenizers and filters that 
could be dropped in by just adding the classes onto the jena/fuseki classpath 
and putting the appropriate assembler bits in the configuration.
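
The drop-in mechanism amounts to reflective construction from a class name plus typed parameters. A minimal sketch, assuming nothing about the actual assembler code: {{GenericInstantiator}} and {{Param}} are hypothetical names, and {{java.lang.StringBuilder}} stands in for a filter class found on the classpath.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of building an object from a class name and typed params.
final class GenericInstantiator {
    // One (type, value) pair, in the spirit of text:paramType / text:paramValue.
    static final class Param {
        final Class<?> type;
        final Object value;
        Param(Class<?> type, Object value) { this.type = type; this.value = value; }
    }

    static Object instantiate(String className, List<Param> params) {
        Class<?>[] types = new Class<?>[params.size()];
        Object[] values = new Object[params.size()];
        for (int i = 0; i < params.size(); i++) {
            types[i] = params.get(i).type;
            values[i] = params.get(i).value;
        }
        try {
            // Look up the public constructor whose signature matches the declared types.
            Constructor<?> ctor = Class.forName(className).getConstructor(types);
            return ctor.newInstance(values);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Object sb = instantiate("java.lang.StringBuilder",
                Arrays.asList(new Param(String.class, "abc")));
        System.out.println(sb); // abc
    }
}
```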

If there is interest, I should be able to implement this in a PR rather quickly.
