[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-07 Thread Bruno P. Kinoshita (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505673#comment-16505673
 ] 

Bruno P. Kinoshita commented on JENA-1556:
--

>Our Chinese analyzer does exactly this with pinyin

Excellent! Let me know if there's a PR or branch somewhere and I will try to 
test with Japanese.

> text:query multilingual enhancements
> 
>
> Key: JENA-1556
> URL: https://issues.apache.org/jira/browse/JENA-1556
> Project: Apache Jena
>  Issue Type: New Feature
>  Components: Text
>Affects Versions: Jena 3.7.0
>Reporter: Code Ferret
>Assignee: Code Ferret
>Priority: Major
>  Labels: pull-request-available
>
> This issue proposes two related enhancements of Jena Text. These enhancements 
> have been implemented and a PR can be issued. 
> There are two multilingual search situations that we want to support:
>  # We want to be able to search in one encoding and retrieve results that may 
> have been entered in other encodings. For example, searching via Simplified 
> Chinese (Hans) and retrieving results that may have been entered in 
> Traditional Chinese (Hant) or Pinyin. This will simplify applications by 
> permitting encoding independent retrieval without additional layers of 
> transcoding and so on. It's all done under the covers in Lucene.
>  # We want to search with queries entered in a lossy (e.g., phonetic) 
> encoding and retrieve results entered with an accurate encoding. For example, 
> searching via Pinyin without diacritics and retrieving all possible Hans and 
> Hant triples.
> The first situation arises when entering triples that include languages with 
> multiple encodings that for various reasons are not normalized to a single 
> encoding. In this situation we want to be able to retrieve appropriate result 
> sets without regard for the encodings used at the time that the triples were 
> inserted into the dataset.
> There are several such languages of interest in our application: Chinese, 
> Tibetan, Sanskrit, Japanese and Korean. There are various Romanizations and 
> ideographic variants.
> Encodings may not be normalized when inserting triples for a variety of reasons. 
> A principal one is that the {{rdf:langString}} object often must be entered 
> in the same encoding that it occurs in some physical text that is being 
> catalogued. Another is that metadata may be imported from sources that use 
> different encoding conventions and we want to preserve that form.
> The second situation arises as we want to provide simple support for phonetic 
> or other forms of lossy search at the time that triples are indexed directly 
> in the Lucene system.
> To handle the first situation we introduce a {{text}} assembler predicate, 
> {{text:searchFor}}, that specifies the list of language variants that should 
> be searched whenever a query string with a given language tag is used. For 
> example, the following {{text:TextIndexLucene/text:defineAnalyzers}} fragment:
> {code:java}
> [ text:addLang "bo" ; 
>   text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>   text:analyzer [ 
> a text:GenericAnalyzer ;
> text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
> text:params (
> [ text:paramName "segmentInWords" ;
>   text:paramValue false ]
> [ text:paramName "lemmatize" ;
>   text:paramValue true ]
> [ text:paramName "filterChars" ;
>   text:paramValue false ]
> [ text:paramName "inputMode" ;
>   text:paramValue "unicode" ]
> [ text:paramName "stopFilename" ;
>   text:paramValue "" ]
> )
> ] ; 
>   ]
> {code}
> indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the 
> Lucene index should also be searched for matches tagged as {{bo-x-ewts}} and 
> {{bo-alalc97}}.
> This is made possible by a Tibetan {{Analyzer}} that tokenizes strings in all 
> three encodings into Tibetan Unicode. This is feasible since the 
> {{bo-x-ewts}} and {{bo-alalc97}} encodings are one-to-one with Unicode 
> Tibetan. Since all fields with these language tags will have a common set of 
> indexed terms, i.e., Tibetan Unicode, it suffices to arrange for the query 
> analyzer to have access to the language tag for the query string along with 
> the various fields that need to be considered.
> Supposing that the query is:
> {code:java}
> (?s ?sc ?lit) text:query ("rje"@bo-x-ewts) 
> {code}
> Then the query formed in {{TextIndexLucene}} will be:
> {code:java}
> label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
> {code}
> which is translated using a suitable 
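That expansion step can be sketched as follows (a hypothetical helper, not the actual TextIndexLucene code; the searchFor table and the `label_<tag>` field-name scheme are taken from the fragment and query above):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the multi-field expansion described above: the language tag of
// the query string selects its text:searchFor list, and one Lucene field
// term ("label_<tag>:<term>") is emitted per variant. Class and method
// names are illustrative, not Jena's.
public class SearchForExpansion {

    // Variants to search for each language tag, as configured by text:searchFor.
    static final Map<String, List<String>> SEARCH_FOR = Map.of(
            "bo", List.of("bo", "bo-x-ewts", "bo-alalc97"),
            "bo-x-ewts", List.of("bo", "bo-x-ewts", "bo-alalc97"),
            "bo-alalc97", List.of("bo", "bo-x-ewts", "bo-alalc97"));

    static String expand(String term, String langTag) {
        List<String> variants = SEARCH_FOR.getOrDefault(langTag, List.of(langTag));
        return variants.stream()
                .map(tag -> "label_" + tag + ":" + term)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(expand("rje", "bo-x-ewts"));
        // label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
    }
}
```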

[jira] [Commented] (JENA-1558) Ensure that shading the Guava dependency does not transitively include Guava.

2018-06-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505338#comment-16505338
 ] 

ASF GitHub Bot commented on JENA-1558:
--

Github user afs commented on the issue:

https://github.com/apache/jena/pull/428
  
 JENA-1558 / PR #429.

`true` seems to work - I've added the exclusions as well so that 
`dependency:tree` run in `jena-shared-guava` looks right.  The test suite 
passes. If it looks decent I can push it into the development code to make it 
easier to try out for OSGi.

Putting all the OSGi changes in one PR is fine by me - whatever is 
easier for you.


> Ensure that shading the Guava dependency does not transitively include Guava.
> -
>
> Key: JENA-1558
> URL: https://issues.apache.org/jira/browse/JENA-1558
> Project: Apache Jena
>  Issue Type: Task
>Reporter: Andy Seaborne
>Assignee: Andy Seaborne
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] jena issue #428: Update OSGi imports

2018-06-07 Thread afs
Github user afs commented on the issue:

https://github.com/apache/jena/pull/428
  
 JENA-1558 / PR #429.

`true` seems to work - I've added the exclusions as well so that 
`dependency:tree` run in `jena-shared-guava` looks right.  The test suite 
passes. If it looks decent I can push it into the development code to make it 
easier to try out for OSGi.

Putting all the OSGi changes in one PR is fine by me - whatever is 
easier for you.
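For reference, the change being described is presumably along these lines in jena-shaded-guava's pom (a sketch; the version is taken from the dependency:tree output elsewhere in the thread):

```xml
<!-- Sketch: guava is shaded into jena-shaded-guava, so it is marked
     optional to stop consumers picking it up transitively. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>24.1-jre</version>
  <optional>true</optional>
</dependency>
```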


---


[GitHub] jena pull request #429: JENA-1558: Make Guava optional and add exclusions

2018-06-07 Thread afs
GitHub user afs opened a pull request:

https://github.com/apache/jena/pull/429

JENA-1558: Make Guava optional and add exclusions



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/afs/jena guava

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/jena/pull/429.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #429


commit 2459b077714cc36b7dec780bb1907f2af0ba73a7
Author: Andy Seaborne 
Date:   2018-06-07T20:42:19Z

JENA-1558: Make Guava optional and add exclusions




---


[jira] [Assigned] (JENA-1558) Ensure that shading the Guava dependency does not transitively include Guava.

2018-06-07 Thread Andy Seaborne (JIRA)


 [ 
https://issues.apache.org/jira/browse/JENA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Seaborne reassigned JENA-1558:
---

Assignee: Andy Seaborne







[jira] [Created] (JENA-1558) Ensure that shading the Guava dependency does not transitively include Guava.

2018-06-07 Thread Andy Seaborne (JIRA)
Andy Seaborne created JENA-1558:
---

 Summary: Ensure that shading the Guava dependency does not 
transitively include Guava.
 Key: JENA-1558
 URL: https://issues.apache.org/jira/browse/JENA-1558
 Project: Apache Jena
  Issue Type: Task
Reporter: Andy Seaborne








[GitHub] jena issue #428: Update OSGi imports

2018-06-07 Thread acoburn
Github user acoburn commented on the issue:

https://github.com/apache/jena/pull/428
  
Explicitly marking the dependency as optional might work.

If that doesn't work, the import declaration in `jena-osgi` for 
`com.google.errorprone.annotations` is in the same category as 
`org.checkerframework.checker`: they are both related to the transitive 
dependencies in Guava. Marking them both as optional would work 
(`resolution:=optional`), as would excluding them entirely 
(`!org.checkerframework...`). It would make sense to treat them in the same way.

You are correct, the `xerces` feature can be entirely removed. Once we 
figure out the best approach here, I can make that change as part of this PR -- 
unless you'd like it to be handled separately.
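The two treatments mentioned above look roughly like this in a maven-bundle-plugin/bnd Import-Package instruction (illustrative only; the actual package patterns in jena-osgi may differ):

```xml
<Import-Package>
    <!-- either mark the Guava-related annotation packages optional... -->
    org.checkerframework.checker.*;resolution:=optional,
    com.google.errorprone.annotations.*;resolution:=optional,
    <!-- ...or exclude them entirely with a "!" pattern instead, e.g.
         !org.checkerframework.*, -->
    *
</Import-Package>
```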


---


[GitHub] jena issue #428: Update OSGi imports

2018-06-07 Thread afs
Github user afs commented on the issue:

https://github.com/apache/jena/pull/428
  
We can exclude for jena shading - it makes the `dependency:tree` cleaner 
but (I have tried) does not affect the shaded jar because the jar is filtered 
by an include.

We could try making the dependency "optional" so it is not picked up 
transitively (I have not tried yet). Would that help jena-osgi, if it works?  
Jena and jsonld-java do not have a real dependency on the guava jar.

We can add the same excludes to the dependency in jena-arq that brings in 
jsonld-java and PR that and/or the  upstream when we have it all 
proven.

For `jena-osgi`, do we need ``, or can it be removed?



---


[GitHub] jena issue #428: Update OSGi imports

2018-06-07 Thread acoburn
Github user acoburn commented on the issue:

https://github.com/apache/jena/pull/428
  
I am definitely no expert on shaded jars. The above comment was based on 
running `mvn dependency:tree` and seeing these compile-scoped dependencies:

```
[INFO] +- org.apache.jena:jena-shaded-guava:jar:3.8.0-SNAPSHOT:compile
[INFO] |  \- com.google.guava:guava:jar:24.1-jre:compile
[INFO] | +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
[INFO] | +- org.checkerframework:checker-compat-qual:jar:2.0.0:compile
[INFO] | +- 
com.google.errorprone:error_prone_annotations:jar:2.1.3:compile
[INFO] | +- com.google.j2objc:j2objc-annotations:jar:1.1:compile
[INFO] | \- 
org.codehaus.mojo:animal-sniffer-annotations:jar:1.14:compile
```

Also, when trying to provision `jena-osgi` in Karaf, those dependent 
modules are needed unless they are explicitly excluded (as in this PR).

As for the jsonld dependency, the OSGi manifest headers include:

```
[IMPEXP]
Import-Package
  ...
  com.google.common.cache{version=[24.1,25)}
  com.google.common.collect  {version=[24.1,25)}
```

so perhaps that is something to bring up with the json-ld project.


---


[GitHub] jena issue #428: Update OSGi imports

2018-06-07 Thread afs
Github user afs commented on the issue:

https://github.com/apache/jena/pull/428
  
We can, and for clarity should, exclude them but when I look in the jena 
shaded jar, I don't see them. There is a filter-includes in Jena to pick only 
`com.google.common` and `com.google.thirdparty`. Maybe we don't even need 
"thirdparty".

In the jsonld-java 0.12.0 jar, I only see 
`com.github.jsonldjava.shaded.com.google.common`.



---


[GitHub] jena issue #428: Update OSGi imports

2018-06-07 Thread acoburn
Github user acoburn commented on the issue:

https://github.com/apache/jena/pull/428
  
I believe the Guava dependencies are brought in as part of the shading 
process. Those dependencies are not typically required for Guava itself (they 
are only test-scoped dependencies). So perhaps there is something that can be 
done in the `jena-shaded-guava` project to exclude those at that point.
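That could look something like the following on the guava dependency in jena-shaded-guava (a sketch; the artifact list is taken from the dependency:tree output above):

```xml
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>24.1-jre</version>
  <exclusions>
    <exclusion>
      <groupId>com.google.code.findbugs</groupId>
      <artifactId>jsr305</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.checkerframework</groupId>
      <artifactId>checker-compat-qual</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.errorprone</groupId>
      <artifactId>error_prone_annotations</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.j2objc</groupId>
      <artifactId>j2objc-annotations</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>animal-sniffer-annotations</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```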


---


[GitHub] jena issue #428: Update OSGi imports

2018-06-07 Thread afs
Github user afs commented on the issue:

https://github.com/apache/jena/pull/428
  
We ought to exclude all the tools that Guava adds as its own dependencies 
(it would be nice if they were marked optional).


---


[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-07 Thread Code Ferret (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504686#comment-16504686
 ] 

Code Ferret commented on JENA-1556:
---

[~kinow] Ah, I sort of thought that you were referring to the filter work from 
a few months ago but wasn't sure.
{quote}From what I understood from your initial description, I could index 
Japanese terms, and find a way to search with roman letters in UTF-8, and match 
documents that originally had Japanese in some different encoding?
{quote}
Our Chinese analyzer does exactly this with pinyin. The incoming hanzi-encoded 
string is transcoded, tokenized to zh-latn-pinyin, and indexed, so that when 
searches are performed with pinyin-encoded query strings the original hanzi 
fields will be retrieved. Of course, in typical usage, since the pinyin is 
one-to-many there will be many homophones in the result set.
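The lossy, diacritic-free side of this can be sketched with a simple fold (this is just Unicode decomposition plus mark-stripping, not the actual analyzer's hanzi-to-pinyin transcoding; names are illustrative):

```java
import java.text.Normalizer;

// Sketch of diacritic folding for pinyin: decompose to NFD, then drop
// combining marks, so "zàngwén" and "zangwen" produce the same indexed
// terms. A real analyzer would apply this as a Lucene TokenFilter per token.
public class PinyinFold {

    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("zàngwén")); // zangwen
    }
}
```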


[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-07 Thread Bruno P. Kinoshita (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504376#comment-16504376
 ] 

Bruno P. Kinoshita commented on JENA-1556:
--

[~code-ferret] I was thinking of a selective folding filter, built on top of the 
pull request you submitted for defineAnalyzers 
(http://jena.staging.apache.org/documentation/query/text-query.html). I can 
just replace some symbols/chars during index & search to work around Maori.

You are absolutely right about Katakana, Hiragana, and also Kanji. But the 
problem is when users query for terms in roman alphabet. Instead of writing in 
Japanese, it's common for some systems to support the roman letters (romaji). 
Jisho is a good example, though contrived as it's a dictionary.

The second result for "kino" written in romaji matches the word for yesterday: 
https://jisho.org/search/kino

Searching for the Hepburn romaji (common in documents and websites) doesn't 
(it matches only an example): https://jisho.org/search/kin%C5%8D

And, obviously, searching for the kanji/hiragana will return the expected 
results: https://jisho.org/search/%E6%98%A8%E6%97%A5

From what I understood from your initial description, I could index Japanese 
terms, and find a way to search with roman letters in UTF-8, and match 
documents that originally had Japanese in some different encoding?
