[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505673#comment-16505673 ] Bruno P. Kinoshita commented on JENA-1556: -- >Our Chinese analyzer does exactly this with pinyin Excellent! Let me know if there's a PR or branch somewhere and I will try to test with Japanese. > text:query multilingual enhancements > > > Key: JENA-1556 > URL: https://issues.apache.org/jira/browse/JENA-1556 > Project: Apache Jena > Issue Type: New Feature > Components: Text >Affects Versions: Jena 3.7.0 >Reporter: Code Ferret >Assignee: Code Ferret >Priority: Major > Labels: pull-request-available > > This issue proposes two related enhancements of Jena Text. These enhancements > have been implemented and a PR can be issued. > There are two multilingual search situations that we want to support: > # We want to be able to search in one encoding and retrieve results that may > have been entered in other encodings. For example, searching via Simplified > Chinese (Hans) and retrieving results that may have been entered in > Traditional Chinese (Hant) or Pinyin. This will simplify applications by > permitting encoding-independent retrieval without additional layers of > transcoding and so on. It's all done under the covers in Lucene. > # We want to search with queries entered in a lossy, e.g., phonetic, > encoding and retrieve results entered with accurate encoding. For example, > searching via Pinyin without diacritics and retrieving all possible Hans and > Hant triples. > The first situation arises when entering triples that include languages with > multiple encodings that for various reasons are not normalized to a single > encoding. In this situation we want to be able to retrieve appropriate result > sets without regard for the encodings used at the time that the triples were > inserted into the dataset. 
> There are several such languages of interest in our application: Chinese, > Tibetan, Sanskrit, Japanese and Korean. There are various Romanizations and > ideographic variants. > Encodings may not be normalized when inserting triples for a variety of reasons. > A principal one is that the {{rdf:langString}} object often must be entered > in the same encoding in which it occurs in some physical text that is being > catalogued. Another is that metadata may be imported from sources that use > different encoding conventions and we want to preserve that form. > The second situation arises as we want to provide simple support for phonetic > or other forms of lossy search at the time that triples are indexed directly > in the Lucene system. > To handle the first situation we introduce a {{text}} assembler predicate, > {{text:searchFor}}, that specifies a list of language tags identifying the > language variants that should be searched whenever a query string of > a given encoding (language tag) is used. For example, the following > {{text:TextIndexLucene/text:defineAnalyzers}} fragment: > {code:java} > [ text:addLang "bo" ; > text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ; > text:analyzer [ > a text:GenericAnalyzer ; > text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ; > text:params ( > [ text:paramName "segmentInWords" ; > text:paramValue false ] > [ text:paramName "lemmatize" ; > text:paramValue true ] > [ text:paramName "filterChars" ; > text:paramValue false ] > [ text:paramName "inputMode" ; > text:paramValue "unicode" ] > [ text:paramName "stopFilename" ; > text:paramValue "" ] > ) > ] ; > ] > {code} > indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the > Lucene index should also be searched for matches tagged as {{bo-x-ewts}} and > {{bo-alalc97}}. > This is made possible by a Tibetan {{Analyzer}} that tokenizes strings in all > three encodings into Tibetan Unicode. 
This is feasible since the > {{bo-x-ewts}} and {{bo-alalc97}} encodings are one-to-one with Unicode > Tibetan. Since all fields with these language tags will have a common set of > indexed terms, i.e., Tibetan Unicode, it suffices to arrange for the query > analyzer to have access to the language tag for the query string along with > the various fields that need to be considered. > Supposing that the query is: > {code:java} > (?s ?sc ?lit) text:query ("rje"@bo-x-ewts) > {code} > Then the query formed in {{TextIndexLucene}} will be: > {code:java} > label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje > {code} > which is translated using a suitable
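The query expansion described above can be sketched as a small helper: given a query term and the list of language tags from {{text:searchFor}}, build a query string covering every corresponding field. The class and method names and the `label_` field prefix here are illustrative, following the example fields above; this is not the actual {{TextIndexLucene}} implementation.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of text:searchFor query expansion: one Lucene field per language
// tag, all joined into a single (implicitly OR'd) query string.
public class MultilingualQueryExpansion {
    static String expand(String term, List<String> searchForTags) {
        return searchForTags.stream()
                .map(tag -> "label_" + tag + ":" + term)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // "rje"@bo-x-ewts with text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" )
        System.out.println(expand("rje", List.of("bo", "bo-x-ewts", "bo-alalc97")));
        // prints: label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
    }
}
```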
[jira] [Commented] (JENA-1558) Ensure that shading the Guava dependency does not transitively include Guava.
[ https://issues.apache.org/jira/browse/JENA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505338#comment-16505338 ] ASF GitHub Bot commented on JENA-1558: -- Github user afs commented on the issue: https://github.com/apache/jena/pull/428 JENA-1558 / PR #429. `true` seems to work - I've added the exclusions as well so that `dependency:tree` run in `jena-shaded-guava` looks right. The test suite passes. If it looks decent I can push it into the development code to make it easier to try out for OSGi. Putting all the OSGi changes in one PR is fine by me - whatever is easier for you. > Ensure that shading the Guava dependency does not transitively include Guava. > - > > Key: JENA-1558 > URL: https://issues.apache.org/jira/browse/JENA-1558 > Project: Apache Jena > Issue Type: Task >Reporter: Andy Seaborne >Assignee: Andy Seaborne >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] jena pull request #429: JENA-1558: Make Guava optional and add exclusions
GitHub user afs opened a pull request: https://github.com/apache/jena/pull/429 JENA-1558: Make Guava optional and add exclusions You can merge this pull request into a Git repository by running: $ git pull https://github.com/afs/jena guava Alternatively you can review and apply these changes as the patch at: https://github.com/apache/jena/pull/429.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #429 commit 2459b077714cc36b7dec780bb1907f2af0ba73a7 Author: Andy Seaborne Date: 2018-06-07T20:42:19Z JENA-1558: Make Guava optional and add exclusions ---
[jira] [Assigned] (JENA-1558) Ensure that shading the Guava dependency does not transitively include Guava.
[ https://issues.apache.org/jira/browse/JENA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Seaborne reassigned JENA-1558: --- Assignee: Andy Seaborne
[jira] [Created] (JENA-1558) Ensure that shading the Guava dependency does not transitively include Guava.
Andy Seaborne created JENA-1558: --- Summary: Ensure that shading the Guava dependency does not transitively include Guava. Key: JENA-1558 URL: https://issues.apache.org/jira/browse/JENA-1558 Project: Apache Jena Issue Type: Task Reporter: Andy Seaborne
[GitHub] jena issue #428: Update OSGi imports
Github user acoburn commented on the issue: https://github.com/apache/jena/pull/428 Explicitly marking the dependency as optional might work. If that doesn't work, the import declaration in `jena-osgi` for `com.google.errorprone.annotations` is in the same category as `org.checkerframework.checker`: they are both related to the transitive dependencies in Guava. Marking them both as optional would work (`resolution:=optional`), as would excluding them entirely (`!org.checkerframework...`). It would make sense to treat them in the same way. You are correct, the `xerces` feature can be entirely removed. Once we figure out the best approach here, I can make that change as part of this PR -- unless you'd like it to be handled separately. ---
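The `resolution:=optional` approach mentioned above could be expressed roughly as follows in a maven-bundle-plugin `Import-Package` instruction. This is a sketch only: the package patterns are taken from the dependency list discussed in this thread, and the exact instructions in the jena-osgi build may differ.

```xml
<!-- maven-bundle-plugin instructions (sketch): mark Guava's
     annotation-only transitive packages as optional imports -->
<Import-Package>
  com.google.errorprone.annotations;resolution:=optional,
  org.checkerframework.*;resolution:=optional,
  *
</Import-Package>
```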
[GitHub] jena issue #428: Update OSGi imports
Github user afs commented on the issue: https://github.com/apache/jena/pull/428 We can exclude for jena shading - it makes the `dependency:tree` cleaner but (I have tried) does not affect the shaded jar because the jar is filtered by an include. We could try making the dependency "optional" so it is not picked up transitively (I have not tried yet). Would that help jena-osgi, if it works? Jena and jsonld-java do not have a real dependency on the guava jar. We can add the same excludes to the dependency in jena-arq that brings in jsonld-java, and PR that and/or the upstream project once we have it all proven. For `jena-osgi`, do we need the `xerces` feature, or can it be removed? ---
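The "optional plus exclusions" idea could look roughly like this in the jena-shaded-guava POM. A sketch only: the exclusion list is inferred from the dependency tree quoted in this thread, not copied from the actual PR.

```xml
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>24.1-jre</version>
  <!-- optional: not picked up transitively by consumers of the shaded jar -->
  <optional>true</optional>
  <exclusions>
    <!-- keep Guava's annotation-only dependencies out of dependency:tree -->
    <exclusion>
      <groupId>com.google.code.findbugs</groupId>
      <artifactId>jsr305</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.checkerframework</groupId>
      <artifactId>checker-compat-qual</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.errorprone</groupId>
      <artifactId>error_prone_annotations</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.google.j2objc</groupId>
      <artifactId>j2objc-annotations</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>animal-sniffer-annotations</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```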
[GitHub] jena issue #428: Update OSGi imports
Github user acoburn commented on the issue: https://github.com/apache/jena/pull/428 I am definitely no expert on shaded jars. The above comment was based on running `mvn dependency:tree` and seeing these compile-scoped dependencies:
```
[INFO] +- org.apache.jena:jena-shaded-guava:jar:3.8.0-SNAPSHOT:compile
[INFO] |  \- com.google.guava:guava:jar:24.1-jre:compile
[INFO] |     +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
[INFO] |     +- org.checkerframework:checker-compat-qual:jar:2.0.0:compile
[INFO] |     +- com.google.errorprone:error_prone_annotations:jar:2.1.3:compile
[INFO] |     +- com.google.j2objc:j2objc-annotations:jar:1.1:compile
[INFO] |     \- org.codehaus.mojo:animal-sniffer-annotations:jar:1.14:compile
```
Also, when trying to provision `jena-osgi` in Karaf, those dependent modules are needed unless they are explicitly excluded (as in this PR). As for the jsonld dependency, the OSGi manifest headers include:
```
[IMPEXP] Import-Package
  ...
  com.google.common.cache {version=[24.1,25)}
  com.google.common.collect {version=[24.1,25)}
```
so perhaps that is something to bring up with the json-ld project. ---
[GitHub] jena issue #428: Update OSGi imports
Github user afs commented on the issue: https://github.com/apache/jena/pull/428 We can, and for clarity should, exclude them but when I look in the jena shaded jar, I don't see them. There is a filter-includes in Jena to pick only `com.google.common` and `com.google.thirdparty`. Maybe we don't even need "thirdparty". In the jsonld-java 0.12.0 jar, I only see `com.github.jsonldjava.shaded.com.google.common`. ---
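The filter-includes mentioned here is the maven-shade-plugin filter mechanism. A sketch of such a filter, keeping only the two package trees named in the comment (the artifact coordinates and patterns are illustrative, not copied from the Jena build):

```xml
<!-- maven-shade-plugin filter (sketch): keep only com.google.common
     and com.google.thirdparty from the guava artifact -->
<filters>
  <filter>
    <artifact>com.google.guava:guava</artifact>
    <includes>
      <include>com/google/common/**</include>
      <include>com/google/thirdparty/**</include>
    </includes>
  </filter>
</filters>
```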
[GitHub] jena issue #428: Update OSGi imports
Github user acoburn commented on the issue: https://github.com/apache/jena/pull/428 I believe the Guava dependencies are brought in as part of the shading process. Those dependencies are not typically required for Guava itself (they are only test-scoped dependencies). So perhaps there is something that can be done in the `jena-shaded-guava` project to exclude those at that point. ---
[GitHub] jena issue #428: Update OSGi imports
Github user afs commented on the issue: https://github.com/apache/jena/pull/428 We ought to exclude all the tools that Guava adds as its own dependencies (it would be nice if they were marked optional). ---
[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504686#comment-16504686 ] Code Ferret commented on JENA-1556: --- [~kinow] Ah, I sort of thought that you were referring to the filter work from a few months ago but wasn't sure. {quote}From what I understood from your initial description, I could index Japanese terms, and find a way to search with roman letters in UTF-8, and match documents that originally had Japanese in some different encoding? {quote} Our Chinese analyzer does exactly this with pinyin. The incoming hanzi-encoded string is transcoded, tokenized to zh-latn-pinyin, and indexed, so that when searches are performed with pinyin-encoded query strings the original hanzi fields will be retrieved. Of course, since pinyin is one-to-many, in typical usage there will be many homophones in the result set.
[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504376#comment-16504376 ] Bruno P. Kinoshita commented on JENA-1556: -- [~code-ferret] selective folding filter, built on top of the pull request you submitted for defineAnalyzers (http://jena.staging.apache.org/documentation/query/text-query.html). I can just replace some symbols/chars during index & search to work around Maori. You are absolutely right about Katakana, Hiragana, and also Kanji. But the problem is when users query for terms in the roman alphabet. Instead of writing in Japanese, it's common for some systems to support roman letters (romaji). Jisho is a good example, though contrived as it's a dictionary. The second result for "kino" written in romaji matches the word for yesterday: https://jisho.org/search/kino Searching for the Hepburn romaji (common in documents and websites) doesn't (it matches only an example): https://jisho.org/search/kin%C5%8D And, obviously, searching for the kanji/hiragana returns the expected results: https://jisho.org/search/%E6%98%A8%E6%97%A5 From what I understood from your initial description, I could index Japanese terms, and find a way to search with roman letters in UTF-8, and match documents that originally had Japanese in some different encoding?
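The romaji mismatch described in this thread (plain "kino" vs Hepburn "kinō") is the kind of lossy matching a diacritic-folding step addresses. A minimal, self-contained sketch using java.text.Normalizer; this is illustrative only, not the folding filter discussed above:

```java
import java.text.Normalizer;

// Minimal sketch of diacritic folding for lossy (phonetic) matching:
// decompose to NFD, then strip combining marks so that e.g. the
// Hepburn romaji "kinō" matches a query typed as plain "kino".
public class DiacriticFolder {
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("kinō"));  // prints: kino
    }
}
```

Applying the same fold at both index time and query time makes the two spellings land on the same indexed term, at the cost of conflating genuinely distinct words.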