[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518605#comment-16518605 ]

ASF GitHub Bot commented on JENA-1556:
--
Github user xristy commented on the issue: https://github.com/apache/jena/pull/436

Thanks everyone. I seem to have finally merged PR #436. I ended up using @rvesse's steps plus a bit from the commit workflow (which I agree should be linked on the Contributing page):

> git checkout master
> git pull apache master
> git merge Jena-1556-MutilingualEnhancements-3.8.0
> mvn clean install -Pdev
> git push apache master

I put the proper defns in jena/.git/config for the remotes.

> text:query multilingual enhancements
>
> Key: JENA-1556
> URL: https://issues.apache.org/jira/browse/JENA-1556
> Project: Apache Jena
> Issue Type: New Feature
> Components: Text
> Affects Versions: Jena 3.7.0
> Reporter: Code Ferret
> Assignee: Code Ferret
> Priority: Major
> Labels: pull-request-available
>
> This issue proposes two related enhancements of Jena Text. These enhancements have been implemented and a PR can be issued.
>
> There are two multilingual search situations that we want to support:
> # We want to be able to search in one encoding and retrieve results that may have been entered in other encodings. For example, searching via Simplified Chinese (Hans) and retrieving results that may have been entered in Traditional Chinese (Hant) or Pinyin. This will simplify applications by permitting encoding-independent retrieval without additional layers of transcoding and so on. It's all done under the covers in Lucene.
> # We want to search with queries entered in a lossy, e.g., phonetic, encoding and retrieve results entered with accurate encoding. For example, searching via Pinyin without diacritics and retrieving all possible Hans and Hant triples.
>
> The first situation arises when entering triples that include languages with multiple encodings that for various reasons are not normalized to a single encoding. In this situation we want to be able to retrieve appropriate result sets without regard for the encodings used at the time that the triples were inserted into the dataset.
> There are several such languages of interest in our application: Chinese, Tibetan, Sanskrit, Japanese and Korean. There are various Romanizations and ideographic variants.
> Encodings may not be normalized when inserting triples for a variety of reasons. A principal one is that the {{rdf:langString}} object often must be entered in the same encoding in which it occurs in some physical text that is being catalogued. Another is that metadata may be imported from sources that use different encoding conventions and we want to preserve that form.
> The second situation arises as we want to provide simple support for phonetic or other forms of lossy search at the time that triples are indexed directly in the Lucene system.
> To handle the first situation we introduce a {{text}} assembler predicate, {{text:searchFor}}, that specifies a list of language tags giving the language variants that should be searched whenever a query string of a given encoding (language tag) is used. For example, the following {{text:TextIndexLucene/text:defineAnalyzers}} fragment:
> {code:java}
> [ text:addLang "bo" ;
>   text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>   text:analyzer [
>     a text:GenericAnalyzer ;
>     text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>     text:params (
>       [ text:paramName "segmentInWords" ;
>         text:paramValue false ]
>       [ text:paramName "lemmatize" ;
>         text:paramValue true ]
>       [ text:paramName "filterChars" ;
>         text:paramValue false ]
>       [ text:paramName "inputMode" ;
>         text:paramValue "unicode" ]
>       [ text:paramName "stopFilename" ;
>         text:paramValue "" ]
>     )
>   ] ;
> ]
> {code}
> indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the Lucene index should also be searched for matches tagged as {{bo-x-ewts}} and {{bo-alalc97}}.
> This is made possible by a Tibetan {{Analyzer}} that tokenizes strings in all three encodings into Tibetan Unicode. This is feasible since the {{bo-x-ewts}} and {{bo-alalc97}} encodings are one-to-one with Unicode Tibetan. Since all fields with these language tags will have a common set of indexed terms, i.e., Tibetan Unicode, it suffices to arrange for the query analyzer to have access to the language tag for the query string along with the various fields that need to be considered.
> Supposing that the query is:
> {code:java}
> (?s ?sc ?lit) text:query ("rje"@bo-x-ewts)
> {code}
> Then the query formed in {{TextIndexLucene}} will be:
> {code:java}
> label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
> {code}
> which is translated using a suitable {{Analyzer}}, {{QueryMultilingualAnalyzer}}, via Lucene's {{QueryParser}}.
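The multi-field expansion the issue describes (a query tagged @bo-x-ewts also searching the bo and bo-alalc97 fields) can be sketched as follows; the function and map names here are illustrative, not the actual TextIndexLucene code:

```python
# Sketch of the text:searchFor expansion: given a query language tag,
# search all configured variant-tagged fields. Names are illustrative.

# Assembler-configured mapping: language tag -> variants to search
# (mirrors text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) above)
SEARCH_FOR = {
    "bo":         ["bo", "bo-x-ewts", "bo-alalc97"],
    "bo-x-ewts":  ["bo", "bo-x-ewts", "bo-alalc97"],
    "bo-alalc97": ["bo", "bo-x-ewts", "bo-alalc97"],
}

def expand_query(field: str, term: str, lang: str) -> str:
    """Build a Lucene query string over all variant-tagged fields."""
    variants = SEARCH_FOR.get(lang, [lang])  # fall back to the tag itself
    return " ".join(f"{field}_{tag}:{term}" for tag in variants)

print(expand_query("label", "rje", "bo-x-ewts"))
# label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
```

This only reproduces the query shape given in the issue; the real work is done by the analyzer, which must tokenize all three encodings to a common term set (Tibetan Unicode) so the expanded fields actually share indexed terms.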
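The remote definitions mentioned above ("the proper defns in jena/.git/config") presumably resemble the following fragment. This is an assumption sketched from the thread: the github URL appears later in the discussion, and the ASF repository path under git-wip-us.apache.org is a guess, not confirmed by the source.

```ini
# Hypothetical jena/.git/config remotes; the git-wip-us path is assumed.
[remote "apache"]
	url = https://git-wip-us.apache.org/repos/asf/jena.git
	fetch = +refs/heads/*:refs/remotes/apache/*
[remote "github"]
	url = https://github.com/apache/jena.git
	fetch = +refs/heads/*:refs/remotes/github/*
```

With both remotes defined, `git pull apache master` and `git push apache master` address the writable ASF repo while PR branches are fetched from the read-only github mirror.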
[GitHub] jena pull request #436: JENA-1556 implementation
Github user asfgit closed the pull request at: https://github.com/apache/jena/pull/436 ---
[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518591#comment-16518591 ]

ASF GitHub Bot commented on JENA-1556:
--
Github user asfgit closed the pull request at: https://github.com/apache/jena/pull/436
Re: Commit workflow
As for gitbox, the last result of my investigation was that we won't _currently_ have the same JIRA integrations that we now have, because INFRA hasn't done that yet. (And they would love to have some help doing that!) I left the matter there.

For my money, we might do well to revisit the JIRA integrations and make the switch...

ajs6f

> On Jun 20, 2018, at 9:52 AM, Aaron Coburn wrote:
>
> Thanks Andy, this is very helpful.
>
> Best,
> Aaron
>
>> On Jun 20, 2018, at 6:02 AM, Andy Seaborne wrote:
>>
>> Aaron, Chris, all,
>>
>> One thing to mention - with the master repo on Apache hardware and PRs on the mirror (which we don't have write access to) there is a workflow for merging into the Apache github repo:
>>
>> https://cwiki.apache.org/confluence/display/JENA/Commit+Workflow+for+Github-ASF
>>
>> Andy
>>
>> (gitbox?)
Re: Commit workflow
Thanks Andy, this is very helpful.

Best,
Aaron

> On Jun 20, 2018, at 6:02 AM, Andy Seaborne wrote:
>
> Aaron, Chris, all,
>
> One thing to mention - with the master repo on Apache hardware and PRs on the mirror (which we don't have write access to) there is a workflow for merging into the Apache github repo:
>
> https://cwiki.apache.org/confluence/display/JENA/Commit+Workflow+for+Github-ASF
>
> Andy
>
> (gitbox?)
[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518005#comment-16518005 ]

ASF GitHub Bot commented on JENA-1556:
--
Github user kinow commented on the issue: https://github.com/apache/jena/pull/436

The docs look really good! Only feedback I have:

* It's possible to avoid the empty commit by amending the last commit to add "this closes #123". That's what I normally do when I work on other projects.
* Should we also merge with --no-ff, or is fetch/rebase all right too? I've done it a few times, but happy to do merge --no-ff instead if desirable.
* Perhaps a link on the website for it, under Contributing? Or a new page? It's probably related to http://jena.apache.org/getting_involved/reviewing_contributions.html too? (or maybe there's already a link somewhere?)

Thanks
Commit workflow
Aaron, Chris, all,

One thing to mention - with the master repo on Apache hardware and PRs on the mirror (which we don't have write access to) there is a workflow for merging into the Apache github repo:

https://cwiki.apache.org/confluence/display/JENA/Commit+Workflow+for+Github-ASF

Andy

(gitbox?)
RE: CMS diff: TDB Datasets
Hi Andy,

Thanks for the response. Your suggestion worked and the query completed in a similar time to the union graph approach. I'd tried moving the filter into the graph clause but not swapping the graph order.

I added that update on the documentation so if anyone else was having similar problems it might help. Do you still want me to create a JIRA for it?

More generally, is there a page/section for tips on query writing to help optimisation? I searched but could only find description of TDB's optimisation functionality and extending query execution. I spent quite a while hunting for tips and trying different ways to influence the resolution order until I thought I'd try the union graph.

Thanks,
Greg

-----Original Message-----
From: Andy Seaborne
Sent: 19 June 2018 13:56
To: dev@jena.apache.org; Greg Albiston
Subject: Re: CMS diff: TDB Datasets

Greg,

Could you create a JIRA ticket for this please? It is something that looks addressable. The solution proposed (using the union graph) is a bit specialised.

Andy

The query may be better if written as (but the "..." may be making a difference):

GRAPH dataset:SmallB {
    ?b rdf:type my:BThing.
    ?b my:hasData ?bData.
    FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 2.0"^^my:dataLiteral))
}
GRAPH dataset:BigA {
    ?a rdf:type my:AThing.
    ?a noa:hasGeometry ?aData.
}
FILTER(my:filterFunction1(?bData, ?aData))

On 19/06/18 10:59, Greg Albiston wrote:
> Clone URL (Committers only):
> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Ftdb%2Fdatasets.mdtext
>
> Greg Albiston
>
> Index: trunk/content/documentation/tdb/datasets.mdtext
> ===================================================================
> --- trunk/content/documentation/tdb/datasets.mdtext (revision 1833775)
> +++ trunk/content/documentation/tdb/datasets.mdtext (working copy)
> @@ -51,6 +51,51 @@
>      ...
>      }
>
> +### Named Graphs & Filters
> +
> +Named graphs provide a convenient way to organise and store your data.
> +However, be aware that in certain situations named graphs can make it difficult for the query optimiser.
> +
> +For example, a query with the following structure took 29 minutes to complete:
> +
> +    SELECT ?b ...
> +    WHERE {
> +        GRAPH dataset:BigA {
> +            ?a rdf:type my:AThing.
> +            ?a noa:hasGeometry ?aData.
> +            ...
> +        }
> +
> +        GRAPH dataset:SmallB {
> +            ?b rdf:type my:BThing.
> +            ?b my:hasData ?bData.
> +            ...
> +        }
> +
> +        FILTER(my:filterFunction1(?bData, ?aData))
> +        FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 2.0"^^my:dataLiteral))
> +    }
> +
> +The completion duration was reduced to 7 seconds by applying the global TDB.symUnionDefaultGraph option (see above) to the dataset and modifying the query as follows:
> +
> +    SELECT ?b ...
> +    WHERE {
> +        ?a rdf:type my:AThing.
> +        ?a noa:hasGeometry ?aData.
> +        ...
> +
> +        ?b rdf:type my:BThing.
> +        ?b my:hasData ?bData.
> +        ...
> +
> +        FILTER(my:filterFunction1(?bData, ?aData))
> +        FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 2.0"^^my:dataLiteral))
> +    }
> +
> ## Special Graph Names
>
> URI | Meaning
>
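Besides the global TDB.symUnionDefaultGraph context symbol discussed in the thread, the union default graph can also be switched on per dataset in an assembler description. A minimal sketch, assuming a TDB1 dataset at an illustrative location "DB":

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .

<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "DB" ;
    # Make the default graph the union of all named graphs, so patterns
    # written outside GRAPH blocks (as in the rewritten query) match
    # triples from every named graph.
    tdb:unionDefaultGraph true .
```

With this set, the un-GRAPHed form of the query above behaves as if run over the union of dataset:BigA and dataset:SmallB, which is what allowed the optimiser to pick a better join order here.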
[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517984#comment-16517984 ]

ASF GitHub Bot commented on JENA-1556:
--
Github user afs commented on the issue: https://github.com/apache/jena/pull/436

I have a copy of Jena cloned from git-wip-us.apache.org and added a second remote for github (this is so it can be called as "github", not the full URL "https://github.com/apache/jena.git"). To merge, I have been using that one: merge from github (--no-ff) into local master and push to origin/master. Is there anything we can do to improve https://cwiki.apache.org/confluence/display/JENA/Commit+Workflow+for+Github-ASF?
[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517973#comment-16517973 ]

ASF GitHub Bot commented on JENA-1556:
--
Github user afs commented on the issue: https://github.com/apache/jena/pull/436

Committer workflow: https://cwiki.apache.org/confluence/display/JENA/Commit+Workflow+for+Github-ASF
[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517938#comment-16517938 ]

ASF GitHub Bot commented on JENA-1556:
--
Github user rvesse commented on the issue: https://github.com/apache/jena/pull/436

@xristy I think you can just merge as you would normally, though you probably need to be explicit about the remote and branch you are using for remote operations:

```
> git checkout master
> git pull apache-origin master
> git merge Jena-1556-MutilingualEnhancements-3.8.0
> git push apache-origin master
```