[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517628#comment-16517628 ]

ASF GitHub Bot commented on JENA-1556:
--------------------------------------

Github user xristy commented on the issue:

    https://github.com/apache/jena/pull/436

@kinow Thank you very much for the pointers. I think I'm closer. I added a remote apache-origin (should have called it jena-origin I suppose). Then did:

    git push apache-origin

and got encouraging result:

    Counting objects: 3316, done.
    Delta compression using up to 8 threads.
    Compressing objects: 100% (1017/1017), done.
    Writing objects: 100% (3316/3316), 795.73 KiB | 99.47 MiB/s, done.
    Total 3316 (delta 1505), reused 3218 (delta 1453)
    remote: jena git commit: added auxIndex unit test
    remote: jena git commit: added searchFor unit tests
    remote: jena git commit: various cleanup per @kinow
    remote: jena git commit: cleanup per comments from afs
    remote: jena git commit: JENA-1556 implementation
    To https://git-wip-us.apache.org/repos/asf/jena.git
     * [new branch]      JENA-1556-MutilingualEnhancements-3.8.0 -> JENA-1556-MutilingualEnhancements-3.8.0

unfortunately I don't know how to merge the new branch into `master`. This is my first time working directly with the Apache git repo.
[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517601#comment-16517601 ]

ASF GitHub Bot commented on JENA-1556:
--------------------------------------

Github user kinow commented on the issue:

    https://github.com/apache/jena/pull/436

@xristy if you'd like to merge it, and you cannot use the GitHub interface, then you can

* Add a remote in your git repository using the developerConnection URL from the pom.xml (https://github.com/apache/jena/blob/master/pom.xml) but without the scm:git part (i.e. https://git-wip-us.apache.org/repos/asf/jena.git)
* Merge the work and push to master
* If the commit tree is OK, GitHub should understand this branch was merged. But if you changed or didn't rebase the code, you can still close it manually, or amend a commit and add somewhere in the message "This closes pr#436"

HTH
[jira] [Commented] (JENA-1556) text:query multilingual enhancements
[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517544#comment-16517544 ]

ASF GitHub Bot commented on JENA-1556:
--------------------------------------

Github user xristy commented on the issue:

    https://github.com/apache/jena/pull/436

@afs I don't have write access to this GitHub mirror repo and don't immediately see how to work in `git-wip-us.apache.org`, so I'm not sure how to proceed with merging.

> text:query multilingual enhancements
>
>
> Key: JENA-1556
> URL: https://issues.apache.org/jira/browse/JENA-1556
> Project: Apache Jena
> Issue Type: New Feature
> Components: Text
> Affects Versions: Jena 3.7.0
> Reporter: Code Ferret
> Assignee: Code Ferret
> Priority: Major
> Labels: pull-request-available
>
>
> This issue proposes two related enhancements of Jena Text. These enhancements
> have been implemented and a PR can be issued.
> There are two multilingual search situations that we want to support:
> # We want to be able to search in one encoding and retrieve results that may
> have been entered in other encodings. For example, searching via Simplified
> Chinese (Hans) and retrieving results that may have been entered in
> Traditional Chinese (Hant) or Pinyin. This will simplify applications by
> permitting encoding independent retrieval without additional layers of
> transcoding and so on. It's all done under the covers in Lucene.
> # We want to search with queries entered in a lossy, e.g., phonetic,
> encoding and retrieve results entered with accurate encoding. For example,
> searching via Pinyin without diacritics and retrieving all possible Hans and
> Hant triples.
> The first situation arises when entering triples that include languages with
> multiple encodings that for various reasons are not normalized to a single
> encoding. In this situation we want to be able to retrieve appropriate result
> sets without regard for the encodings used at the time that the triples were
> inserted into the dataset.
> There are several such languages of interest in our application: Chinese,
> Tibetan, Sanskrit, Japanese and Korean. There are various Romanizations and
> ideographic variants.
> Encodings may not be normalized when inserting triples for a variety of reasons.
> A principal one is that the {{rdf:langString}} object often must be entered
> in the same encoding that it occurs in some physical text that is being
> catalogued. Another is that metadata may be imported from sources that use
> different encoding conventions and we want to preserve that form.
> The second situation arises as we want to provide simple support for phonetic
> or other forms of lossy search at the time that triples are indexed directly
> in the Lucene system.
> To handle the first situation we introduce a {{text}} assembler predicate,
> {{text:searchFor}}, that specifies a list of language tags that provides a
> list of language variants that should be searched whenever a query string of
> a given encoding (language tag) is used. For example, the following
> {{text:TextIndexLucene/text:defineAnalyzers}} fragment:
> {code:java}
> [ text:addLang "bo" ;
>   text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>   text:analyzer [
>     a text:GenericAnalyzer ;
>     text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>     text:params (
>         [ text:paramName "segmentInWords" ;
>           text:paramValue false ]
>         [ text:paramName "lemmatize" ;
>           text:paramValue true ]
>         [ text:paramName "filterChars" ;
>           text:paramValue false ]
>         [ text:paramName "inputMode" ;
>           text:paramValue "unicode" ]
>         [ text:paramName "stopFilename" ;
>           text:paramValue "" ]
>         )
>     ] ;
>   ]
> {code}
> indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the
> Lucene index should also be searched for matches tagged as {{bo-x-ewts}} and
> {{bo-alalc97}}.
> This is made possible by a Tibetan {{Analyzer}} that tokenizes strings in all
> three encodings into Tibetan Unicode. This is feasible since the
> {{bo-x-ewts}} and {{bo-alalc97}} encodings are one-to-one with Unicode
> Tibetan. Since all fields with these language tags will have a common set of
> indexed terms, i.e., Tibetan Unicode, it suffices to arrange for the query
> analyzer to have access to the language tag for the query string along with
> the various fields that need to be considered.
> Supposing that the query is:
> {code:java}
> (?s ?sc ?lit) text:query ("rje"@bo-x-ewts)
> {code}
> Then the query formed in {{TextIndexLucene}} will be:
> {code:java}
>
Re: CMS diff: TDB Datasets
Greg,

Could you create a JIRA ticket for this please? It is something that looks addressable. The solution proposed (using union graph) is a bit specialised.

    Andy

The query may be better if written (but the "..." may be making a difference.)

    GRAPH dataset:SmallB {
        ?b rdf:type my:BThing.
        ?b my:hasData ?bData.
        FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 2.0"^^my:dataLiteral))
    }
    GRAPH dataset:BigA {
        ?a rdf:type my:AThing.
        ?a noa:hasGeometry ?aData.
    }
    FILTER(my:filterFunction1(?bData, ?aData))

On 19/06/18 10:59, Greg Albiston wrote:

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Ftdb%2Fdatasets.mdtext

Greg Albiston

Index: trunk/content/documentation/tdb/datasets.mdtext
===
--- trunk/content/documentation/tdb/datasets.mdtext (revision 1833775)
+++ trunk/content/documentation/tdb/datasets.mdtext (working copy)
@@ -51,6 +51,51 @@
 ...
 }

+### Named Graphs & Filters
+
+Named graphs provide a convenient way to organise and store your data.
+However, be aware that in certain situations named graphs can make it difficult for the query optimiser.
+
+For example, a query with the following structure took 29 minutes to complete:
+
+SELECT ?b ...
+WHERE {
+
+GRAPH dataset:BigA {
+?a rdf:type my:AThing.
+?a noa:hasGeometry ?aData.
+...
+}
+
+GRAPH dataset:SmallB {
+?b rdf:type my:BThing.
+?b my:hasData ?bData.
+...
+}
+
+FILTER(my:filterFunction1(?bData, ?aData))
+FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 2.0"^^my:dataLiteral) )
+
+}
+
+The completion duration was reduced to 7 seconds by applying the global TDB.symUnionDefaultGraph option (see above) to the dataset and modifying the query as follows:
+
+SELECT ?b ...
+WHERE {
+
+?a rdf:type my:AThing.
+?a noa:hasGeometry ?aData.
+...
+
+?b rdf:type my:BThing.
+?b my:hasData ?bData.
+...
+
+FILTER(my:filterFunction1(?bData, ?aData))
+FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 2.0"^^my:dataLiteral) )
+
+}
+
 ## Special Graph Names

 URI | Meaning
[jira] [Resolved] (JENA-1564) StageGeneratorGeneric does not apply BGP reordering properly
[ https://issues.apache.org/jira/browse/JENA-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne resolved JENA-1564.
--------------------------------
    Resolution: Fixed

> StageGeneratorGeneric does not apply BGP reordering properly
>
>
> Key: JENA-1564
> URL: https://issues.apache.org/jira/browse/JENA-1564
> Project: Apache Jena
> Issue Type: Bug
> Affects Versions: Jena 3.7.0
> Reporter: Andy Seaborne
> Assignee: Andy Seaborne
> Priority: Major
> Fix For: Jena 3.8.0
>
>
> The reorder steps:
> {noformat}
> ReorderProc reorderProc = reorder.reorderIndexes(bgp2) ;
> pattern = reorderProc.reorder(pattern) ;
> {noformat}
> are inside {{if ( ! input.isJoinIdentity() )}}
> but reordering should happen anyway.
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
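
For readers who want the shape of the JENA-1564 problem without opening the Jena source, here is a minimal sketch in plain Java. It does not use or reproduce the real `StageGeneratorGeneric` code: `ReorderSketch`, `Pattern`, `reorder`, `planBuggy` and `planFixed` are all hypothetical stand-ins, and the "weight" ordering is only an assumed placeholder for whatever the real reorder transformation does. The one thing taken from the report is the control flow: the reorder call sits inside the `! input.isJoinIdentity()` branch, although it should run either way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in types; none of these names come from Jena.
class ReorderSketch {

    /** A triple pattern with an assumed cost estimate (lower = more selective). */
    static final class Pattern {
        final String text;
        final int weight;

        Pattern(String text, int weight) {
            this.text = text;
            this.weight = weight;
        }

        @Override
        public String toString() {
            return text;
        }
    }

    /** Placeholder reorder step: sorts a copy of the BGP so the lowest weight comes first. */
    static List<Pattern> reorder(List<Pattern> bgp) {
        List<Pattern> out = new ArrayList<>(bgp);
        out.sort(Comparator.comparingInt(p -> p.weight));
        return out;
    }

    /** Control flow as reported: reordering is skipped when the input is the join identity. */
    static List<Pattern> planBuggy(List<Pattern> bgp, boolean inputIsJoinIdentity) {
        if (!inputIsJoinIdentity) {
            bgp = reorder(bgp);
        }
        return bgp;
    }

    /** Control flow as requested: reordering happens regardless of the input. */
    static List<Pattern> planFixed(List<Pattern> bgp, boolean inputIsJoinIdentity) {
        if (!inputIsJoinIdentity) {
            // Assumption: any work that genuinely depends on the incoming
            // bindings would stay in this branch; the report does not show it.
        }
        return reorder(bgp);
    }

    public static void main(String[] args) {
        List<Pattern> bgp = Arrays.asList(
                new Pattern("?s ?p ?o", 3),
                new Pattern("?s rdf:type my:Thing", 1));

        // With a join-identity input the buggy variant leaves the BGP as written,
        // while the fixed variant still moves the more selective pattern first.
        System.out.println("buggy: " + planBuggy(bgp, true));
        System.out.println("fixed: " + planFixed(bgp, true));
    }
}
```

The join-identity case is the interesting one because it is the common case for a top-level basic graph pattern, where no bindings flow in from an earlier step.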
[jira] [Commented] (JENA-1564) StageGeneratorGeneric does not apply BGP reordering properly
[ https://issues.apache.org/jira/browse/JENA-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516897#comment-16516897 ]

ASF GitHub Bot commented on JENA-1564:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/jena/pull/437