[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517628#comment-16517628
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user xristy commented on the issue:

https://github.com/apache/jena/pull/436
  
@kinow Thank you very much for the pointers. I think I'm closer. 

I added a remote, apache-origin (I suppose I should have called it 
jena-origin).

Then I ran:

git push apache-origin

and got an encouraging result:

Counting objects: 3316, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (1017/1017), done.
Writing objects: 100% (3316/3316), 795.73 KiB | 99.47 MiB/s, done.
Total 3316 (delta 1505), reused 3218 (delta 1453)
remote: jena git commit: added auxIndex unit test
remote: jena git commit: added searchFor unit tests
remote: jena git commit: various cleanup per @kinow
remote: jena git commit: cleanup per comments from afs
remote: jena git commit: JENA-1556 implementation
To https://git-wip-us.apache.org/repos/asf/jena.git
 * [new branch]      JENA-1556-MutilingualEnhancements-3.8.0 -> JENA-1556-MutilingualEnhancements-3.8.0

Unfortunately, I don't know how to merge the new branch into `master`. This 
is my first time working directly with the Apache git repo.





[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517601#comment-16517601
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/436
  
@xristy if you'd like to merge it and you cannot use the GitHub interface, 
then you can:

* Add a remote to your git repository using the developerConnection URL 
from the pom.xml (https://github.com/apache/jena/blob/master/pom.xml) but 
without the scm:git part (i.e. https://git-wip-us.apache.org/repos/asf/jena.git)
* Merge the work and push to master
* If the commit tree is OK, GitHub should recognise that this branch was 
merged. If you changed the code or didn't rebase it, you can still close the 
PR manually, or amend a commit and add "This closes pr#436" somewhere in the 
message

HTH




[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517544#comment-16517544
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user xristy commented on the issue:

https://github.com/apache/jena/pull/436
  
@afs I don't have write access to this GitHub mirror repo and don't 
immediately see how to work in the `git-wip-us.apache.org` repo, so I'm not 
sure how to proceed with merging.


> text:query multilingual enhancements
> 
>
>                 Key: JENA-1556
>                 URL: https://issues.apache.org/jira/browse/JENA-1556
>             Project: Apache Jena
>          Issue Type: New Feature
>          Components: Text
>    Affects Versions: Jena 3.7.0
>            Reporter: Code Ferret
>            Assignee: Code Ferret
>            Priority: Major
>              Labels: pull-request-available
>
> This issue proposes two related enhancements of Jena Text. These enhancements 
> have been implemented and a PR can be issued. 
> There are two multilingual search situations that we want to support:
>  # We want to be able to search in one encoding and retrieve results that may 
> have been entered in other encodings. For example, searching via Simplified 
> Chinese (Hans) and retrieving results that may have been entered in 
> Traditional Chinese (Hant) or Pinyin. This will simplify applications by 
> permitting encoding-independent retrieval without additional layers of 
> transcoding and so on. It's all done under the covers in Lucene.
>  # We want to search with queries entered in a lossy, e.g., phonetic, 
> encoding and retrieve results entered with accurate encoding. For example, 
> searching via Pinyin without diacritics and retrieving all possible Hans and 
> Hant triples.
> The first situation arises when entering triples that include languages with 
> multiple encodings that for various reasons are not normalized to a single 
> encoding. In this situation we want to be able to retrieve appropriate result 
> sets without regard for the encodings used at the time that the triples were 
> inserted into the dataset.
> There are several such languages of interest in our application: Chinese, 
> Tibetan, Sanskrit, Japanese and Korean. There are various Romanizations and 
> ideographic variants.
> Encodings may not be normalized when inserting triples for a variety of reasons. 
> A principal one is that the {{rdf:langString}} object often must be entered 
> in the same encoding that it occurs in some physical text that is being 
> catalogued. Another is that metadata may be imported from sources that use 
> different encoding conventions and we want to preserve that form.
> The second situation arises as we want to provide simple support for phonetic 
> or other forms of lossy search at the time that triples are indexed directly 
> in the Lucene system.
> To handle the first situation we introduce a {{text}} assembler predicate, 
> {{text:searchFor}}, that specifies the list of language tags (language 
> variants) to be searched whenever a query string with a given language tag 
> is used. For example, the following 
> {{text:TextIndexLucene/text:defineAnalyzers}} fragment:
> {code:java}
> [ text:addLang "bo" ; 
>   text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>   text:analyzer [ 
>     a text:GenericAnalyzer ;
>     text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>     text:params (
>       [ text:paramName "segmentInWords" ;
>         text:paramValue false ]
>       [ text:paramName "lemmatize" ;
>         text:paramValue true ]
>       [ text:paramName "filterChars" ;
>         text:paramValue false ]
>       [ text:paramName "inputMode" ;
>         text:paramValue "unicode" ]
>       [ text:paramName "stopFilename" ;
>         text:paramValue "" ]
>     )
>   ] ; 
> ]
> {code}
> indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the 
> Lucene index should also be searched for matches tagged as {{bo-x-ewts}} and 
> {{bo-alalc97}}.
> This is made possible by a Tibetan {{Analyzer}} that tokenizes strings in all 
> three encodings into Tibetan Unicode. This is feasible since the 
> {{bo-x-ewts}} and {{bo-alalc97}} encodings are one-to-one with Unicode 
> Tibetan. Since all fields with these language tags will have a common set of 
> indexed terms, i.e., Tibetan Unicode, it suffices to arrange for the query 
> analyzer to have access to the language tag for the query string along with 
> the various fields that need to be considered.
> Supposing that the query is:
> {code:java}
> (?s ?sc ?lit) text:query ("rje"@bo-x-ewts) 
> {code}
> Then the query formed in {{TextIndexLucene}} will be:
> {code:java}
> 
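
The quoted description breaks off before the generated Lucene query is shown, so the query itself is not reproduced here. The lookup that {{text:searchFor}} describes can still be illustrated. The following is a minimal sketch, not Jena's implementation: the class name, the in-memory map and the fallback for tags without a {{text:searchFor}} entry are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;

public class SearchForSketch {
    // Hypothetical in-memory form of the text:searchFor configuration:
    // query language tag -> language tags whose indexed literals are searched.
    static final Map<String, List<String>> SEARCH_FOR = Map.of(
        "bo", List.of("bo", "bo-x-ewts", "bo-alalc97"));

    // Assumption: a tag with no text:searchFor entry is searched on its own.
    static List<String> tagsToSearch(String queryLang) {
        return SEARCH_FOR.getOrDefault(queryLang, List.of(queryLang));
    }

    public static void main(String[] args) {
        // A query string tagged @bo is expanded to all three Tibetan variants.
        System.out.println(tagsToSearch("bo"));
        // A tag without a configured list is left unchanged.
        System.out.println(tagsToSearch("en"));
    }
}
```

With the fragment quoted above, a query tagged "bo" would be matched against literals tagged bo, bo-x-ewts and bo-alalc97, which works because the analyzer reduces all three to the same Tibetan Unicode terms.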



Re: CMS diff: TDB Datasets

2018-06-19 Thread Andy Seaborne

Greg,

Could you create a JIRA ticket for this please?  It is something that 
looks addressable.  The solution proposed (using union graph) is a bit 
specialised.


Andy

The query may do better if written as follows (though the "..." may be 
making a difference):


GRAPH dataset:SmallB {
  ?b rdf:type my:BThing.
  ?b my:hasData ?bData.
  FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 2.0"^^my:dataLiteral))
}

GRAPH dataset:BigA {
  ?a rdf:type my:AThing.
  ?a noa:hasGeometry ?aData.
}
FILTER(my:filterFunction1(?bData, ?aData))






[jira] [Resolved] (JENA-1564) StageGeneratorGeneric does not apply BGP reordering properly

2018-06-19 Thread Andy Seaborne (JIRA)


 [ 
https://issues.apache.org/jira/browse/JENA-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Seaborne resolved JENA-1564.
-
Resolution: Fixed

> StageGeneratorGeneric does not apply BGP reordering properly
> 
>
>                 Key: JENA-1564
>                 URL: https://issues.apache.org/jira/browse/JENA-1564
>             Project: Apache Jena
>          Issue Type: Bug
>    Affects Versions: Jena 3.7.0
>            Reporter: Andy Seaborne
>            Assignee: Andy Seaborne
>            Priority: Major
>             Fix For: Jena 3.8.0
>
>
> The reorder steps:
> {noformat}
> ReorderProc reorderProc = reorder.reorderIndexes(bgp2) ;
> pattern = reorderProc.reorder(pattern) ;
> {noformat}
> are inside {{if ( ! input.isJoinIdentity() )}}
> but reordering should happen anyway.
>  
>  
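
The shape of the problem described in the ticket can be sketched outside Jena. This is a simplified illustration, not the code in StageGeneratorGeneric: a basic graph pattern is modelled as a list of strings, the reordering is a hypothetical stand-in that reverses the list, and the ticket does not say what else the guarded block does, so that part is only a comment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

public class ReorderSketch {
    // Shape reported in the ticket: reordering only runs inside the guard,
    // so it is skipped when the input is the join identity.
    static List<String> before(List<String> pattern, boolean inputIsJoinIdentity,
                               UnaryOperator<List<String>> reorder) {
        if (!inputIsJoinIdentity) {
            // (work that depends on the incoming binding would go here)
            pattern = reorder.apply(pattern);
        }
        return pattern;
    }

    // Intended shape: the guard covers only binding-dependent work,
    // and reordering is applied in every case.
    static List<String> after(List<String> pattern, boolean inputIsJoinIdentity,
                              UnaryOperator<List<String>> reorder) {
        if (!inputIsJoinIdentity) {
            // (work that depends on the incoming binding would go here)
        }
        return reorder.apply(pattern);
    }

    // Stand-in reordering for the demo: reverse a copy of the pattern.
    static List<String> reversed(List<String> in) {
        List<String> out = new ArrayList<>(in);
        Collections.reverse(out);
        return out;
    }

    public static void main(String[] args) {
        List<String> bgp = List.of("t1", "t2", "t3");
        System.out.println(before(bgp, true, ReorderSketch::reversed)); // left as given
        System.out.println(after(bgp, true, ReorderSketch::reversed));  // reordered
    }
}
```

Only the join-identity case differs between the two methods: when the input is not the join identity, both apply the reordering.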



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


CMS diff: TDB Datasets

2018-06-19 Thread Greg Albiston
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Ftdb%2Fdatasets.mdtext

Greg Albiston

Index: trunk/content/documentation/tdb/datasets.mdtext
===
--- trunk/content/documentation/tdb/datasets.mdtext (revision 1833775)
+++ trunk/content/documentation/tdb/datasets.mdtext (working copy)
@@ -51,6 +51,51 @@
 ...
 }
 
+### Named Graphs & Filters
+
+Named graphs provide a convenient way to organise and store your data. 
+However, be aware that in certain situations named graphs can make it difficult for the query optimiser.
+
+For example, a query with the following structure took 29 minutes to complete:
+
+SELECT ?b ...
+WHERE {
+
+GRAPH dataset:BigA {
+?a rdf:type my:AThing.
+?a noa:hasGeometry ?aData.
+...
+}
+   
+GRAPH dataset:SmallB {
+?b rdf:type my:BThing.
+?b my:hasData ?bData.
+...
+}
+
+FILTER(my:filterFunction1(?bData, ?aData))
+FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 2.0"^^my:dataLiteral) )
+
+}
+ 
+The completion duration was reduced to 7 seconds by applying the global TDB.symUnionDefaultGraph option (see above) to the dataset and modifying the query as follows:
+
+SELECT ?b ...
+WHERE {
+
+?a rdf:type my:AThing.
+?a noa:hasGeometry ?aData.
+...
+
+?b rdf:type my:BThing.
+?b my:hasData ?bData.
+...
+
+FILTER(my:filterFunction1(?bData, ?aData))
+FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 2.0"^^my:dataLiteral) )
+
+}
+
 ## Special Graph Names
 
 URI | Meaning



[jira] [Commented] (JENA-1564) StageGeneratorGeneric does not apply BGP reordering properly

2018-06-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516897#comment-16516897
 ] 

ASF GitHub Bot commented on JENA-1564:
--

Github user asfgit closed the pull request at:

https://github.com/apache/jena/pull/437

