[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518605#comment-16518605
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user xristy commented on the issue:

https://github.com/apache/jena/pull/436
  
Thanks everyone. I seem to have finally merged PR#436. I ended up using 
@rvesse's suggestion plus a bit from the commit workflow (which I agree should be linked on 
the Contributing page):

> git checkout master
> git pull apache master
> git merge Jena-1556-MutilingualEnhancements-3.8.0
> mvn clean install -Pdev
> git push apache master

I put the proper definitions in jena/.git/config for the remotes.


> text:query multilingual enhancements
> 
>
> Key: JENA-1556
> URL: https://issues.apache.org/jira/browse/JENA-1556
> Project: Apache Jena
>  Issue Type: New Feature
>  Components: Text
>Affects Versions: Jena 3.7.0
>Reporter: Code Ferret
>Assignee: Code Ferret
>Priority: Major
>  Labels: pull-request-available
>
> This issue proposes two related enhancements of Jena Text. These enhancements 
> have been implemented and a PR can be issued. 
> There are two multilingual search situations that we want to support:
>  # We want to be able to search in one encoding and retrieve results that may 
> have been entered in other encodings. For example, searching via Simplified 
> Chinese (Hans) and retrieving results that may have been entered in 
> Traditional Chinese (Hant) or Pinyin. This will simplify applications by 
> permitting encoding independent retrieval without additional layers of 
> transcoding and so on. It's all done under the covers in Lucene.
>  # We want to search with queries entered in a lossy, e.g., phonetic, 
> encoding and retrieve results entered with accurate encoding. For example, 
> searching via Pinyin without diacritics and retrieving all possible Hans and 
> Hant triples.
> The first situation arises when entering triples that include languages with 
> multiple encodings that for various reasons are not normalized to a single 
> encoding. In this situation we want to be able to retrieve appropriate result 
> sets without regard for the encodings used at the time that the triples were 
> inserted into the dataset.
> There are several such languages of interest in our application: Chinese, 
> Tibetan, Sanskrit, Japanese and Korean. There are various Romanizations and 
> ideographic variants.
> Encodings may not be normalized when inserting triples for a variety of reasons. 
> A principal one is that the {{rdf:langString}} object often must be entered 
> in the same encoding that it occurs in some physical text that is being 
> catalogued. Another is that metadata may be imported from sources that use 
> different encoding conventions and we want to preserve that form.
> The second situation arises as we want to provide simple support for phonetic 
> or other forms of lossy search at the time that triples are indexed directly 
> in the Lucene system.
> To handle the first situation we introduce a {{text}} assembler predicate, 
> {{text:searchFor}}, that specifies the list of language variants that should 
> be searched whenever a query string of a given encoding (language tag) is 
> used. For example, the following 
> {{text:TextIndexLucene/text:defineAnalyzers}} fragment:
> {code:java}
> [ text:addLang "bo" ; 
>   text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>   text:analyzer [ 
> a text:GenericAnalyzer ;
> text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
> text:params (
> [ text:paramName "segmentInWords" ;
>   text:paramValue false ]
> [ text:paramName "lemmatize" ;
>   text:paramValue true ]
> [ text:paramName "filterChars" ;
>   text:paramValue false ]
> [ text:paramName "inputMode" ;
>   text:paramValue "unicode" ]
> [ text:paramName "stopFilename" ;
>   text:paramValue "" ]
> )
> ] ; 
>   ]
> {code}
> indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the 
> Lucene index should also be searched for matches tagged as {{bo-x-ewts}} and 
> {{bo-alalc97}}.
> This is made possible by a Tibetan {{Analyzer}} that tokenizes strings in all 
> three encodings into Tibetan Unicode. This is feasible since the 
> {{bo-x-ewts}} and {{bo-alalc97}} encodings are one-to-one with Unicode 
> Tibetan. Since all fields with these language tags will have a common set of 
> indexed terms, i.e., Tibetan Unicode, it suffices to arrange for the query 
> analyzer to have access to the language tag for the query string along with 
> the various fields that need to be considered.
> Supposing that the query is:
> {code:java}
> (?s ?sc ?lit) text:query ("rje"@bo-x-ewts) 
> {code}
> Then the query formed in {{TextIndexLucene}} will be:
> {code:java}
> label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
> {code}
> which is translated using a suitable {{Analyzer}}, 
> {{QueryMultilingualAnalyzer}}, via Lucene's {{QueryParser}} 
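The multi-field expansion described in the issue (a tagged term such as "rje"@bo-x-ewts searched across label_bo, label_bo-x-ewts, and label_bo-alalc97) can be sketched outside of Jena. The following is an illustration only — a hypothetical `expand_query` helper, not the actual `TextIndexLucene` code — assuming a map from each language tag to its `text:searchFor` variants:

```python
def expand_query(term, lang, search_for):
    # search_for maps a language tag to the list of variants to search,
    # i.e. the text:searchFor list from the assembler configuration.
    # A tag with no entry falls back to searching only its own field.
    tags = search_for.get(lang, [lang])
    return " ".join("label_%s:%s" % (tag, term) for tag in tags)

# The assembler fragment registers "bo" with variants bo, bo-x-ewts and
# bo-alalc97; here each member tag shares the whole group, so a query in
# any of the three encodings searches all three fields.
group = ["bo", "bo-x-ewts", "bo-alalc97"]
search_for = {tag: group for tag in group}

print(expand_query("rje", "bo-x-ewts", search_for))
# -> label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
```

This matches the expanded query string shown in the issue; the real implementation builds the query through Lucene's analyzers rather than by string concatenation.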

[GitHub] jena pull request #436: JENA-1556 implementation

2018-06-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/jena/pull/436


---



Re: Commit workflow

2018-06-20 Thread ajs6f
As for gitbox: the last result of my investigation was that we won't 
_currently_ have the same JIRA integrations that we have now, because INFRA 
hasn't set that up yet. (And they would love some help doing that!)

I left the matter there. For my money, we might do well to revisit the JIRA 
integrations and make the switch...

ajs6f

> On Jun 20, 2018, at 9:52 AM, Aaron Coburn  wrote:
> 
> Thanks Andy, this is very helpful.
> 
> Best,
> Aaron
> 
>> On Jun 20, 2018, at 6:02 AM, Andy Seaborne  wrote:
>> 
>> Aaron, Chris, all,
>> 
>> One thing to mention - with the master repo on Apache hardware and PRs on 
>> the mirror (which we don't have write access to) there is a workflow for 
>> merging into the Apache github repo:
>> 
>> https://cwiki.apache.org/confluence/display/JENA/Commit+Workflow+for+Github-ASF
>> 
>>   Andy
>> 
>> (gitbox?)
> 



Re: Commit workflow

2018-06-20 Thread Aaron Coburn
Thanks Andy, this is very helpful.

Best,
Aaron

> On Jun 20, 2018, at 6:02 AM, Andy Seaborne  wrote:
> 
> Aaron, Chris, all,
> 
> One thing to mention - with the master repo on Apache hardware and PRs on the 
> mirror (which we don't have write access to) there is a workflow for merging 
> into the Apache github repo:
> 
> https://cwiki.apache.org/confluence/display/JENA/Commit+Workflow+for+Github-ASF
> 
>Andy
> 
> (gitbox?)



[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518005#comment-16518005
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user kinow commented on the issue:

https://github.com/apache/jena/pull/436
  
The docs look really good! The only feedback I have:

* It's possible to avoid the empty commit by amending the last commit 
to add "this closes #123". That's what I normally do when I work on other 
projects.
* Should we also merge with --no-ff, or is fetch/rebase all right too? I've 
done it a few times, but I'm happy to do merge with --no-ff instead if desirable.
* Perhaps a link on the website for it, under Contributing? Or a new page? 
It's probably related to 
http://jena.apache.org/getting_involved/reviewing_contributions.html too (or 
maybe there's already a link somewhere?)

Thanks



Commit workflow

2018-06-20 Thread Andy Seaborne

Aaron, Chris, all,

One thing to mention - with the master repo on Apache hardware and PRs 
on the mirror (which we don't have write access to) there is a workflow 
for merging into the Apache github repo:


https://cwiki.apache.org/confluence/display/JENA/Commit+Workflow+for+Github-ASF

Andy

(gitbox?)


RE: CMS diff: TDB Datasets

2018-06-20 Thread Greg Albiston
Hi Andy,

Thanks for the response. Your suggestion worked and the query completed in a 
similar time to the union graph approach.
I'd tried moving the filter into the graph clause but not swapping the graph 
order.

I added that update to the documentation so that anyone else having similar 
problems might find it helpful.
Do you still want me to create a JIRA for it?

More generally, is there a page/section of tips on query writing to help with 
optimisation? 
I searched but could only find a description of TDB's optimisation functionality 
and of extending query execution. I spent quite a while hunting for tips and 
trying different ways to influence the resolution order until I thought I'd try 
the union graph.

Thanks,

Greg 

-Original Message-
From: Andy Seaborne  
Sent: 19 June 2018 13:56
To: dev@jena.apache.org; Greg Albiston 
Subject: Re: CMS diff: TDB Datasets

Greg,

Could you create a JIRA ticket for this please?  It is something that looks 
addressable.  The solution proposed (using union graph) is a bit specialised.

 Andy

The query may be better if written as follows (but the "..." may be making a 
difference):

    GRAPH dataset:SmallB {
        ?b rdf:type my:BThing .
        ?b my:hasData ?bData .
        FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 2.0"^^my:dataLiteral))
    }
    GRAPH dataset:BigA {
        ?a rdf:type my:AThing .
        ?a noa:hasGeometry ?aData .
    }
    FILTER(my:filterFunction1(?bData, ?aData))
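The effect of pushing the single-graph FILTER inside its GRAPH clause can be shown with a toy join — plain Python lists standing in for graph-pattern bindings. This is an illustration of the principle only, not how TDB's optimiser actually works:

```python
# Bindings a toy engine might produce for each GRAPH clause.
small_b = [("b1", 10), ("b2", 99), ("b3", 10)]    # (?b, ?bData) rows
big_a = [("a%d" % i, i) for i in range(1000)]     # (?a, ?aData) rows

def filter2(bdata):
    # Stand-in for my:filterFunction2(?bData, "..."): keep rows with value 10.
    return bdata == 10

# Filter applied after the cross-graph join: every pair is materialised
# first, and filter2 runs once per pair (3000 calls here).
late = [(b, a) for b in small_b for a in big_a if filter2(b[1])]

# Filter pushed inside the SmallB clause: filter2 runs once per ?b binding
# (3 calls), and the join starts from 2 rows instead of 3.
early_b = [b for b in small_b if filter2(b[1])]
pushed = [(b, a) for b in early_b for a in big_a]

assert late == pushed   # same solutions, far less intermediate work
```

The same reasoning explains the union-graph speed-up reported above: once the per-graph boundaries are gone, the engine is free to start from the most selective patterns.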



On 19/06/18 10:59, Greg Albiston wrote:
> Clone URL (Committers only):
> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Ftdb%2Fdatasets.mdtext
> 
> Greg Albiston
> 
> Index: trunk/content/documentation/tdb/datasets.mdtext
> ===
> --- trunk/content/documentation/tdb/datasets.mdtext   (revision 1833775)
> +++ trunk/content/documentation/tdb/datasets.mdtext   (working copy)
> @@ -51,6 +51,51 @@
>   ...
>   }
>   
> +### Named Graphs & Filters
> +
> +Named graphs provide a convenient way to organise and store your data.
> +However, be aware that in certain situations named graphs can make it 
> difficult for the query optimiser.
> +
> +For example, a query with the following structure took 29 minutes to 
> complete:
> +
> +SELECT ?b ...
> +WHERE {
> +
> +GRAPH dataset:BigA {
> +?a rdf:type my:AThing.
> +?a noa:hasGeometry ?aData.
> +...
> +}
> + 
> +GRAPH dataset:SmallB {
> +?b rdf:type my:BThing.
> +?b my:hasData ?bData.
> +...  
> +}
> +
> +FILTER(my:filterFunction1(?bData, ?aData))
> +FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 
> + 2.0"^^my:dataLiteral) )
> +
> +}
> +
> +The completion duration was reduced to 7 seconds by applying the global 
> TDB.symUnionDefaultGraph option (see above) to the dataset and modifying the 
> query as follows:
> +
> +SELECT ?b ...
> +WHERE {
> +
> +?a rdf:type my:AThing.
> +?a noa:hasGeometry ?aData.
> +...
> +
> +?b rdf:type my:BThing.
> +?b my:hasData ?bData.
> +...  
> +
> +FILTER(my:filterFunction1(?bData, ?aData))
> +FILTER(my:filterFunction2(?bData, "1.0 3.0, 4.0 
> + 2.0"^^my:dataLiteral) )
> +
> +}
> +
>   ## Special Graph Names
>   
>   URI | Meaning
> 


[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517984#comment-16517984
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user afs commented on the issue:

https://github.com/apache/jena/pull/436
  
I have a copy of Jena cloned from git-wip-us.apache.org and added a second 
remote for github (this is so it can be referred to as "github", not by the full 
URL "https://github.com/apache/jena.git").

To merge, I have been using that one: merge from github (--no-ff) into 
local master, then push to origin/master.

Is there anything we can do to improve 
https://cwiki.apache.org/confluence/display/JENA/Commit+Workflow+for+Github-ASF?



[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517973#comment-16517973
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user afs commented on the issue:

https://github.com/apache/jena/pull/436
  
Committer workflow:

https://cwiki.apache.org/confluence/display/JENA/Commit+Workflow+for+Github-ASF




[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517938#comment-16517938
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user rvesse commented on the issue:

https://github.com/apache/jena/pull/436
  
@xristy I think you can just merge as you would normally, though you 
probably need to be explicit about the remote and branch you are using for 
remote operations:

```
> git checkout master
> git pull apache-origin master
> git merge Jena-1556-MutilingualEnhancements-3.8.0
> git push apache-origin master
```

