CMS diff: Jena Full Text Search

2019-04-04 Thread Damien Fontaine
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Damien Fontaine

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1856904)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -952,6 +952,11 @@
 "(john jon jonathan~) peters*".  This is useful for performing wildcard
 or fuzzy queries on individual terms in a phrase.
 
+* `SurroundQueryParser`: Provides positional operators (w and n)
+that accept a numeric distance, as well as boolean
+operators (and, or, and not), wildcards (* and ?), quoting (with "),
+and boosting (via ^).
+
 The query parser is specified on
 the `TextIndexLucene` resource:
 
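
For readers of the archive, here is a minimal sketch of how the parser added in the patch above might be selected. It assumes the same `text:queryParser` property that the assembler example elsewhere in this thread uses for `text:AnalyzingQueryParser`; the resource names `<#indexLucene>` and `<#entMap>`, the field name `label`, and the index path are placeholders, not values taken from the patch.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix text: <http://jena.apache.org/text#> .

# Lucene index whose query strings are parsed by the surround parser.
# The directory below is a placeholder path, not a recommended location.
<#indexLucene> a text:TextIndexLucene ;
    text:directory   <file:lucene-index> ;
    text:entityMap   <#entMap> ;
    text:queryParser text:SurroundQueryParser ;   # assumed spelling, per the patch
    .

# Minimal entity map: one indexed property, used as the default field.
<#entMap> a text:EntityMap ;
    text:entityField  "uri" ;
    text:defaultField "label" ;
    text:map (
        [ text:field "label" ; text:predicate rdfs:label ]
    ) .
```

With such a configuration, a query string like `3w(brown, fox)` would presumably be read as a positional query with distance 3, going by the operator description in the patch.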



Re: CMS diff: Jena Full Text Search

2019-03-13 Thread Andy Seaborne

Done.

On 25/02/2019 15:22, Chris Tomlinson wrote:

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1854295)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -4,7 +4,7 @@
  [Lucene](https://lucene.apache.org) or
  [ElasticSearch](https://www.elastic.co) (built on
  Lucene). It gives applications the ability to perform indexed full text
-searches within SPARQL queries. Here is a compatibility table:
+searches within SPARQL queries. Here is a version compatibility table:
  
  | Jena | Lucene |  Solr | ElasticSearch |
  |------|--------|-------|---------------|



CMS diff: Jena Full Text Search

2019-02-25 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1854295)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -4,7 +4,7 @@
 [Lucene](https://lucene.apache.org) or
 [ElasticSearch](https://www.elastic.co) (built on
 Lucene). It gives applications the ability to perform indexed full text
-searches within SPARQL queries. Here is a compatibility table:
+searches within SPARQL queries. Here is a version compatibility table:
 
 | Jena | Lucene |  Solr | ElasticSearch |
 |------|--------|-------|---------------|



Re: CMS diff: Jena Full Text Search

2019-02-22 Thread Chris Tomlinson
I have (finally!) updated the jena text query documentation for the 
improvements that Vincent Ventresque submitted.

Thank you Vincent for the contribution and your patience.

Regards,
Chris


> On Jan 23, 2019, at 12:01 PM, vincent.ventres...@ens-lyon.fr 
>  wrote:
> 
> Clone URL (Committers only):
> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
> 
> vincent.ventres...@ens-lyon.fr
> 
> Index: trunk/content/documentation/query/text-query.mdtext
> ===
> --- trunk/content/documentation/query/text-query.mdtext   (revision 
> 1851871)
> +++ trunk/content/documentation/query/text-query.mdtext   (working copy)
> @@ -609,21 +609,47 @@
> index field. More complex setups, with multiple properties per entity
> (URI) are possible.
> 
> +The assembler file can be either default configuration file 
> (.../run/config.ttl)
> +or a custom file in ...run/configuration folder. Note that you can use 
> several files
> +simultaneously.
> +
> +You have to edit the file (see comments in the assembler code below):
> +
> +1. provide values for paths and a fixed URI for tdb:DatasetTDB
> +2. modify the entity map : add the fields you want to index and desired 
> options (filters, tokenizers...)
> +
> +If your assembler file is run/config.ttl, you can index the dataset with 
> this command :
> +
> +java -cp ./fuseki-server.jar jena.textindexer --desc=run/config.ttl
> +
> Once configured, any data added to the text dataset is automatically
> -indexed as well.
> +indexed as well : 
> https://jena.apache.org/documentation/query/text-query.html#building-a-text-index
> 
> +When you change the jena-text in significant ways, such as changing what 
> analyzer 
> +is used for a given property and so on, then you’ll need to rebuild the 
> Lucene index 
> +via reloading the dataset or using the textIndexer.
> +
> ### Text Dataset Assembler
> 
> The following is an example of a TDB dataset with a text index.
> 
> + Example of a TDB dataset and text index#
> +# The main doc sources are:
> +#  - 
> https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html
> +#  - https://jena.apache.org/documentation/assembler/assembler-howto.html
> +#  - https://jena.apache.org/documentation/assembler/assembler.ttl
> +# See https://jena.apache.org/documentation/fuseki2/fuseki-layout.html 
> for the destination of this file.
> +#
> +
> @prefix : .
> @prefix rdf:  .
> @prefix rdfs: .
> @prefix tdb:  .
> @prefix ja:   .
> @prefix text: .
> +@prefix skos: 
> +@prefix fuseki:   .
> 
> -## Example of a TDB dataset and text index
> ## Initialize TDB
> [] ja:loadClass "org.apache.jena.tdb.TDB" .
> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
> @@ -631,39 +657,64 @@
> 
> ## Initialize text query
> [] ja:loadClass   "org.apache.jena.query.text.TextQuery" .
> +
> # A TextDataset is a regular dataset with a text index.
> text:TextDataset  rdfs:subClassOf   ja:RDFDataset .
> +
> # Lucene index
> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
> -# Elasticsearch index
> -text:TextIndexES      rdfs:subClassOf   text:TextIndex .
> 
> +
> ## ---
> -## This URI must be fixed - it's used to assemble the text dataset.
> 
> :text_dataset rdf:type text:TextDataset ;
> -text:dataset   <#dataset> ;
> +text:dataset   :my_dataset ; # <-- 
> replace `:my_dataset` with the desired URI
> text:index <#indexLucene> ;
> -.
> +.
> 
> # A TDB dataset used for RDF storage
> -<#dataset> rdf:type  tdb:DatasetTDB ;
> -tdb:location "DB" ;
> -tdb:unionDefaultGraph true ; # Optional
> -.
> 
> -# Text index description
> +:my_dataset rdf:type  tdb:DatasetTDB ;   # <-- 
> replace `:my_dataset` with the desired URI
> +tdb:location "/tmp/tdb-dataset/" ;   # <-- 
> replace `/tmp/tdb-dataset/` with your path 
> (`.../fuseki/run/databases/MY_DATASET`)
> +#tdb:unionDefaultGraph true ; # Optional
> +.
> +
> +# Text index description (see documentation for other options)
> +
> <#indexLucene> a text:TextIndexLucene ;
> -text:directory  ;
> +text:directory  ;# <-- 
> replace ` with your path` 
> (``)
> 

Re: CMS diff: Jena Full Text Search

2019-02-05 Thread ajs6f
I think I let this drop, and I apologize. Vincent, can you resend your patch, 
but this time without the unnecessary assertions (as pointed out by Andy et 
al.)? If not, that's okay, I can reassemble it myself from the mailing list.

Thanks either way!

ajs6f

> On Jan 29, 2019, at 4:42 AM, Andy Seaborne  wrote:
> 
> 
> 
> On 28/01/2019 21:01, vincent ventresque wrote:
>> Hi Ajs6f
>> Thanks for including me in the conversation, but I have to confess I've 
>> never looked at java classes (I only use command line tools).
>> Le 28/01/2019 à 21:05, ajs6f a écrit :
>>>> On Jan 28, 2019, at 2:57 PM, Chris Tomlinson wrote:
>>>> 
>>>> Hi Adam,
>>>> 
>>>> I haven’t seen that error. What I’ve done in the past is to replace the 
>>>> jena-text doc file with the new contents in Eclipse in an SVN checkout of 
>>>> the jena-doc-site and then committed.
>>> I can definitely do that (and will when we're happy with the patch), but 
>>> see below.
>>> 
>>>> Out of curiosity when is it necessary to use the
>>>> 
>>>>  [] ja:loadClass "org.apache.jena.tdb.TDB" .
>>>> 
>>>> and
>>>> 
>>>>  [] ja:loadClass   "org.apache.jena.query.text.TextQuery" .
> 
> It's not necessary any more.
> 
> The ServiceLoader Jena initialization does this.
> 
> The other initialization step should also be unnecessary:
> 
> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
> 
 
>>>> ? I do not use them in the config when running fuseki war in tomcat.
>>> I have no idea whatsoever! :grin: I wouldn't have thought them needed 
>>> either.
>>> 
>>> Vincent-- any comment?
>>> 
>>> ajs6f
>>> 
>>> 
>>>> Regards,
>>>> Chris
>>>> 
>>>> 
>>>> 
>>>>> On Jan 28, 2019, at 11:11 AM, ajs6f  wrote:
>>>>> 
>>>>> Recently Vincent offered a nice patch to our text indexing documentation, 
>>>>> as shown below. Oddly, when I now go to merge it (a bit late, sorry!), I 
>>>>> get an error: "Can't locate anonymous's tree to clone". Is anyone 
>>>>> familiar with that? I know very little about the SVN-based CMS, so I'm 
>>>>> not even sure where to start looking...
>>>>> 
>>>>> ajs6f



Re: CMS diff: Jena Full Text Search

2019-01-29 Thread Andy Seaborne




On 28/01/2019 21:01, vincent ventresque wrote:

Hi Ajs6f

Thanks for including me in the conversation, but I have to confess I've 
never looked at java classes (I only use command line tools).


Le 28/01/2019 à 21:05, ajs6f a écrit :
On Jan 28, 2019, at 2:57 PM, Chris Tomlinson 
 wrote:


Hi Adam,

I haven’t seen that error. What I’ve done in the past is to replace 
the jena-text doc file with the new contents in Eclipse in an SVN 
checkout of the jena-doc-site and then committed.
I can definitely do that (and will when we're happy with the patch), 
but see below.



Out of curiosity when is it necessary to use the

 [] ja:loadClass "org.apache.jena.tdb.TDB" .

and

 [] ja:loadClass   "org.apache.jena.query.text.TextQuery" .


It's not necessary any more.

The ServiceLoader Jena initialization does this.

The other initialization step should also be unnecessary:

 tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
 tdb:GraphTDB    rdfs:subClassOf  ja:Model .
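
To illustrate the point, a dataset description with those initialization statements left out might look like the sketch below. This is only an assumption-laden example: it relies on the ServiceLoader initialization described above, the resource names are placeholders, `"DB"` is the location used in the existing documentation example, and `<#indexLucene>` is assumed to be defined elsewhere in the same file.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tdb:  <http://jena.hpl.hp.com/2008/tdb#> .
@prefix text: <http://jena.apache.org/text#> .

# No ja:loadClass and no rdfs:subClassOf statements:
# Jena's own initialization is assumed to register TDB and jena-text.

<#text_dataset> rdf:type text:TextDataset ;
    text:dataset <#dataset> ;
    text:index   <#indexLucene> ;    # index description defined elsewhere
    .

# A TDB dataset used for RDF storage
<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "DB" ;
    .
```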



? I do not use them in the config when running fuseki war in tomcat.
I have no idea whatsoever! :grin: I wouldn't have thought them needed 
either.


Vincent-- any comment?

ajs6f



Regards,
Chris




On Jan 28, 2019, at 11:11 AM, ajs6f  wrote:

Recently Vincent offered a nice patch to our text indexing 
documentation, as shown below. Oddly, when I now go to merge it (a 
bit late, sorry!), I get an error: "Can't locate anonymous's tree to 
clone". Is anyone familiar with that? I know very little about the 
SVN-based CMS, so I'm not even sure where to start looking...


ajs6f


Re: CMS diff: Jena Full Text Search

2019-01-28 Thread vincent ventresque

Hi Ajs6f

Thanks for including me in the conversation, but I have to confess I've 
never looked at java classes (I only use command line tools).


Le 28/01/2019 à 21:05, ajs6f a écrit :

On Jan 28, 2019, at 2:57 PM, Chris Tomlinson  
wrote:

Hi Adam,

I haven’t seen that error. What I’ve done in the past is to replace the 
jena-text doc file with the new contents in Eclipse in an SVN checkout of the 
jena-doc-site and then committed.

I can definitely do that (and will when we're happy with the patch), but see 
below.


Out of curiosity when is it necessary to use the

 [] ja:loadClass "org.apache.jena.tdb.TDB" .

and

 [] ja:loadClass   "org.apache.jena.query.text.TextQuery" .

? I do not use them in the config when running fuseki war in tomcat.

I have no idea whatsoever! :grin: I wouldn't have thought them needed either.

Vincent-- any comment?

ajs6f



Regards,
Chris




On Jan 28, 2019, at 11:11 AM, ajs6f  wrote:

Recently Vincent offered a nice patch to our text indexing documentation, as shown below. 
Oddly, when I now go to merge it (a bit late, sorry!), I get an error: "Can't locate 
anonymous's tree to clone". Is anyone familiar with that? I know very little about 
the SVN-based CMS, so I'm not even sure where to start looking...

ajs6f


Re: CMS diff: Jena Full Text Search

2019-01-28 Thread ajs6f


> On Jan 28, 2019, at 2:57 PM, Chris Tomlinson  
> wrote:
> 
> Hi Adam,
> 
> I haven’t seen that error. What I’ve done in the past is to replace the 
> jena-text doc file with the new contents in Eclipse in an SVN checkout of the 
> jena-doc-site and then committed.

I can definitely do that (and will when we're happy with the patch), but see 
below.

> Out of curiosity when is it necessary to use the
> 
> [] ja:loadClass "org.apache.jena.tdb.TDB" .
> 
> and
> 
> [] ja:loadClass   "org.apache.jena.query.text.TextQuery" .
> 
> ? I do not use them in the config when running fuseki war in tomcat.

I have no idea whatsoever! :grin: I wouldn't have thought them needed either.

Vincent-- any comment?

ajs6f


> Regards,
> Chris
> 
> 
> 
>> On Jan 28, 2019, at 11:11 AM, ajs6f  wrote:
>> 
>> Recently Vincent offered a nice patch to our text indexing documentation, as 
>> shown below. Oddly, when I now go to merge it (a bit late, sorry!), I get an 
>> error: "Can't locate anonymous's tree to clone". Is anyone familiar with 
>> that? I know very little about the SVN-based CMS, so I'm not even sure where 
>> to start looking...
>> 
>> ajs6f
> 



Re: CMS diff: Jena Full Text Search

2019-01-28 Thread Chris Tomlinson
Hi Adam,

I haven’t seen that error. What I’ve done in the past is to replace the 
jena-text doc file with the new contents in Eclipse in an SVN checkout of the 
jena-doc-site and then committed.

Out of curiosity when is it necessary to use the

 [] ja:loadClass "org.apache.jena.tdb.TDB" .

and

 [] ja:loadClass   "org.apache.jena.query.text.TextQuery" .

? I do not use them in the config when running fuseki war in tomcat.

Regards,
Chris



> On Jan 28, 2019, at 11:11 AM, ajs6f  wrote:
> 
> Recently Vincent offered a nice patch to our text indexing documentation, as 
> shown below. Oddly, when I now go to merge it (a bit late, sorry!), I get an 
> error: "Can't locate anonymous's tree to clone". Is anyone familiar with 
> that? I know very little about the SVN-based CMS, so I'm not even sure where 
> to start looking...
> 
> ajs6f



Re: CMS diff: Jena Full Text Search

2019-01-28 Thread ajs6f
Recently Vincent offered a nice patch to our text indexing documentation, as 
shown below. Oddly, when I now go to merge it (a bit late, sorry!), I get an 
error: "Can't locate anonymous's tree to clone". Is anyone familiar with that? 
I know very little about the SVN-based CMS, so I'm not even sure where to start 
looking...

ajs6f

> On Jan 23, 2019, at 12:01 PM, vincent.ventres...@ens-lyon.fr 
>  wrote:
> 
> Clone URL (Committers only):
> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
> 
> vincent.ventres...@ens-lyon.fr
> 
> Index: trunk/content/documentation/query/text-query.mdtext
> ===
> --- trunk/content/documentation/query/text-query.mdtext   (revision 
> 1851871)
> +++ trunk/content/documentation/query/text-query.mdtext   (working copy)
> @@ -609,21 +609,47 @@
> index field. More complex setups, with multiple properties per entity
> (URI) are possible.
> 
> +The assembler file can be either default configuration file 
> (.../run/config.ttl)
> +or a custom file in ...run/configuration folder. Note that you can use 
> several files
> +simultaneously.
> +
> +You have to edit the file (see comments in the assembler code below):
> +
> +1. provide values for paths and a fixed URI for tdb:DatasetTDB
> +2. modify the entity map : add the fields you want to index and desired 
> options (filters, tokenizers...)
> +
> +If your assembler file is run/config.ttl, you can index the dataset with 
> this command :
> +
> +java -cp ./fuseki-server.jar jena.textindexer --desc=run/config.ttl
> +
> Once configured, any data added to the text dataset is automatically
> -indexed as well.
> +indexed as well : 
> https://jena.apache.org/documentation/query/text-query.html#building-a-text-index
> 
> +When you change the jena-text in significant ways, such as changing what 
> analyzer 
> +is used for a given property and so on, then you’ll need to rebuild the 
> Lucene index 
> +via reloading the dataset or using the textIndexer.
> +
> ### Text Dataset Assembler
> 
> The following is an example of a TDB dataset with a text index.
> 
> + Example of a TDB dataset and text index#
> +# The main doc sources are:
> +#  - 
> https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html
> +#  - https://jena.apache.org/documentation/assembler/assembler-howto.html
> +#  - https://jena.apache.org/documentation/assembler/assembler.ttl
> +# See https://jena.apache.org/documentation/fuseki2/fuseki-layout.html 
> for the destination of this file.
> +#
> +
> @prefix : .
> @prefix rdf:  .
> @prefix rdfs: .
> @prefix tdb:  .
> @prefix ja:   .
> @prefix text: .
> +@prefix skos: 
> +@prefix fuseki:   .
> 
> -## Example of a TDB dataset and text index
> ## Initialize TDB
> [] ja:loadClass "org.apache.jena.tdb.TDB" .
> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
> @@ -631,39 +657,64 @@
> 
> ## Initialize text query
> [] ja:loadClass   "org.apache.jena.query.text.TextQuery" .
> +
> # A TextDataset is a regular dataset with a text index.
> text:TextDataset  rdfs:subClassOf   ja:RDFDataset .
> +
> # Lucene index
> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
> -# Elasticsearch index
> -text:TextIndexES      rdfs:subClassOf   text:TextIndex .
> 
> +
> ## ---
> -## This URI must be fixed - it's used to assemble the text dataset.
> 
> :text_dataset rdf:type text:TextDataset ;
> -text:dataset   <#dataset> ;
> +text:dataset   :my_dataset ; # <-- 
> replace `:my_dataset` with the desired URI
> text:index <#indexLucene> ;
> -.
> +.
> 
> # A TDB dataset used for RDF storage
> -<#dataset> rdf:type  tdb:DatasetTDB ;
> -tdb:location "DB" ;
> -tdb:unionDefaultGraph true ; # Optional
> -.
> 
> -# Text index description
> +:my_dataset rdf:type  tdb:DatasetTDB ;   # <-- 
> replace `:my_dataset` with the desired URI
> +tdb:location "/tmp/tdb-dataset/" ;   # <-- 
> replace `/tmp/tdb-dataset/` with your path 
> (`.../fuseki/run/databases/MY_DATASET`)
> +#tdb:unionDefaultGraph true ; # Optional
> +.
> +
> +# Text index description (see documentation for other options)
> +
> <#indexLucene> a 

CMS diff: Jena Full Text Search

2019-01-23 Thread vincent . ventresque
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

vincent.ventres...@ens-lyon.fr

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1851871)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -609,21 +609,47 @@
 index field. More complex setups, with multiple properties per entity
 (URI) are possible.
 
+The assembler file can be either the default configuration file (.../run/config.ttl)
+or a custom file in the ...run/configuration folder. Note that you can use several files
+simultaneously.
+
+You have to edit the file (see comments in the assembler code below):
+
+1. provide values for paths and a fixed URI for tdb:DatasetTDB
+2. modify the entity map: add the fields you want to index and desired options (filters, tokenizers...)
+
+If your assembler file is run/config.ttl, you can index the dataset with this command:
+
+java -cp ./fuseki-server.jar jena.textindexer --desc=run/config.ttl
+
 Once configured, any data added to the text dataset is automatically
-indexed as well.
+indexed as well: https://jena.apache.org/documentation/query/text-query.html#building-a-text-index
 
+When you change the jena-text configuration in significant ways, such as changing which analyzer
+is used for a given property, you'll need to rebuild the Lucene index
+by reloading the dataset or by using the textindexer.
+
+
 ### Text Dataset Assembler
 
 The following is an example of a TDB dataset with a text index.
 
+# Example of a TDB dataset and text index
+# The main doc sources are:
+#  - https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html
+#  - https://jena.apache.org/documentation/assembler/assembler-howto.html
+#  - https://jena.apache.org/documentation/assembler/assembler.ttl
+# See https://jena.apache.org/documentation/fuseki2/fuseki-layout.html for the destination of this file.
+#
+
 @prefix : .
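 @prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
 @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
 @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
 @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
 @prefix text:    <http://jena.apache.org/text#> .
+@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
+@prefix fuseki:  <http://jena.apache.org/fuseki#> .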
 
-## Example of a TDB dataset and text index
 ## Initialize TDB
 [] ja:loadClass "org.apache.jena.tdb.TDB" .
 tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
@@ -631,39 +657,64 @@
 
 ## Initialize text query
 [] ja:loadClass   "org.apache.jena.query.text.TextQuery" .
+
 # A TextDataset is a regular dataset with a text index.
 text:TextDataset  rdfs:subClassOf   ja:RDFDataset .
+
 # Lucene index
 text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
-# Elasticsearch index
-text:TextIndexES      rdfs:subClassOf   text:TextIndex .
 
+
 ## ---
-## This URI must be fixed - it's used to assemble the text dataset.
 
 :text_dataset rdf:type text:TextDataset ;
-text:dataset   <#dataset> ;
+text:dataset   :my_dataset ; # <-- replace `:my_dataset` with the desired URI
 text:index <#indexLucene> ;
-.
+.
 
 # A TDB dataset used for RDF storage
-<#dataset> rdf:type  tdb:DatasetTDB ;
-tdb:location "DB" ;
-tdb:unionDefaultGraph true ; # Optional
-.
 
-# Text index description
+:my_dataset rdf:type  tdb:DatasetTDB ;   # <-- replace `:my_dataset` with the desired URI
+tdb:location "/tmp/tdb-dataset/" ;   # <-- replace `/tmp/tdb-dataset/` with your path (`.../fuseki/run/databases/MY_DATASET`)
+#tdb:unionDefaultGraph true ; # Optional
+.
+
+# Text index description (see documentation for other options)
+
 <#indexLucene> a text:TextIndexLucene ;
-text:directory  ;
+text:directory  ;# <-- replace 
` with your path` 
(``)
 text:entityMap <#entMap> ;
-text:storeValues true ; 
+text:storeValues true ;
 text:analyzer [ a text:StandardAnalyzer ] ;
 text:queryAnalyzer [ a text:KeywordAnalyzer ] ;
 text:queryParser text:AnalyzingQueryParser ;
-text:defineAnalyzers [ . . . ] ;
 text:multilingualSupport true ;
- .
+.
 
+# Entity map (see documentation for other options)
+
+<#entMap> a text:EntityMap ;
+text:defaultField 

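The archived message is cut off at the entity map. Purely as a sketch, and not a reconstruction of the missing patch text, an entity map covering more than one property per entity (the "more complex setups" the documentation mentions) could look like the following. The field names `label` and `prefLabel` and the choice of `rdfs:label` and `skos:prefLabel` are illustrative assumptions.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix text: <http://jena.apache.org/text#> .

# Entity map indexing two properties per entity (URI).
<#entMap> a text:EntityMap ;
    text:entityField  "uri" ;
    text:defaultField "label" ;      # must name one of the fields mapped below
    text:map (
        [ text:field "label" ;     text:predicate rdfs:label ]
        [ text:field "prefLabel" ; text:predicate skos:prefLabel ]
    ) .
```

A search over the non-default field would then need to name `skos:prefLabel` explicitly as the `property` argument of `text:query`.
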
Re: CMS diff: Jena Full Text Search

2018-01-29 Thread Andy Seaborne

Done.

I think all the Unicode characters got into the final result correctly.

Andy

On 28/01/18 16:45, Chris Tomlinson wrote:

Andy,

Yes I was just word-smithing and tweaking punctuation.

Thanks,
Chris



On Jan 28, 2018, at 10:30 AM, Andy Seaborne  wrote:

Chris,

There are 3 diffs a few minutes apart.

Is this 3rd CMS diff the right one to apply, and does it supersede the 
previous ones?

Andy

On 22/01/18 03:15, Chris Tomlinson wrote:

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
Chris Tomlinson
Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1821823)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
  Title: Jena Full Text Search
  +Title: Jena Full Text Search
+
  This extension to ARQ combines SPARQL and full text search via
  [Lucene](https://lucene.apache.org) 6.4.1 or
  [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -231,7 +233,7 @@
The most general form is:
 - (?s ?score ?literal ?g) text:query (property 'query string' limit 
'lang:xx')
+ ( ?s ?score ?literal ?g ) text:query ( property 'query string' limit 
'lang:xx' 'highlight:yy' )
 Input arguments:
  @@ -241,13 +243,13 @@
  | query string  | Lucene query string fragment   |
  | limit | (optional) `int` limit on the number of results   |
  | lang:xx   | (optional) language tag spec   |
-| highlight:xx  | (optional) highlighting options|
+| highlight:yy  | (optional) highlighting options|
The `property` URI is only necessary if multiple properties have been
  indexed and the property being searched over is not the [default field
  of the index](#entity-map-definition).
  -The `query string` syntax conforms the underlying index 
[Lucene](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+The `query string` syntax conforms to the underlying index 
[Lucene](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
  or
  
[Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
 In the case of Lucene the syntax is restricted to `Terms`, `Term modifiers`, 
`Boolean Operators` applied to `Terms`, and `Grouping` of terms. _No use of 
`Fields` within the `query string` is supported._
  @@ -258,9 +260,9 @@
  indexed with the tag _xx_. Searches may be restricted to field values with no
  language tag via `"lang:none"`.
  -The `highlight:xx` specification is an optional string where _xx_ are 
options that control the highlighting of search result literals. See 
[below](#highlighting) for details.
+The `highlight:yy` specification is an optional string where _yy_ are options 
that control the highlighting of search result literals. See 
[below](#highlighting) for details.
  -If both `limit` and one or more of `lang:xx` or `highlight:xx` are present, 
then `limit` must precede these arguments.
+If both `limit` and one or more of `lang:xx` or `highlight:yy` are present, 
then `limit` must precede these arguments.
If only the query string is required, the surrounding `( )` _may be_ 
omitted.
  @@ -499,7 +501,7 @@
 Highlighting
  -The highlighting option uses the Lucene `Highlighter` and 
`SimpleHTMLFormatter` to insert highlighting markup into the literals returned 
from search results (hence the text dataset must be configured to store the 
literals). The highlighted results are returned via the _literal_ output 
argument.
+The highlighting option uses the Lucene `Highlighter` and 
`SimpleHTMLFormatter` to insert highlighting markup into the literals returned 
from search results (hence the text dataset must be configured to store the 
literals). The highlighted results are returned via the _literal_ output 
argument. This highlighting feature, introduced in version 3.7.0, does not 
require re-indexing by Lucene.
The simplest way to request highlighting is via `'highlight:'`. This will 
apply all the defaults:
  @@ -521,7 +523,7 @@
"the quick ↦brown fox↤ jumped over the lazy baboon"
  -The `RIGHT_ARROW` is Unicode \u21a6 and the `LEFT_ARROW` is Unicode \u21a4. 
These are chosen to be single characters that in most situations will be very 
unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be 
large enough that in many situations the matches will result in single 
fragments. If the literal is larger than 128 characters and there are several 
matches in the literal then there may be additional fragments separated by the 
`DIVIDES`, Unicode \u2223.
+The `RIGHT_ARROW` is Unicode, \u21a6, and the `LEFT_ARROW` 

Re: CMS diff: Jena Full Text Search

2018-01-28 Thread Chris Tomlinson
Andy,

Yes I was just word-smithing and tweaking punctuation.

Thanks,
Chris


> On Jan 28, 2018, at 10:30 AM, Andy Seaborne  wrote:
> 
> Chris,
> 
> There are 3 diffs a few minutes apart.
> 
> Is this 3rd CMS diff the right one to apply, and does it supersede the 
> previous ones?
> 
>Andy
> 
> On 22/01/18 03:15, Chris Tomlinson wrote:
>> Clone URL (Committers only):
>> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
>> Chris Tomlinson
>> Index: trunk/content/documentation/query/text-query.mdtext
>> ===
>> --- trunk/content/documentation/query/text-query.mdtext  (revision 
>> 1821823)
>> +++ trunk/content/documentation/query/text-query.mdtext  (working copy)
>> @@ -1,5 +1,7 @@
>>  Title: Jena Full Text Search
>>  +Title: Jena Full Text Search
>> +
>>  This extension to ARQ combines SPARQL and full text search via
>>  [Lucene](https://lucene.apache.org) 6.4.1 or
>>  [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
>> @@ -231,7 +233,7 @@
>>The most general form is:
>> - (?s ?score ?literal ?g) text:query (property 'query string' limit 
>> 'lang:xx')
>> + ( ?s ?score ?literal ?g ) text:query ( property 'query string' limit 
>> 'lang:xx' 'highlight:yy' )
>> Input arguments:
>>  @@ -241,13 +243,13 @@
>>  | query string  | Lucene query string fragment   |
>>  | limit | (optional) `int` limit on the number of results   
>> |
>>  | lang:xx   | (optional) language tag spec   |
>> -| highlight:xx  | (optional) highlighting options|
>> +| highlight:yy  | (optional) highlighting options|
>>The `property` URI is only necessary if multiple properties have been
>>  indexed and the property being searched over is not the [default field
>>  of the index](#entity-map-definition).
>>  -The `query string` syntax conforms the underlying index 
>> [Lucene](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
>> +The `query string` syntax conforms to the underlying index 
>> [Lucene](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
>>  or
>>  
>> [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
>>  In the case of Lucene the syntax is restricted to `Terms`, `Term 
>> modifiers`, `Boolean Operators` applied to `Terms`, and `Grouping` of terms. 
>> _No use of `Fields` within the `query string` is supported._
>>  @@ -258,9 +260,9 @@
>>  indexed with the tag _xx_. Searches may be restricted to field values with 
>> no
>>  language tag via `"lang:none"`.
>>  -The `highlight:xx` specification is an optional string where _xx_ are 
>> options that control the highlighting of search result literals. See 
>> [below](#highlighting) for details.
>> +The `highlight:yy` specification is an optional string where _yy_ are 
>> options that control the highlighting of search result literals. See 
>> [below](#highlighting) for details.
>>  -If both `limit` and one or more of `lang:xx` or `highlight:xx` are 
>> present, then `limit` must precede these arguments.
>> +If both `limit` and one or more of `lang:xx` or `highlight:yy` are present, 
>> then `limit` must precede these arguments.
>>If only the query string is required, the surrounding `( )` _may be_ 
>> omitted.
>>  @@ -499,7 +501,7 @@
>> Highlighting
>>  -The highlighting option uses the Lucene `Highlighter` and 
>> `SimpleHTMLFormatter` to insert highlighting markup into the literals 
>> returned from search results (hence the text dataset must be configured to 
>> store the literals). The highlighted results are returned via the _literal_ 
>> output argument.
>> +The highlighting option uses the Lucene `Highlighter` and 
>> `SimpleHTMLFormatter` to insert highlighting markup into the literals 
>> returned from search results (hence the text dataset must be configured to 
>> store the literals). The highlighted results are returned via the _literal_ 
>> output argument. This highlighting feature, introduced in version 3.7.0, 
>> does not require re-indexing by Lucene.
>>The simplest way to request highlighting is via `'highlight:'`. This will 
>> apply all the defaults:
>>  @@ -521,7 +523,7 @@
>>"the quick ↦brown fox↤ jumped over the lazy baboon"
>>  -The `RIGHT_ARROW` is Unicode \u21a6 and the `LEFT_ARROW` is Unicode 
>> \u21a4. These are chosen to be single characters that in most situations 
>> will be very unlikely to occur in resulting literals. The `fragSize` of 128 
>> is chosen to be large enough that in many situations the matches will result 
>> in single fragments. If the literal is larger than 128 characters and there 
>> are several matches in the literal then there may be additional 

Re: CMS diff: Jena Full Text Search

2018-01-28 Thread Andy Seaborne

Chris,

There are 3 diffs a few minutes apart.

Is this 3rd CMS diff the right one to apply, and does it supersede 
the previous ones?


Andy

On 22/01/18 03:15, Chris Tomlinson wrote:

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1821823)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
  Title: Jena Full Text Search
  
+Title: Jena Full Text Search

+
  This extension to ARQ combines SPARQL and full text search via
  [Lucene](https://lucene.apache.org) 6.4.1 or
  [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -231,7 +233,7 @@
  
  The most general form is:
 
- (?s ?score ?literal ?g) text:query (property 'query string' limit 'lang:xx')

+ ( ?s ?score ?literal ?g ) text:query ( property 'query string' limit 
'lang:xx' 'highlight:yy' )
  
   Input arguments:
  
@@ -241,13 +243,13 @@

  | query string  | Lucene query string fragment   |
  | limit | (optional) `int` limit on the number of results   |
  | lang:xx   | (optional) language tag spec   |
-| highlight:xx  | (optional) highlighting options|
+| highlight:yy  | (optional) highlighting options|
  
  The `property` URI is only necessary if multiple properties have been

  indexed and the property being searched over is not the [default field
  of the index](#entity-map-definition).
  
-The `query string` syntax conforms the underlying index [Lucene](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)

+The `query string` syntax conforms to the underlying index 
[Lucene](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
  or
  
[Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
 In the case of Lucene the syntax is restricted to `Terms`, `Term modifiers`, 
`Boolean Operators` applied to `Terms`, and `Grouping` of terms. _No use of 
`Fields` within the `query string` is supported._
  
@@ -258,9 +260,9 @@

  indexed with the tag _xx_. Searches may be restricted to field values with no
  language tag via `"lang:none"`.
  
-The `highlight:xx` specification is an optional string where _xx_ are options that control the highlighting of search result literals. See [below](#highlighting) for details.

+The `highlight:yy` specification is an optional string where _yy_ are options 
that control the highlighting of search result literals. See 
[below](#highlighting) for details.
  
-If both `limit` and one or more of `lang:xx` or `highlight:xx` are present, then `limit` must precede these arguments.

+If both `limit` and one or more of `lang:xx` or `highlight:yy` are present, 
then `limit` must precede these arguments.
  
  If only the query string is required, the surrounding `( )` _may be_ omitted.
  
@@ -499,7 +501,7 @@
  
   Highlighting
  
-The highlighting option uses the Lucene `Highlighter` and `SimpleHTMLFormatter` to insert highlighting markup into the literals returned from search results (hence the text dataset must be configured to store the literals). The highlighted results are returned via the _literal_ output argument.

+The highlighting option uses the Lucene `Highlighter` and 
`SimpleHTMLFormatter` to insert highlighting markup into the literals returned 
from search results (hence the text dataset must be configured to store the 
literals). The highlighted results are returned via the _literal_ output 
argument. This highlighting feature, introduced in version 3.7.0, does not 
require re-indexing by Lucene.
  
  The simplest way to request highlighting is via `'highlight:'`. This will apply all the defaults:
  
@@ -521,7 +523,7 @@
  
  "the quick ↦brown fox↤ jumped over the lazy baboon"
  
-The `RIGHT_ARROW` is Unicode \u21a6 and the `LEFT_ARROW` is Unicode \u21a4. These are chosen to be single characters that in most situations will be very unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be large enough that in many situations the matches will result in single fragments. If the literal is larger than 128 characters and there are several matches in the literal then there may be additional fragments separated by the `DIVIDES`, Unicode \u2223.

+The `RIGHT_ARROW` is Unicode, \u21a6, and the `LEFT_ARROW` is Unicode, \u21a4. 
These are chosen to be single characters that in most situations will be very 
unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be 
large enough that in many situations the matches will result in single 
fragments. If the 

CMS diff: Jena Full Text Search

2018-01-21 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1821823)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
 Title: Jena Full Text Search
 
+Title: Jena Full Text Search
+
 This extension to ARQ combines SPARQL and full text search via
 [Lucene](https://lucene.apache.org) 6.4.1 or
 [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -231,7 +233,7 @@
 
 The most general form is:

- (?s ?score ?literal ?g) text:query (property 'query string' limit 
'lang:xx')
+ ( ?s ?score ?literal ?g ) text:query ( property 'query string' limit 
'lang:xx' 'highlight:yy' )
 
  Input arguments:
 
@@ -241,13 +243,13 @@
 | query string  | Lucene query string fragment   |
 | limit | (optional) `int` limit on the number of results   |
 | lang:xx   | (optional) language tag spec   |
-| highlight:xx  | (optional) highlighting options|
+| highlight:yy  | (optional) highlighting options|
 
 The `property` URI is only necessary if multiple properties have been
 indexed and the property being searched over is not the [default field
 of the index](#entity-map-definition).
 
-The `query string` syntax conforms the underlying index 
[Lucene](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+The `query string` syntax conforms to the underlying index 
[Lucene](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
 or
 
[Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
 In the case of Lucene the syntax is restricted to `Terms`, `Term modifiers`, 
`Boolean Operators` applied to `Terms`, and `Grouping` of terms. _No use of 
`Fields` within the `query string` is supported._
 
@@ -258,9 +260,9 @@
 indexed with the tag _xx_. Searches may be restricted to field values with no 
 language tag via `"lang:none"`. 
 
-The `highlight:xx` specification is an optional string where _xx_ are options 
that control the highlighting of search result literals. See 
[below](#highlighting) for details.
+The `highlight:yy` specification is an optional string where _yy_ are options 
that control the highlighting of search result literals. See 
[below](#highlighting) for details.
 
-If both `limit` and one or more of `lang:xx` or `highlight:xx` are present, 
then `limit` must precede these arguments.
+If both `limit` and one or more of `lang:xx` or `highlight:yy` are present, 
then `limit` must precede these arguments.
 
 If only the query string is required, the surrounding `( )` _may be_ omitted.
 
@@ -499,7 +501,7 @@
 
  Highlighting
 
-The highlighting option uses the Lucene `Highlighter` and 
`SimpleHTMLFormatter` to insert highlighting markup into the literals returned 
from search results (hence the text dataset must be configured to store the 
literals). The highlighted results are returned via the _literal_ output 
argument.
+The highlighting option uses the Lucene `Highlighter` and 
`SimpleHTMLFormatter` to insert highlighting markup into the literals returned 
from search results (hence the text dataset must be configured to store the 
literals). The highlighted results are returned via the _literal_ output 
argument. This highlighting feature, introduced in version 3.7.0, does not 
require re-indexing by Lucene. 
 
 The simplest way to request highlighting is via `'highlight:'`. This will 
apply all the defaults:
 
@@ -521,7 +523,7 @@
 
 "the quick ↦brown fox↤ jumped over the lazy baboon"
 
-The `RIGHT_ARROW` is Unicode \u21a6 and the `LEFT_ARROW` is Unicode \u21a4. 
These are chosen to be single characters that in most situations will be very 
unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be 
large enough that in many situations the matches will result in single 
fragments. If the literal is larger than 128 characters and there are several 
matches in the literal then there may be additional fragments separated by the 
`DIVIDES`, Unicode \u2223.
+The `RIGHT_ARROW` is Unicode, \u21a6, and the `LEFT_ARROW` is Unicode, \u21a4. 
These are chosen to be single characters that in most situations will be very 
unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be 
large enough that in many situations the matches will result in single 
fragments. If the literal is larger than 128 characters and there are several 
matches in the literal then there may be additional fragments separated by the 
`DIVIDES`, Unicode, \u2223.
 
 Depending on the analyzer used and the tokenizer, 

CMS diff: Jena Full Text Search

2018-01-21 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1821823)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
 Title: Jena Full Text Search
 
+Title: Jena Full Text Search
+
 This extension to ARQ combines SPARQL and full text search via
 [Lucene](https://lucene.apache.org) 6.4.1 or
 [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -231,7 +233,7 @@
 
 The most general form is:

- (?s ?score ?literal ?g) text:query (property 'query string' limit 
'lang:xx')
+ ( ?s ?score ?literal ?g ) text:query ( property 'query string' limit 
'lang:xx' 'highlight:yy' )
 
  Input arguments:
 
@@ -241,13 +243,13 @@
 | query string  | Lucene query string fragment   |
 | limit | (optional) `int` limit on the number of results   |
 | lang:xx   | (optional) language tag spec   |
-| highlight:xx  | (optional) highlighting options|
+| highlight:yy  | (optional) highlighting options|
 
 The `property` URI is only necessary if multiple properties have been
 indexed and the property being searched over is not the [default field
 of the index](#entity-map-definition).
 
-The `query string` syntax conforms the underlying index 
[Lucene](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+The `query string` syntax conforms to the underlying index 
[Lucene](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
 or
 
[Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
 In the case of Lucene the syntax is restricted to `Terms`, `Term modifiers`, 
`Boolean Operators` applied to `Terms`, and `Grouping` of terms. _No use of 
`Fields` within the `query string` is supported._
 
@@ -258,9 +260,9 @@
 indexed with the tag _xx_. Searches may be restricted to field values with no 
 language tag via `"lang:none"`. 
 
-The `highlight:xx` specification is an optional string where _xx_ are options 
that control the highlighting of search result literals. See 
[below](#highlighting) for details.
+The `highlight:yy` specification is an optional string where _yy_ are options 
that control the highlighting of search result literals. See 
[below](#highlighting) for details.
 
-If both `limit` and one or more of `lang:xx` or `highlight:xx` are present, 
then `limit` must precede these arguments.
+If both `limit` and one or more of `lang:xx` or `highlight:yy` are present, 
then `limit` must precede these arguments.
 
 If only the query string is required, the surrounding `( )` _may be_ omitted.
 
@@ -521,7 +523,7 @@
 
 "the quick ↦brown fox↤ jumped over the lazy baboon"
 
-The `RIGHT_ARROW` is Unicode \u21a6 and the `LEFT_ARROW` is Unicode \u21a4. 
These are chosen to be single characters that in most situations will be very 
unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be 
large enough that in many situations the matches will result in single 
fragments. If the literal is larger than 128 characters and there are several 
matches in the literal then there may be additional fragments separated by the 
`DIVIDES`, Unicode \u2223.
+The `RIGHT_ARROW` is Unicode, \u21a6, and the `LEFT_ARROW` is Unicode, \u21a4. 
These are chosen to be single characters that in most situations will be very 
unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be 
large enough that in many situations the matches will result in single 
fragments. If the literal is larger than 128 characters and there are several 
matches in the literal then there may be additional fragments separated by the 
`DIVIDES`, Unicode, \u2223.
 
 Depending on the analyzer used and the tokenizer, the highlighting will result 
in marking each token rather than an entire phrase. The `joinHi` option is by 
default `true` so that entire phrases are highlighted together rather than as 
individual tokens as in:
 



CMS diff: Jena Full Text Search

2018-01-21 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1821823)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
 Title: Jena Full Text Search
 
+Title: Jena Full Text Search
+
 This extension to ARQ combines SPARQL and full text search via
 [Lucene](https://lucene.apache.org) 6.4.1 or
 [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -231,7 +233,7 @@
 
 The most general form is:

- (?s ?score ?literal ?g) text:query (property 'query string' limit 
'lang:xx')
+ ( ?s ?score ?literal ?g ) text:query ( property 'query string' limit 
'lang:xx' 'highlight:yy' )
 
  Input arguments:
 
@@ -241,7 +243,7 @@
 | query string  | Lucene query string fragment   |
 | limit | (optional) `int` limit on the number of results   |
 | lang:xx   | (optional) language tag spec   |
-| highlight:xx  | (optional) highlighting options|
+| highlight:yy  | (optional) highlighting options|
 
 The `property` URI is only necessary if multiple properties have been
 indexed and the property being searched over is not the [default field
@@ -258,9 +260,9 @@
 indexed with the tag _xx_. Searches may be restricted to field values with no 
 language tag via `"lang:none"`. 
 
-The `highlight:xx` specification is an optional string where _xx_ are options 
that control the highlighting of search result literals. See 
[below](#highlighting) for details.
+The `highlight:yy` specification is an optional string where _yy_ are options 
that control the highlighting of search result literals. See 
[below](#highlighting) for details.
 
-If both `limit` and one or more of `lang:xx` or `highlight:xx` are present, 
then `limit` must precede these arguments.
+If both `limit` and one or more of `lang:xx` or `highlight:yy` are present, 
then `limit` must precede these arguments.
 
 If only the query string is required, the surrounding `( )` _may be_ omitted.
 
@@ -521,7 +523,7 @@
 
 "the quick ↦brown fox↤ jumped over the lazy baboon"
 
-The `RIGHT_ARROW` is Unicode \u21a6 and the `LEFT_ARROW` is Unicode \u21a4. 
These are chosen to be single characters that in most situations will be very 
unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be 
large enough that in many situations the matches will result in single 
fragments. If the literal is larger than 128 characters and there are several 
matches in the literal then there may be additional fragments separated by the 
`DIVIDES`, Unicode \u2223.
+The `RIGHT_ARROW` is Unicode, \u21a6, and the `LEFT_ARROW` is Unicode, \u21a4. 
These are chosen to be single characters that in most situations will be very 
unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be 
large enough that in many situations the matches will result in single 
fragments. If the literal is larger than 128 characters and there are several 
matches in the literal then there may be additional fragments separated by the 
`DIVIDES`, Unicode, \u2223.
 
 Depending on the analyzer used and the tokenizer, the highlighting will result 
in marking each token rather than an entire phrase. The `joinHi` option is by 
default `true` so that entire phrases are highlighted together rather than as 
individual tokens as in:
 



Re: CMS diff: Jena Full Text Search

2018-01-21 Thread Andy Seaborne

Thank you!
Changes applied.

Andy

On 20/01/18 21:21, Chris Tomlinson wrote:

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1821724)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -2,6 +2,8 @@
  
  Title: Jena Full Text Search
  
+Title: Jena Full Text Search

+
  This extension to ARQ combines SPARQL and full text search via
  [Lucene](https://lucene.apache.org) 6.4.1 or
  [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -80,6 +82,7 @@
  -   [Queries with graphs](#queries-with-graphs)
  -   [Queries across multiple 
`Fields`](#queries-across-multiple-fields)
  -   [Queries with _Boolean Operators_ and _Term 
Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
+-   [Highlighting](#highlighting)
  -   [Good practice](#good-practice)
  -   [Configuration](#configuration)
  -   [Text Dataset Assembler](#text-dataset-assembler)
@@ -242,6 +245,7 @@
  | query string  | Lucene query string fragment   |
  | limit | (optional) `int` limit on the number of results   |
  | lang:xx   | (optional) language tag spec   |
+| highlight:xx  | (optional) highlighting options|
  
  The `property` URI is only necessary if multiple properties have been

  indexed and the property being searched over is not the [default field
@@ -258,8 +262,10 @@
  indexed with the tag _xx_. Searches may be restricted to field values with no
  language tag via `"lang:none"`.
  
-If both `limit` and `lang:xx` are present, then `limit` must precede `lang:xx`.

+The `highlight:xx` specification is an optional string where _xx_ are options 
that control the highlighting of search result literals. See 
[below](#highlighting) for details.
  
+If both `limit` and one or more of `lang:xx` or `highlight:xx` are present, then `limit` must precede these arguments.

+
  If only the query string is required, the surrounding `( )` _may be_ omitted.
  
   Output arguments:

@@ -495,7 +501,52 @@
  **Always surround the query string with `( )` if more than a single term or 
phrase
  are involved.**
  
+ Highlighting
  
+The highlighting option uses the Lucene `Highlighter` and `SimpleHTMLFormatter` to insert highlighting markup into the literals returned from search results (hence the text dataset must be configured to store the literals). The highlighted results are returned via the _literal_ output argument.

+
+The simplest way to request highlighting is via `'highlight:'`. This will 
apply all the defaults:
+
+| Option | Key | Default |
+||-|-|
+| maxFrags | m: | 3 |
+| fragSize | z: |  128 |
+| start | s: | RIGHT_ARROW |
+| end | e: | LEFT_ARROW |
+| fragSep | f: | DIVIDES |
+| joinHi | jh: | true |
+| joinFrags | jf: | true |
+
+to the highlighting of the search results. For example if the query is:
+
+(?s ?sc ?lit) text:query ( "brown fox" "highlight:" )
+
+then a resulting literal binding might be:
+
+"the quick ↦brown fox↤ jumped over the lazy baboon"
+
+The `RIGHT_ARROW` is Unicode \u21a6 and the `LEFT_ARROW` is Unicode \u21a4. 
These are chosen to be single characters that in most situations will be very 
unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be 
large enough that in many situations the matches will result in single 
fragments. If the literal is larger than 128 characters and there are several 
matches in the literal then there may be additional fragments separated by the 
`DIVIDES`, Unicode \u2223.
+
+Depending on the analyzer used and the tokenizer, the highlighting will result 
in marking each token rather than an entire phrase. The `joinHi` option is by 
default `true` so that entire phrases are highlighted together rather than as 
individual tokens as in:
+
+"the quick ↦brown↤ ↦fox↤ jumped over the lazy baboon"
+
+which would result from:
+
+(?s ?sc ?lit) text:query ( "brown fox" "highlight:jh:n" )
+
+The `jh` and `jf` boolean options are set `false` via `n`. Any other value is 
`true`. The defaults for these options have been selected to be reasonable for 
most applications.
+
+The joining is performed post highlighting via Java `String replaceAll` rather 
than using the Lucene Unified Highlighter facility which requires that term 
vectors and positions be stored. The joining deletes _extra_ highlighting with 
only intervening Unicode separators, `\p{Z}`.
+
+The more conventional output of the Lucene `SimpleHTMLFormatter` with html emphasis markup is 
achieved via, `"highlight:s: | e:"` (highlight options 
are separated by a Unicode vertical line, \u007c. The spaces are 
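
For readers of the archive, the option table in the diff above can be made concrete with a small sketch. It is not the jena-text parser: the class `HighlightOptionsSketch`, its `parse` method, and the decision to trim each option and ignore unknown keys are assumptions made for illustration. Only the keys, the defaults, the `|` separator and the rule that `n` means `false` come from the diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Jena: reads a "highlight:..." string
// into a map keyed by the short option keys from the table above.
final class HighlightOptionsSketch {

    static Map<String, String> parse(String spec) {
        // Defaults as listed in the table (markers are U+21A6, U+21A4, U+2223).
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("m", "3");
        opts.put("z", "128");
        opts.put("s", "\u21a6");
        opts.put("e", "\u21a4");
        opts.put("f", "\u2223");
        opts.put("jh", "true");
        opts.put("jf", "true");

        String prefix = "highlight:";
        String body = spec.startsWith(prefix) ? spec.substring(prefix.length()) : spec;

        // Options are separated by a vertical line; surrounding spaces are dropped
        // (assumption: trimming applies to the whole option, value included).
        for (String part : body.split("\\|")) {
            String option = part.trim();
            int colon = option.indexOf(':');
            if (colon <= 0) {
                continue; // empty or malformed option: keep the defaults
            }
            String key = option.substring(0, colon);
            String value = option.substring(colon + 1);
            if (!opts.containsKey(key)) {
                continue; // assumption: unknown keys are ignored
            }
            if (key.equals("jh") || key.equals("jf")) {
                // Boolean options are false only for "n"; any other value is true.
                value = value.equals("n") ? "false" : "true";
            }
            opts.put(key, value);
        }
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(parse("highlight:m:5 | jh:n"));
    }
}
```

With this sketch, `parse("highlight:")` returns the defaults unchanged, which mirrors the statement that a bare `'highlight:'` applies all the defaults.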

CMS diff: Jena Full Text Search

2018-01-20 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1821724)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -2,6 +2,8 @@
 
 Title: Jena Full Text Search
 
+Title: Jena Full Text Search
+
 This extension to ARQ combines SPARQL and full text search via
 [Lucene](https://lucene.apache.org) 6.4.1 or
 [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -80,6 +82,7 @@
 -   [Queries with graphs](#queries-with-graphs)
 -   [Queries across multiple `Fields`](#queries-across-multiple-fields)
 -   [Queries with _Boolean Operators_ and _Term 
Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
+-   [Highlighting](#highlighting)
 -   [Good practice](#good-practice)
 -   [Configuration](#configuration)
 -   [Text Dataset Assembler](#text-dataset-assembler)
@@ -242,6 +245,7 @@
 | query string  | Lucene query string fragment   |
 | limit | (optional) `int` limit on the number of results   |
 | lang:xx   | (optional) language tag spec   |
+| highlight:xx  | (optional) highlighting options|
 
 The `property` URI is only necessary if multiple properties have been
 indexed and the property being searched over is not the [default field
@@ -258,8 +262,10 @@
 indexed with the tag _xx_. Searches may be restricted to field values with no 
 language tag via `"lang:none"`. 
 
-If both `limit` and `lang:xx` are present, then `limit` must precede `lang:xx`.
+The `highlight:xx` specification is an optional string where _xx_ are options 
that control the highlighting of search result literals. See 
[below](#highlighting) for details.
 
+If both `limit` and one or more of `lang:xx` or `highlight:xx` are present, 
then `limit` must precede these arguments.
+
 If only the query string is required, the surrounding `( )` _may be_ omitted.
 
  Output arguments:
@@ -495,7 +501,52 @@
 **Always surround the query string with `( )` if more than a single term or 
phrase
 are involved.**
 
+ Highlighting
 
+The highlighting option uses the Lucene `Highlighter` and 
`SimpleHTMLFormatter` to insert highlighting markup into the literals returned 
from search results (hence the text dataset must be configured to store the 
literals). The highlighted results are returned via the _literal_ output 
argument.
+
+The simplest way to request highlighting is via `'highlight:'`. This will 
apply all the defaults:
+
+| Option | Key | Default |
+||-|-|
+| maxFrags | m: | 3 |
+| fragSize | z: |  128 |
+| start | s: | RIGHT_ARROW |
+| end | e: | LEFT_ARROW |
+| fragSep | f: | DIVIDES |
+| joinHi | jh: | true |
+| joinFrags | jf: | true |
+
+to the highlighting of the search results. For example if the query is:
+
+(?s ?sc ?lit) text:query ( "brown fox" "highlight:" ) 
+
+then a resulting literal binding might be:
+
+"the quick ↦brown fox↤ jumped over the lazy baboon"
+
+The `RIGHT_ARROW` is Unicode \u21a6 and the `LEFT_ARROW` is Unicode \u21a4. 
These are chosen to be single characters that in most situations will be very 
unlikely to occur in resulting literals. The `fragSize` of 128 is chosen to be 
large enough that in many situations the matches will result in single 
fragments. If the literal is larger than 128 characters and there are several 
matches in the literal then there may be additional fragments separated by the 
`DIVIDES`, Unicode \u2223.
+
+Depending on the analyzer used and the tokenizer, the highlighting will result 
in marking each token rather than an entire phrase. The `joinHi` option is by 
default `true` so that entire phrases are highlighted together rather than as 
individual tokens as in:
+
+"the quick ↦brown↤ ↦fox↤ jumped over the lazy baboon"
+
+which would result from:
+
+(?s ?sc ?lit) text:query ( "brown fox" "highlight:jh:n" )
+
+The `jh` and `jf` boolean options are set `false` via `n`. Any other value is 
`true`. The defaults for these options have been selected to be reasonable for 
most applications.
+
+The joining is performed post highlighting via Java `String replaceAll` rather 
than using the Lucene Unified Highlighter facility which requires that term 
vectors and positions be stored. The joining deletes _extra_ highlighting with 
only intervening Unicode separators, `\p{Z}`.
+
+The more conventional output of the Lucene `SimpleHTMLFormatter` with html 
emphasis markup is achieved via, `"highlight:s: | e:"` 
(highlight options are separated by a Unicode vertical line, \u007c. The spaces 
are not necessary). The result with the above example will be:
+
+"the quick brown fox jumped over the 
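
The `joinHi` step described in the diff above (joining after highlighting via `String replaceAll`, removing markers that have only `\p{Z}` separators between them) can be sketched as follows. This is a minimal illustration under stated assumptions, not the jena-text source: the class `HighlightJoinSketch`, the method `joinHighlights` and the exact regular expression are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not Jena code: merges adjacent highlighted tokens
// into one highlighted phrase after the highlighter has run.
final class HighlightJoinSketch {

    static String joinHighlights(String text, String start, String end) {
        // Match an end marker, one or more Unicode separators, then a start marker.
        // The markers are quoted so that markup such as "<em>" is taken literally.
        String regex = Pattern.quote(end) + "(\\p{Z}+)" + Pattern.quote(start);
        // Keep only the separators, dropping the two inner markers.
        return text.replaceAll(regex, "$1");
    }

    public static void main(String[] args) {
        // U+21A6 and U+21A4 are the default start and end markers.
        String perToken = "the quick \u21a6brown\u21a4 \u21a6fox\u21a4 jumped over the lazy baboon";
        System.out.println(joinHighlights(perToken, "\u21a6", "\u21a4"));
    }
}
```

Because the pattern only accepts separators between the two markers, highlighted tokens with an unhighlighted word between them are left as separate highlights.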

CMS diff: Jena Full Text Search

2018-01-03 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1819918)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -228,10 +228,11 @@
 ?s text:query (rdfs:label 'protégé' 'lang:fr')# restrict search to 
French
 (?s ?score) text:query 'word' # query capturing also 
the score
 (?s ?score ?literal) text:query 'word'# ... and original 
literal value
+(?s ?score ?literal ?g) text:query 'word' # ... and the graph
 
 The most general form is:

- (?s ?score ?literal) text:query (property 'query string' limit 'lang:xx')
+ (?s ?score ?literal ?g) text:query (property 'query string' limit 
'lang:xx')
 
  Input arguments:
 
@@ -268,17 +269,15 @@
 | subject URI   | The subject of the indexed RDF triple.  |
 | score | (optional) The score for the match. |
 | literal   | (optional) The matched object literal. |
+| graph URI | (optional) The graph URI of the triple. |
 
 The results include the _subject URI_; the _score_ assigned by the
 text search engine; and the entire matched _literal_ (if the index has
 been [configured to store literal values](#text-dataset-assembler)).
 The _subject URI_ may be a variable, e.g., `?s`, or a _URI_. In the
 latter case the search is restricted to triples with the specified
-subject. The _score_ and the _literal_ **must** be variables.
+subject. The _score_, _literal_ and _graph URI_ **must** be variables.
 
-If only the _subject_ variable, `?s` is needed then it **must be** written 
without 
-surrounding `( )`; otherwise, an error is signalled.
-
 ### Query strings
 
 There are several points that need to be considered when formulating
@@ -401,6 +400,17 @@
   graph ?g { ?s a ex:Item } .
 }
 
+Further, if `tdb:unionDefaultGraph true` for a TDB dataset backing a Lucene 
index then it is possible to retrieve the graphs that contain triples resulting 
from a Lucene search via the fourth output argument to `text:query`:
+
+select ?g ?s ?lit
+where {
+  (?s ?sc ?lit ?g) text:query "zorn" .
+}
+
+This will generally perform much better than either of the previous approaches 
when there are
+large numbers of graphs since the Lucene search will run once and the returned 
_documents_ carry
+the containing graph URI for free as it were.
+
  Queries across multiple `Field`s
 
 As mentioned earlier, the text index uses the



Re: CMS diff: Jena Full Text Search

2017-12-10 Thread Andy Seaborne

Done - thanks

Andy

On 09/12/17 17:48, Chris Tomlinson wrote:

Hello,

This commit, against the staging version of the jena-text doc, corrects the documentation 
to reflect the fix for JENA-1439 (graph queries fail with 'lang:xx').

Thank you,
Chris



On Dec 9, 2017, at 5:45 PM, Chris Tomlinson  wrote:

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1817587)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -391,11 +391,6 @@
will iterate over the graphs in the dataset, searching each in turn for
matches.

-Note that there is a known issue when a `lang:xx` argument is included in
-the above pattern, so that the restriction to given language is not obeyed.
-This will be corrected in a future release. However, use of a language tag
-on the `query string` is not subject to this issue.
-
If there is suitable structure to the graphs, e.g., a known `rdf:type` and
depending on the selectivity of the text query and number of graphs,
it may be more performant to express the query as follows:
@@ -406,9 +401,6 @@
   graph ?g { ?s a ex:Item } .
 }

-Note that this form does not have any issue with `lang:xx` as described
-above, since the graph is extracted after the text search.
-
 Queries across multiple `Field`s

As mentioned earlier, the text index uses the






Re: CMS diff: Jena Full Text Search

2017-12-09 Thread Chris Tomlinson
Hello,

This commit, against the staging version of the jena-text doc, corrects the 
documentation to reflect the fix for JENA-1439 (graph queries fail with 'lang:xx').

Thank you,
Chris


> On Dec 9, 2017, at 5:45 PM, Chris Tomlinson  wrote:
> 
> Clone URL (Committers only):
> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
> 
> Chris Tomlinson
> 
> Index: trunk/content/documentation/query/text-query.mdtext
> ===
> --- trunk/content/documentation/query/text-query.mdtext   (revision 1817587)
> +++ trunk/content/documentation/query/text-query.mdtext   (working copy)
> @@ -391,11 +391,6 @@
> will iterate over the graphs in the dataset, searching each in turn for
> matches.
> 
> -Note that there is a known issue when a `lang:xx` argument is included in
> -the above pattern, so that the restriction to given language is not obeyed. 
> -This will be corrected in a future release. However, use of a language tag
> -on the `query string` is not subject to this issue.
> -
> If there is suitable structure to the graphs, e.g., a known `rdf:type` and
> depending on the selectivity of the text query and number of graphs, 
> it may be more performant to express the query as follows:
> @@ -406,9 +401,6 @@
>   graph ?g { ?s a ex:Item } .
> }
> 
> -Note that this form does not have any issue with `lang:xx` as described
> -above, since the graph is extracted after the text search.
> -
>  Queries across multiple `Field`s
> 
> As mentioned earlier, the text index uses the
> 



CMS diff: Jena Full Text Search

2017-12-09 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1817587)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -391,11 +391,6 @@
 will iterate over the graphs in the dataset, searching each in turn for
 matches.
 
-Note that there is a known issue when a `lang:xx` argument is included in
-the above pattern, so that the restriction to given language is not obeyed. 
-This will be corrected in a future release. However, use of a language tag
-on the `query string` is not subject to this issue.
-
 If there is suitable structure to the graphs, e.g., a known `rdf:type` and
 depending on the selectivity of the text query and number of graphs, 
 it may be more performant to express the query as follows:
@@ -406,9 +401,6 @@
   graph ?g { ?s a ex:Item } .
 }
 
-Note that this form does not have any issue with `lang:xx` as described
-above, since the graph is extracted after the text search.
-
  Queries across multiple `Field`s
 
 As mentioned earlier, the text index uses the
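
For reference, the alternative form that the patch context describes (text search first, graph extracted afterwards) can be sketched as a complete query. This is only an illustration under stated assumptions: the `ex:` namespace URI and the `ex:Item` class are hypothetical, `rdfs:label` is assumed to be mapped in the entity map, and the `'lang:en'` argument is included only to show where a language restriction would go.

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/ns#>   # hypothetical namespace

# Run the text search once, restricted to English literals,
# then find the graph(s) that type the matched subject as ex:Item.
SELECT ?g ?s ?lbl
WHERE {
  ?s text:query (rdfs:label 'word' 'lang:en') ;
     rdfs:label ?lbl .
  GRAPH ?g { ?s a ex:Item } .
}
```

Whether this form outperforms searching inside `GRAPH ?g { ... }` depends, as the documentation text says, on the selectivity of the text query and the number of graphs.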



Re: CMS diff: Jena Full Text Search

2017-12-04 Thread Andy Seaborne

Changes applied and on jena.staging.apache.org.

Thank you

Andy



On 02/12/17 19:26, Andy Seaborne wrote:



On 01/12/17 16:26, Chris Tomlinson wrote:

Hi Andy,

The current commit is cumulative. The commit just prior to this one 
addressed all of Osma’s comments - which included my raising JENA-1437 
and JENA-1438. This commit contains my changes to reflect those two 
issues as being fixed along with all my prior updates.


If there is a better procedure for making a series of updates to docs 
under CMS I’m happy to learn.


This is fine - I was checking it was what it looked like before applying 
it due to limited familiarity with the text indexing.


     Andy



Thank you for your help with this,
Chris



On Dec 1, 2017, at 6:10 AM, Andy Seaborne  wrote:

Chris - does this contain the previous one?

If so, are Osma's comments resolved?

If it's all good to go, I'll apply it because changes can continue 
and are not frozen on a release.  I haven't had the time to check 
through the proposed changes and I'm not deeply familiar with the 
text indexing so I'm relying on others to verify them.


    Andy

On 30/11/17 17:28, Chris Tomlinson wrote:
This commit updates the jena-text documentation to be consistent 
with the resolved JENA-1437 and JENA-1438 issues.

Regards,
Chris
On Nov 30, 2017, at 5:00 PM, Chris Tomlinson wrote:


Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext 



Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1816662)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
Title: Jena Full Text Search

+Title: Jena Full Text Search
+
This extension to ARQ combines SPARQL and full text search via
[Lucene](https://lucene.apache.org) 6.4.1 or
[ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -64,7 +66,21 @@
## Table of Contents

-   [Architecture](#architecture)
+    -   [External content](#external-content)
+    -   [External applications](#external-applications)
+    -   [Document structure](#document-structure)
-   [Query with SPARQL](#query-with-sparql)
+    -   [Syntax](#syntax)
+    -   [Input arguments](#input-arguments)
+    -   [Output arguments](#output-arguments)
+    -   [Query strings](#query-strings)
+    -   [Simple queries](#simple-queries)
+    -   [Queries with language tags](#queries-with-language-tags)
+    -   [Queries that retrieve literals](#queries-that-retrieve-literals)
+    -   [Queries with graphs](#queries-with-graphs)
+    -   [Queries across multiple `Fields`](#queries-across-multiple-fields)
+    -   [Queries with _Boolean Operators_ and _Term Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
+    -   [Good practice](#good-practice)
-   [Configuration](#configuration)
 -   [Text Dataset Assembler](#text-dataset-assembler)
 -   [Configuring an analyzer](#configuring-an-analyzer)
@@ -108,6 +124,7 @@

The text index uses the native query language of the index:
[Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+(with [restrictions](#input-arguments))
or
[Elasticsearch query language](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).

@@ -134,6 +151,64 @@
By using Elasticsearch, other applications can share the text index with
SPARQL search.

+### Document structure
+
+As mentioned above, text indexing of a triple involves associating a Lucene
+document with the triple. How is this done?
+
+Lucene documents are composed of `Field`s. Indexing and searching are performed
+over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the
+_property_ of the triple must be
+[configured in the entity map of a TextIndex](#entity-map-definition).
+This associates a Lucene analyzer with the _`property`_ which will be used
+for indexing and search. The _`property`_ becomes the _searchable_ Lucene
+`Field` in the resulting document.
+
+A Lucene index includes a _default_ `Field`, which is specified in the configuration,
+that is the field to search if not otherwise named in the query. In jena-text
+this field is configured via the `text:defaultField` property which is then mapped
+to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition)
+below).
+
+There are several additional `Field`s that will be included in the
+document that is passed to the Lucene `IndexWriter` depending on the
+configuration options that are used. These additional fields 

Re: CMS diff: Jena Full Text Search

2017-12-02 Thread Andy Seaborne



On 01/12/17 16:26, Chris Tomlinson wrote:

Hi Andy,

The current commit is cumulative. The commit just prior to this one addressed all of Osma’s 
comments - which included my raising JENA-1437 and JENA-1438. This commit contains my changes to 
reflect those two issues as being fixed along with all my prior updates.

If there is a better procedure for making a series of updates to docs under CMS 
I’m happy to learn.


This is fine - I was checking it was what it looked like before applying 
it due to limited familiarity with the text indexing.


Andy



Thank you for your help with this,
Chris



On Dec 1, 2017, at 6:10 AM, Andy Seaborne  wrote:

Chris - does this contain the previous one?

If so, are Osma's comments resolved?

If it's all good to go, I'll apply it because changes can continue and are not 
frozen on a release.  I haven't had the time to check through the proposed 
changes and I'm not deeply familiar with the text indexing so I'm relying on 
others to verify them.

Andy

On 30/11/17 17:28, Chris Tomlinson wrote:

This commit updates the jena-text documentation to be consistent with the 
resolved JENA-1437 and JENA-1438 issues.
Regards,
Chris

On Nov 30, 2017, at 5:00 PM, Chris Tomlinson  wrote:

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1816662)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
Title: Jena Full Text Search

+Title: Jena Full Text Search
+
This extension to ARQ combines SPARQL and full text search via
[Lucene](https://lucene.apache.org) 6.4.1 or
[ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -64,7 +66,21 @@
## Table of Contents

-   [Architecture](#architecture)
+-   [External content](#external-content)
+-   [External applications](#external-applications)
+-   [Document structure](#document-structure)
-   [Query with SPARQL](#query-with-sparql)
+-   [Syntax](#syntax)
+-   [Input arguments](#input-arguments)
+-   [Output arguments](#output-arguments)
+-   [Query strings](#query-strings)
+-   [Simple queries](#simple-queries)
+-   [Queries with language tags](#queries-with-language-tags)
+-   [Queries that retrieve literals](#queries-that-retrieve-literals)
+-   [Queries with graphs](#queries-with-graphs)
+-   [Queries across multiple `Fields`](#queries-across-multiple-fields)
+-   [Queries with _Boolean Operators_ and _Term Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
+-   [Good practice](#good-practice)
-   [Configuration](#configuration)
 -   [Text Dataset Assembler](#text-dataset-assembler)
 -   [Configuring an analyzer](#configuring-an-analyzer)
@@ -108,6 +124,7 @@

The text index uses the native query language of the index:
[Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+(with [restrictions](#input-arguments))
or
[Elasticsearch query language](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).

@@ -134,6 +151,64 @@
By using Elasticsearch, other applications can share the text index with
SPARQL search.

+### Document structure
+
+As mentioned above, text indexing of a triple involves associating a Lucene
+document with the triple. How is this done?
+
+Lucene documents are composed of `Field`s. Indexing and searching are performed
+over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the
+_property_ of the triple must be
+[configured in the entity map of a TextIndex](#entity-map-definition).
+This associates a Lucene analyzer with the _`property`_ which will be used
+for indexing and search. The _`property`_ becomes the _searchable_ Lucene
+`Field` in the resulting document.
+
+A Lucene index includes a _default_ `Field`, which is specified in the configuration,
+that is the field to search if not otherwise named in the query. In jena-text
+this field is configured via the `text:defaultField` property which is then mapped
+to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition)
+below).
+
+There are several additional `Field`s that will be included in the
+document that is passed to the Lucene `IndexWriter` depending on the
+configuration options that are used. These additional fields are used to
+manage the interface between Jena and Lucene and are not generally
+searchable per se.
+
+The most important of these additional `Field`s is the 

Re: CMS diff: Jena Full Text Search

2017-12-01 Thread Chris Tomlinson
Hi Andy,

The current commit is cumulative. The commit just prior to this one addressed 
all of Osma’s comments - which included my raising JENA-1437 and JENA-1438. 
This commit contains my changes to reflect those two issues as being fixed 
along with all my prior updates.

If there is a better procedure for making a series of updates to docs under CMS 
I’m happy to learn.

Thank you for your help with this,
Chris


> On Dec 1, 2017, at 6:10 AM, Andy Seaborne  wrote:
> 
> Chris - does this contain the previous one?
> 
> If so, are Osma's comments resolved?
> 
> If it's all good to go, I'll apply it because changes can continue and are 
> not frozen on a release.  I haven't had the time to check through the 
> proposed changes and I'm not deeply familiar with the text indexing so I'm 
> relying on others to verify them.
> 
>Andy
> 
> On 30/11/17 17:28, Chris Tomlinson wrote:
>> This commit updates the jena-text documentation to be consistent with the 
>> resolved JENA-1437 and JENA-1438 issues.
>> Regards,
>> Chris
>>> On Nov 30, 2017, at 5:00 PM, Chris Tomlinson  wrote:
>>> 
>>> Clone URL (Committers only):
>>> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
>>> 
>>> Chris Tomlinson
>>> 
>>> Index: trunk/content/documentation/query/text-query.mdtext
>>> ===
>>> --- trunk/content/documentation/query/text-query.mdtext (revision 1816662)
>>> +++ trunk/content/documentation/query/text-query.mdtext (working copy)
>>> @@ -1,5 +1,7 @@
>>> Title: Jena Full Text Search
>>> 
>>> +Title: Jena Full Text Search
>>> +
>>> This extension to ARQ combines SPARQL and full text search via
>>> [Lucene](https://lucene.apache.org) 6.4.1 or
>>> [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
>>> @@ -64,7 +66,21 @@
>>> ## Table of Contents
>>> 
>>> -   [Architecture](#architecture)
>>> +-   [External content](#external-content)
>>> +-   [External applications](#external-applications)
>>> +-   [Document structure](#document-structure)
>>> -   [Query with SPARQL](#query-with-sparql)
>>> +-   [Syntax](#syntax)
>>> +-   [Input arguments](#input-arguments)
>>> +-   [Output arguments](#output-arguments)
>>> +-   [Query strings](#query-strings)
>>> +-   [Simple queries](#simple-queries)
>>> +-   [Queries with language tags](#queries-with-language-tags)
>>> +-   [Queries that retrieve literals](#queries-that-retrieve-literals)
>>> +-   [Queries with graphs](#queries-with-graphs)
>>> +-   [Queries across multiple `Fields`](#queries-across-multiple-fields)
>>> +-   [Queries with _Boolean Operators_ and _Term Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
>>> +-   [Good practice](#good-practice)
>>> -   [Configuration](#configuration)
>>> -   [Text Dataset Assembler](#text-dataset-assembler)
>>> -   [Configuring an analyzer](#configuring-an-analyzer)
>>> @@ -108,6 +124,7 @@
>>> 
>>> The text index uses the native query language of the index:
>>> [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
>>> +(with [restrictions](#input-arguments))
>>> or
>>> [Elasticsearch query language](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
>>> 
>>> @@ -134,6 +151,64 @@
>>> By using Elasticsearch, other applications can share the text index with
>>> SPARQL search.
>>> 
>>> +### Document structure
>>> +
>>> +As mentioned above, text indexing of a triple involves associating a Lucene
>>> +document with the triple. How is this done?
>>> +
>>> +Lucene documents are composed of `Field`s. Indexing and searching are performed
>>> +over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the
>>> +_property_ of the triple must be
>>> +[configured in the entity map of a TextIndex](#entity-map-definition).
>>> +This associates a Lucene analyzer with the _`property`_ which will be used
>>> +for indexing and search. The _`property`_ becomes the _searchable_ Lucene
>>> +`Field` in the resulting document.
>>> +
>>> +A Lucene index includes a _default_ `Field`, which is specified in the configuration,
>>> +that is the field to search if not otherwise named in the query. In jena-text
>>> +this field is configured via the `text:defaultField` property which is then mapped
>>> +to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition)
>>> +below).
>>> +
>>> +There are several additional `Field`s that will be included in the
>>> +document that is passed to the Lucene `IndexWriter` depending on the
>>> 

Re: CMS diff: Jena Full Text Search

2017-12-01 Thread Andy Seaborne

Chris - does this contain the previous one?

If so, are Osma's comments resolved?

If it's all good to go, I'll apply it because changes can continue and 
are not frozen on a release.  I haven't had the time to check through 
the proposed changes and I'm not deeply familiar with the text indexing 
so I'm relying on others to verify them.


Andy

On 30/11/17 17:28, Chris Tomlinson wrote:

This commit updates the jena-text documentation to be consistent with the 
resolved JENA-1437 and JENA-1438 issues.

Regards,
Chris


On Nov 30, 2017, at 5:00 PM, Chris Tomlinson  wrote:

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1816662)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
Title: Jena Full Text Search

+Title: Jena Full Text Search
+
This extension to ARQ combines SPARQL and full text search via
[Lucene](https://lucene.apache.org) 6.4.1 or
[ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -64,7 +66,21 @@
## Table of Contents

-   [Architecture](#architecture)
+-   [External content](#external-content)
+-   [External applications](#external-applications)
+-   [Document structure](#document-structure)
-   [Query with SPARQL](#query-with-sparql)
+-   [Syntax](#syntax)
+-   [Input arguments](#input-arguments)
+-   [Output arguments](#output-arguments)
+-   [Query strings](#query-strings)
+-   [Simple queries](#simple-queries)
+-   [Queries with language tags](#queries-with-language-tags)
+-   [Queries that retrieve literals](#queries-that-retrieve-literals)
+-   [Queries with graphs](#queries-with-graphs)
+-   [Queries across multiple `Fields`](#queries-across-multiple-fields)
+-   [Queries with _Boolean Operators_ and _Term Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
+-   [Good practice](#good-practice)
-   [Configuration](#configuration)
 -   [Text Dataset Assembler](#text-dataset-assembler)
 -   [Configuring an analyzer](#configuring-an-analyzer)
@@ -108,6 +124,7 @@

The text index uses the native query language of the index:
[Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+(with [restrictions](#input-arguments))
or
[Elasticsearch query language](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).

@@ -134,6 +151,64 @@
By using Elasticsearch, other applications can share the text index with
SPARQL search.

+### Document structure
+
+As mentioned above, text indexing of a triple involves associating a Lucene
+document with the triple. How is this done?
+
+Lucene documents are composed of `Field`s. Indexing and searching are performed
+over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the
+_property_ of the triple must be
+[configured in the entity map of a TextIndex](#entity-map-definition).
+This associates a Lucene analyzer with the _`property`_ which will be used
+for indexing and search. The _`property`_ becomes the _searchable_ Lucene
+`Field` in the resulting document.
+
+A Lucene index includes a _default_ `Field`, which is specified in the configuration,
+that is the field to search if not otherwise named in the query. In jena-text
+this field is configured via the `text:defaultField` property which is then mapped
+to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition)
+below).
+
+There are several additional `Field`s that will be included in the
+document that is passed to the Lucene `IndexWriter` depending on the
+configuration options that are used. These additional fields are used to
+manage the interface between Jena and Lucene and are not generally
+searchable per se.
+
+The most important of these additional `Field`s is the `text:entityField`.
+This configuration property defines the name of the `Field` that will contain
+the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. This property does
+not have a default and must be specified for most uses of `jena-text`. This
+`Field` is often given the name, `uri`, in examples. It is via this `Field`
+that `?s` is bound in a typical use such as:
+
+select ?s
+where {
+?s text:query "some text"
+}
+
+Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
+and so on are discussed below.
+
+Given the triple:
+
+ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
+
+The following is an abbreviated illustration of a Lucene document that Jena will create and
+request Lucene to index:

Re: CMS diff: Jena Full Text Search

2017-11-30 Thread Chris Tomlinson
This commit updates the jena-text documentation to be consistent with the 
resolved JENA-1437 and JENA-1438 issues.

Regards,
Chris

> On Nov 30, 2017, at 5:00 PM, Chris Tomlinson  wrote:
> 
> Clone URL (Committers only):
> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
> 
> Chris Tomlinson
> 
> Index: trunk/content/documentation/query/text-query.mdtext
> ===
> --- trunk/content/documentation/query/text-query.mdtext   (revision 1816662)
> +++ trunk/content/documentation/query/text-query.mdtext   (working copy)
> @@ -1,5 +1,7 @@
> Title: Jena Full Text Search
> 
> +Title: Jena Full Text Search
> +
> This extension to ARQ combines SPARQL and full text search via
> [Lucene](https://lucene.apache.org) 6.4.1 or
> [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
> @@ -64,7 +66,21 @@
> ## Table of Contents
> 
> -   [Architecture](#architecture)
> +-   [External content](#external-content)
> +-   [External applications](#external-applications)
> +-   [Document structure](#document-structure)
> -   [Query with SPARQL](#query-with-sparql)
> +-   [Syntax](#syntax)
> +-   [Input arguments](#input-arguments)
> +-   [Output arguments](#output-arguments)
> +-   [Query strings](#query-strings)
> +-   [Simple queries](#simple-queries)
> +-   [Queries with language tags](#queries-with-language-tags)
> +-   [Queries that retrieve literals](#queries-that-retrieve-literals)
> +-   [Queries with graphs](#queries-with-graphs)
> +-   [Queries across multiple `Fields`](#queries-across-multiple-fields)
> +-   [Queries with _Boolean Operators_ and _Term Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
> +-   [Good practice](#good-practice)
> -   [Configuration](#configuration)
> -   [Text Dataset Assembler](#text-dataset-assembler)
> -   [Configuring an analyzer](#configuring-an-analyzer)
> @@ -108,6 +124,7 @@
> 
> The text index uses the native query language of the index:
> [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
> +(with [restrictions](#input-arguments))
> or
> [Elasticsearch query language](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
> 
> @@ -134,6 +151,64 @@
> By using Elasticsearch, other applications can share the text index with
> SPARQL search.
> 
> +### Document structure
> +
> +As mentioned above, text indexing of a triple involves associating a Lucene
> +document with the triple. How is this done?
> +
> +Lucene documents are composed of `Field`s. Indexing and searching are performed
> +over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the
> +_property_ of the triple must be
> +[configured in the entity map of a TextIndex](#entity-map-definition).
> +This associates a Lucene analyzer with the _`property`_ which will be used
> +for indexing and search. The _`property`_ becomes the _searchable_ Lucene
> +`Field` in the resulting document.
> +
> +A Lucene index includes a _default_ `Field`, which is specified in the configuration,
> +that is the field to search if not otherwise named in the query. In jena-text
> +this field is configured via the `text:defaultField` property which is then mapped
> +to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition)
> +below).
> +
> +There are several additional `Field`s that will be included in the
> +document that is passed to the Lucene `IndexWriter` depending on the
> +configuration options that are used. These additional fields are used to
> +manage the interface between Jena and Lucene and are not generally 
> +searchable per se.
> +
> +The most important of these additional `Field`s is the `text:entityField`.
> +This configuration property defines the name of the `Field` that will contain
> +the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. This property does
> +not have a default and must be specified for most uses of `jena-text`. This
> +`Field` is often given the name, `uri`, in examples. It is via this `Field`
> +that `?s` is bound in a typical use such as:
> +
> +select ?s
> +where {
> +?s text:query "some text"
> +}
> +
> +Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
> +and so on are discussed below.
> +
> +Given the triple:
> +
> +ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
> +
> +The following is an abbreviated illustration of a Lucene document that Jena will create and
> +request Lucene to index:
> +
> +Document<
> + 
> + 
> + 
> + 
> +
>  
> +>
> +
> +It may 

CMS diff: Jena Full Text Search

2017-11-30 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1816662)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
 Title: Jena Full Text Search
 
+Title: Jena Full Text Search
+
 This extension to ARQ combines SPARQL and full text search via
 [Lucene](https://lucene.apache.org) 6.4.1 or
 [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -64,7 +66,21 @@
 ## Table of Contents
 
 -   [Architecture](#architecture)
+-   [External content](#external-content)
+-   [External applications](#external-applications)
+-   [Document structure](#document-structure)
 -   [Query with SPARQL](#query-with-sparql)
+-   [Syntax](#syntax)
+-   [Input arguments](#input-arguments)
+-   [Output arguments](#output-arguments)
+-   [Query strings](#query-strings)
+-   [Simple queries](#simple-queries)
+-   [Queries with language tags](#queries-with-language-tags)
+-   [Queries that retrieve literals](#queries-that-retrieve-literals)
+-   [Queries with graphs](#queries-with-graphs)
+-   [Queries across multiple `Fields`](#queries-across-multiple-fields)
+-   [Queries with _Boolean Operators_ and _Term Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
+-   [Good practice](#good-practice)
 -   [Configuration](#configuration)
 -   [Text Dataset Assembler](#text-dataset-assembler)
 -   [Configuring an analyzer](#configuring-an-analyzer)
@@ -108,6 +124,7 @@
 
 The text index uses the native query language of the index:
 [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+(with [restrictions](#input-arguments))
 or
 [Elasticsearch query language](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
 
@@ -134,6 +151,64 @@
 By using Elasticsearch, other applications can share the text index with
 SPARQL search.
 
+### Document structure
+
+As mentioned above, text indexing of a triple involves associating a Lucene
+document with the triple. How is this done?
+
+Lucene documents are composed of `Field`s. Indexing and searching are performed
+over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the
+_property_ of the triple must be
+[configured in the entity map of a TextIndex](#entity-map-definition).
+This associates a Lucene analyzer with the _`property`_ which will be used
+for indexing and search. The _`property`_ becomes the _searchable_ Lucene
+`Field` in the resulting document.
+
+A Lucene index includes a _default_ `Field`, which is specified in the configuration,
+that is the field to search if not otherwise named in the query. In jena-text
+this field is configured via the `text:defaultField` property which is then mapped
+to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition)
+below).
+
+There are several additional `Field`s that will be included in the
+document that is passed to the Lucene `IndexWriter` depending on the
+configuration options that are used. These additional fields are used to
+manage the interface between Jena and Lucene and are not generally 
+searchable per se.
+
+The most important of these additional `Field`s is the `text:entityField`.
+This configuration property defines the name of the `Field` that will contain
+the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. This property does
+not have a default and must be specified for most uses of `jena-text`. This
+`Field` is often given the name, `uri`, in examples. It is via this `Field`
+that `?s` is bound in a typical use such as:
+
+select ?s
+where {
+?s text:query "some text"
+}
+
+Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
+and so on are discussed below.
+
+Given the triple:
+
+ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
+
+The following is an abbreviated illustration of a Lucene document that Jena will create and
+request Lucene to index:
+
+Document<
+ 
+ 
+ 
+ 
+ 
+>
+
+It may be instructive to refer back to this example when considering the various
+points below.
+
 ## Query with SPARQL
 
 The URI of the text extension property function is
@@ -143,63 +218,292 @@
 
 ...   text:query ...
 
+### Syntax
 
 The following forms are all legal:
 
-?s text:query 'word'   # query
-?s text:query (rdfs:label 'word')  # query specific property if multiple
-?s text:query ('word' 10)  
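
As a rough illustration of how the argument forms listed in this hunk fit into a whole query, the sketch below combines a property argument with a result limit. It assumes `rdfs:label` is one of the properties configured in the entity map; the search term and the limit of 10 are arbitrary.

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Search the field mapped to rdfs:label for 'word', returning at most
# 10 matches, then fetch each label with an ordinary triple pattern.
SELECT ?s ?label
WHERE {
  ?s text:query (rdfs:label 'word' 10) ;
     rdfs:label ?label .
}
```

Dropping the property argument (`?s text:query ('word' 10)`) would search the default field instead.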

Re: CMS diff: Jena Full Text Search

2017-11-27 Thread Chris Tomlinson
Hi All,

I’ve completed my proposed updates to the Jena text-query documentation. The 
documentation corresponds to 3.6.0-SNAPSHOT. I’ve noted several instances where 
the current behavior may be considered an issue that will be corrected in a 
future release. I’ve separately created issues for these: JENA-1437, JENA-1438, 
and JENA-1439.

Thank you,
Chris


> On Nov 22, 2017, at 8:25 AM, Chris Tomlinson wrote:
> 
> Hi Andy and Osma,
> 
> I posted JENA-1426 since the “improve this page” facility didn’t seem to 
> offer any way to add a commit message or a fuller explanation of the reasons 
> for the proposed edits, which were somewhat extensive. Raising an issue 
> seemed a way to proceed; however, after several days with no comments I 
> thought perhaps I should follow the published protocol, so I made the update 
> as guest on the CMS.
> 
> I had several motivations regarding updating the documentation: 1) I wanted 
> to present how the current implementation functions in a way that might be 
> more useful to users - for example clarifying what can be expected to work 
> and what not in terms of using the native Lucene query language, e.g., 
> JENA-1388; 2) identify 
> areas that might indicate perhaps unintended aspects of the current 
> implementation; and 3) understand the code in preparation for developing a 
> proposal for adding jena-text highlighting support.
> 
> Based on Osma’s feedback I will be opening a few issues on JIRA and making 
> corrections to the original submission. I assume that updates should just be 
> made as further commits.
> 
> Thanks,
> Chris
> 
> 
> 
>> On Nov 22, 2017, at 6:41 AM, Andy Seaborne wrote:
>> 
>> How is this related to JENA-1426?
>> 
>>Andy
>> 
>> On 21/11/17 14:48, Osma Suominen wrote:
>>> ajs6f wrote on 20.11.2017 at 18:36:
>>>> Osma (or anyone else who knows text indexing better than do I, which 
>>>> wouldn't take much)-- could you review this? It's got some great useful 
>>>> detail about how the indexing works and can be used.
>>> Sure, will do.
>>> Comments about specific sections below. Generally this is a very good 
>>> contribution to the jena-text documentation, which has stagnated a bit.
> +The following illustrates a Lucene document that Jena will create and
> +request Lucene to index:
> +
> +Document<
> +stored, indexed, indexOptions=DOCS 
> >
> +indexed, omitNorms, indexOptions=DOCS 
> 
> +stored, indexed, tokenized 
> +stored, indexed, omitNorms, indexOptions=DOCS 
> +stored, indexed, tokenized 
> +stored, indexed, omitNorms, indexOptions=DOCS 
> 
> +stored, indexed, tokenized 
> +stored, indexed, omitNorms, indexOptions=DOCS 
> +stored, indexed, tokenized 
> +stored, indexed, omitNorms, indexOptions=DOCS 
> 
> +>
> +
> +It may be instructive to refer back to this example when considering the 
> various
> +points below.
>>> Not sure if this is a perfect illustration. The level of detail is rather 
>>> excessive. I know Lucene quite well and I still struggle to understand 
>>> what's going on here. Is there another way of presenting this information, 
>>> for example just a key-value list that shows the field values that get 
>>> stored in the document? I think the field options stored, indexed, 
>>> tokenized, omitNorms etc. are unnecessary here or at least should not be so 
>>> prominent.
> +The `lang:xx` specification is an optional string, where _xx_ is
> +a BCP-47 language tag. This restricts searches to field values that were 
> originally
> +indexed with the tag _xx_. Searches may be restricted to field values 
> with no
> +language tag via `"lang:none"`. The use of the `lang:xx` is only 
> effective if
> +[multilingual support](#linguistic-support-with-lucene-index) has been 
> configured.
>>> The last sentence is not true. You can restrict by language even without 
>>> enabling multilingual support, as long as langField has been set.
> +Further, if the `lang:xx` is used then the `property` URI must be 
> supplied
> +in order for searches to work.
>>> Not true. The default property should be used if no property was specified.
> +When working with `rdf:langString`s It may be tempting to write:
> +
> +?s text:query "protégé"@fr
> +
> +However, the above will silently fail to return results since 

CMS diff: Jena Full Text Search

2017-11-27 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1816402)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
 Title: Jena Full Text Search
 
+Title: Jena Full Text Search
+
 This extension to ARQ combines SPARQL and full text search via
 [Lucene](https://lucene.apache.org) 6.4.1 or
 [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -64,7 +66,21 @@
 ## Table of Contents
 
 -   [Architecture](#architecture)
+-   [External content](#external-content)
+-   [External applications](#external-applications)
+-   [Document structure](#document-structure)
 -   [Query with SPARQL](#query-with-sparql)
+-   [Syntax](#syntax)
+-   [Input arguments](#input-arguments)
+-   [Output arguments](#output-arguments)
+-   [Query strings](#query-strings)
+-   [Simple queries](#simple-queries)
+-   [Queries with language tags](#queries-with-language-tags)
+-   [Queries that retrieve literals](#queries-that-retrieve-literals)
+-   [Queries with graphs](#queries-with-graphs)
+-   [Queries across multiple `Fields`](#queries-across-multiple-fields)
+-   [Queries with _Boolean Operators_ and _Term Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
+-   [Good practice](#good-practice)
 -   [Configuration](#configuration)
 -   [Text Dataset Assembler](#text-dataset-assembler)
 -   [Configuring an analyzer](#configuring-an-analyzer)
@@ -108,6 +124,7 @@
 
 The text index uses the native query language of the index:
 [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+(with [restrictions](#input-arguments))
 or
 [Elasticsearch query language](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
 
@@ -134,6 +151,64 @@
 By using Elasticsearch, other applications can share the text index with
 SPARQL search.
 
+### Document structure
+
+As mentioned above, text indexing of a triple involves associating a Lucene
+document with the triple. How is this done?
+
+Lucene documents are composed of `Field`s. Indexing and searching are performed 
+over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the 
+_property_ of the triple must be 
+[configured in the entity map of a TextIndex](#entity-map-definition).
+This associates a Lucene analyzer with the _`property`_ which will be used
+for indexing and search. The _`property`_ becomes the _searchable_ Lucene 
+`Field` in the resulting document.
+
+A Lucene index includes a _default_ `Field`, which is specified in the configuration, 
+that is the field to search if not otherwise named in the query. In jena-text 
+this field is configured via the `text:defaultField` property which is then mapped 
+to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition) 
+below).
+
+There are several additional `Field`s that will be included in the
+document that is passed to the Lucene `IndexWriter` depending on the
+configuration options that are used. These additional fields are used to
+manage the interface between Jena and Lucene and are not generally 
+searchable per se.
+
+The most important of these additional `Field`s is the `text:entityField`.
+This configuration property defines the name of the `Field` that will contain
+the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. This property does
+not have a default and must be specified for most uses of `jena-text`. This
+`Field` is often given the name, `uri`, in examples. It is via this `Field`
+that `?s` is bound in a typical use such as:
+
+select ?s
+where {
+?s text:query "some text"
+}
+
+Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
+and so on are discussed below.
+
+Given the triple:
+
+ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
+
+The following is an abbreviated illustration of a Lucene document that Jena will create and
+request Lucene to index:
+
+Document<
+ 
+ 
+ 
+ 
+ 
+>
+
+It may be instructive to refer back to this example when considering the various
+points below.
+
 ## Query with SPARQL
 
 The URI of the text extension property function is
@@ -143,63 +218,298 @@
 
 ...   text:query ...
 
+### Syntax
 
 The following forms are all legal:
 
-?s text:query 'word'   # query
-?s text:query (rdfs:label 'word')  # query specific property if multiple
-?s text:query ('word' 10)  

CMS diff: Jena Full Text Search

2017-11-24 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1816205)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
 Title: Jena Full Text Search
 
+Title: Jena Full Text Search
+
 This extension to ARQ combines SPARQL and full text search via
 [Lucene](https://lucene.apache.org) 6.4.1 or
 [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -64,7 +66,20 @@
 ## Table of Contents
 
 -   [Architecture](#architecture)
+-   [External content](#external-content)
+-   [External applications](#external-applications)
+-   [Document structure](#document-structure)
 -   [Query with SPARQL](#query-with-sparql)
+-   [Syntax](#syntax)
+-   [Input arguments](#input-arguments)
+-   [Output arguments](#output-arguments)
+-   [Query strings](#query-strings)
+-   [Simple queries](#simple-queries)
+-   [Queries with language tags](#queries-with-language-tags)
+-   [Queries that retrieve literals](#queries-that-retrieve-literals)
+-   [Queries across multiple `Field`s](#queries-across-multiple-fields)
+-   [Queries with _Boolean Operators_ and _Term Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
+-   [Good practice](#good-practice)
 -   [Configuration](#configuration)
 -   [Text Dataset Assembler](#text-dataset-assembler)
 -   [Configuring an analyzer](#configuring-an-analyzer)
@@ -108,6 +123,7 @@
 
 The text index uses the native query language of the index:
 [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+(with [restrictions](#input-arguments))
 or
 [Elasticsearch query language](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
 
@@ -134,6 +150,64 @@
 By using Elasticsearch, other applications can share the text index with
 SPARQL search.
 
+### Document structure
+
+As mentioned above, text indexing of a triple involves associating a Lucene
+document with the triple. How is this done?
+
+Lucene documents are composed of `Field`s. Indexing and searching are performed 
+over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the 
+_property_ of the triple must be 
+[configured in the entity map of a TextIndex](#entity-map-definition).
+This associates a Lucene analyzer with the _`property`_ which will be used
+for indexing and search. The _`property`_ becomes the _searchable_ Lucene 
+`Field` in the resulting document.
+
+A Lucene index includes a _default_ `Field`, which is specified in the configuration, 
+that is the field to search if not otherwise named in the query. In jena-text 
+this field is configured via the `text:defaultField` property which is then mapped 
+to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition) 
+below).
+
+There are several additional `Field`s that will be included in the
+document that is passed to the Lucene `IndexWriter` depending on the
+configuration options that are used. These additional fields are used to
+manage the interface between Jena and Lucene and are not generally 
+searchable per se.
+
+The most important of these additional `Field`s is the `text:entityField`.
+This configuration property defines the name of the `Field` that will contain
+the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. This property does
+not have a default and must be specified for most uses of `jena-text`. This
+`Field` is often given the name, `uri`, in examples. It is via this `Field`
+that `?s` is bound in a typical use such as:
+
+select ?s
+where {
+?s text:query "some text"
+}
+
+Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
+and so on are discussed below.
+
+Given the triple:
+
+ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
+
+The following is an abbreviated illustration of a Lucene document that Jena will create and
+request Lucene to index:
+
+Document<
+ 
+ 
+ 
+ 
+ 
+>
+
+It may be instructive to refer back to this example when considering the various
+points below.
+
 ## Query with SPARQL
 
 The URI of the text extension property function is
@@ -143,63 +217,233 @@
 
 ...   text:query ...
 
+### Syntax
 
 The following forms are all legal:
 
-?s text:query 'word'   # query
-?s text:query (rdfs:label 'word')  # query specific property if multiple
-?s text:query ('word' 10)  # with limit on results
-(?s ?score) text:query 

Re: CMS diff: Jena Full Text Search

2017-11-22 Thread Chris Tomlinson
Hi Andy and Osma,

I posted JENA-1426 since the 
“improve this page” facility didn’t seem to offer any way to add a commit 
message or more extensive explanation of the reasons for the proposed edits and 
they were somewhat extensive. So raising an issue seemed a way to proceed; 
however, after several days with no comments I thought perhaps I should follow 
the published protocol and I made the update as guest on the CMS.

I had several motivations regarding updating the documentation: 1) I wanted to 
present how the current implementation functions in a way that might be more 
useful to users - for example clarifying what can be expected to work and what 
not in terms of using the native Lucene query language, e.g., JENA-1388; 
2) identify areas that might 
indicate perhaps unintended aspects of the current implementation; and 3) 
understand the code in preparation for developing a proposal for adding 
jena-text highlighting support.

Based on Osma’s feedback I will be opening a few issues on JIRA and making 
corrections to the original submission. I assume that updates should just be 
made as further commits.

Thanks,
Chris



> On Nov 22, 2017, at 6:41 AM, Andy Seaborne wrote:
> 
> How is this related to JENA-1426?
> 
>Andy
> 
> On 21/11/17 14:48, Osma Suominen wrote:
>> ajs6f wrote on 20.11.2017 at 18:36:
>>> Osma (or anyone else who knows text indexing better than do I, which 
>>> wouldn't take much)-- could you review this? It's got some great useful 
>>> detail about how the indexing works and can be used.
>> Sure, will do.
>> Comments about specific sections below. Generally this is a very good 
>> contribution to the jena-text documentation, which has stagnated a bit.
 +The following illustrates a Lucene document that Jena will create and
 +request Lucene to index:
 +
 +Document<
 +stored, indexed, indexOptions=DOCS 
 
 +indexed, omitNorms, indexOptions=DOCS 
 
 +stored, indexed, tokenized 
 +stored, indexed, omitNorms, indexOptions=DOCS 
 +stored, indexed, tokenized 
 +stored, indexed, omitNorms, indexOptions=DOCS 
 
 +stored, indexed, tokenized 
 +stored, indexed, omitNorms, indexOptions=DOCS 
 +stored, indexed, tokenized 
 +stored, indexed, omitNorms, indexOptions=DOCS 
 
 +>
 +
 +It may be instructive to refer back to this example when considering the 
 various
 +points below.
>> Not sure if this is a perfect illustration. The level of detail is rather 
>> excessive. I know Lucene quite well and I still struggle to understand 
>> what's going on here. Is there another way of presenting this information, 
>> for example just a key-value list that shows the field values that get 
>> stored in the document? I think the field options stored, indexed, 
>> tokenized, omitNorms etc. are unnecessary here or at least should not be so 
>> prominent.
 +The `lang:xx` specification is an optional string, where _xx_ is
 +a BCP-47 language tag. This restricts searches to field values that were 
 originally
 +indexed with the tag _xx_. Searches may be restricted to field values 
 with no
 +language tag via `"lang:none"`. The use of the `lang:xx` is only 
 effective if
 +[multilingual support](#linguistic-support-with-lucene-index) has been 
 configured.
>> The last sentence is not true. You can restrict by language even without 
>> enabling multilingual support, as long as langField has been set.
 +Further, if the `lang:xx` is used then the `property` URI must be supplied
 +in order for searches to work.
>> Not true. The default property should be used if no property was specified.
 +When working with `rdf:langString`s It may be tempting to write:
 +
 +?s text:query "protégé"@fr
 +
 +However, the above will silently fail to return results since the
 +`query string` must be a simple `xsd:string` not an `rdf:langString`.
>> This could be considered a bug - at least it shouldn't fail silently.
 +Even if the default _property_ is `skos:prefLabel` it is necessary
 +to use the above form rather than omitting the `property` argument
 +when restricting the Lucene search to a specific `lang:xx`; otherwise,
 +again there will be no results.
>> Again, not true. I just tested this query against YSO:
>>  ?s text:query ("cat" "lang:en")
>> and it gave a single result, as expected.
 +For a non-default `Field` with no language restriction, the patterns:
 +
 +?s text:query (rdfs:label "protégé")
 +
 +or
 +
 +?s text:query "rdfsLabel:protégé"
 +
 +may be used (see 

Re: CMS diff: Jena Full Text Search

2017-11-22 Thread Andy Seaborne

How is this related to JENA-1426?

Andy

On 21/11/17 14:48, Osma Suominen wrote:

ajs6f wrote on 20.11.2017 at 18:36:
Osma (or anyone else who knows text indexing better than do I, which 
wouldn't take much)-- could you review this? It's got some great 
useful detail about how the indexing works and can be used.


Sure, will do.

Comments about specific sections below. Generally this is a very good 
contribution to the jena-text documentation, which has stagnated a bit.




+The following illustrates a Lucene document that Jena will create and
+request Lucene to index:
+
+    Document<
+    stored, indexed, indexOptions=DOCS 

+    indexed, omitNorms, indexOptions=DOCS 


+    stored, indexed, tokenized 
+    stored, indexed, omitNorms, indexOptions=DOCS 
+    stored, indexed, tokenized 
+    stored, indexed, omitNorms, indexOptions=DOCS 


+    stored, indexed, tokenized 
+    stored, indexed, omitNorms, indexOptions=DOCS 
+    stored, indexed, tokenized 

+    stored, indexed, omitNorms, indexOptions=DOCS 


+    >
+
+It may be instructive to refer back to this example when considering 
the various

+points below.


Not sure if this is a perfect illustration. The level of detail is 
rather excessive. I know Lucene quite well and I still struggle to 
understand what's going on here. Is there another way of presenting this 
information, for example just a key-value list that shows the field 
values that get stored in the document? I think the field options 
stored, indexed, tokenized, omitNorms etc. are unnecessary here or at 
least should not be so prominent.




+The `lang:xx` specification is an optional string, where _xx_ is
+a BCP-47 language tag. This restricts searches to field values that 
were originally
+indexed with the tag _xx_. Searches may be restricted to field 
values with no
+language tag via `"lang:none"`. The use of the `lang:xx` is only 
effective if
+[multilingual support](#linguistic-support-with-lucene-index) has 
been configured.


The last sentence is not true. You can restrict by language even without 
enabling multilingual support, as long as langField has been set.


+Further, if the `lang:xx` is used then the `property` URI must be 
supplied

+in order for searches to work.


Not true. The default property should be used if no property was specified.



+When working with `rdf:langString`s It may be tempting to write:
+
+    ?s text:query "protégé"@fr
+
+However, the above will silently fail to return results since the
+`query string` must be a simple `xsd:string` not an `rdf:langString`.


This could be considered a bug - at least it shouldn't fail silently.


+Even if the default _property_ is `skos:prefLabel` it is necessary
+to use the above form rather than omitting the `property` argument
+when restricting the Lucene search to a specific `lang:xx`; otherwise,
+again there will be no results.


Again, not true. I just tested this query against YSO:
  ?s text:query ("cat" "lang:en")
and it gave a single result, as expected.


+For a non-default `Field` with no language restriction, the patterns:
+
+    ?s text:query (rdfs:label "protégé")
+
+or
+
+    ?s text:query "rdfsLabel:protégé"
+
+may be used (see [below](#entity-map-definition) for how RDF 
_property_ names
+are mapped to Lucene `Field` names). 


I wouldn't recommend using a query form like "rdfsLabel:protégé" in the 
documentation at all. It violates the layered architecture of jena-text 
- the query should not be targeting named fields. If you want to target 
rdfs:label, use the first form.



However, as mentioned earlier,
+
+    ?s text:query ("rdfsLabel:protégé" "lang:fr")
+
+will result in an error owing to the way in which the jena-text 
composes the

+query string to Lucene in the presence of the `"lang:fr"` argument.


Don't do that then. Remove this section. (see previous comment)


+However, it is important to note that the apparently equivalent form:
+
+    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
+
+will fail to produce a binding for `?lit` even though `?s` and `?sc` 
are

+bound as expected.


Again, don't do that. Use (rdfs:label "protégé") instead and let 
jena-text handle the translation from property to Lucene field.


+So if the _literal_ matches are needed you **must use** the query 
arguments that
+list the _property_ explicitly, except in the simple case of a query 
against

+the default `Field`/_property_.


Exactly. And those are the only supported query forms anyway.


+ Queries across multiple `Field`s
+
+It has been mentioned earlier that the text index uses the
+[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description); 

+however, there are important constraints on how the Lucene query 
language is used within jena-text.
+This is owing to the fact that jena-text composes the query string 
that is 

Re: CMS diff: Jena Full Text Search

2017-11-21 Thread Osma Suominen

ajs6f wrote on 20.11.2017 at 18:36:

Osma (or anyone else who knows text indexing better than do I, which wouldn't 
take much)-- could you review this? It's got some great useful detail about how 
the indexing works and can be used.


Sure, will do.

Comments about specific sections below. Generally this is a very good 
contribution to the jena-text documentation, which has stagnated a bit.




+The following illustrates a Lucene document that Jena will create and
+request Lucene to index:
+
+Document<
+stored, indexed, indexOptions=DOCS 
+indexed, omitNorms, indexOptions=DOCS 

+stored, indexed, tokenized 
+stored, indexed, omitNorms, indexOptions=DOCS 
+stored, indexed, tokenized 
+stored, indexed, omitNorms, indexOptions=DOCS 

+stored, indexed, tokenized 
+stored, indexed, omitNorms, indexOptions=DOCS 
+stored, indexed, tokenized 
+stored, indexed, omitNorms, indexOptions=DOCS 

+>
+
+It may be instructive to refer back to this example when considering the 
various
+points below.


Not sure if this is a perfect illustration. The level of detail is 
rather excessive. I know Lucene quite well and I still struggle to 
understand what's going on here. Is there another way of presenting this 
information, for example just a key-value list that shows the field 
values that get stored in the document? I think the field options 
stored, indexed, tokenized, omitNorms etc. are unnecessary here or at 
least should not be so prominent.




+The `lang:xx` specification is an optional string, where _xx_ is
+a BCP-47 language tag. This restricts searches to field values that were 
originally
+indexed with the tag _xx_. Searches may be restricted to field values with no
+language tag via `"lang:none"`. The use of the `lang:xx` is only effective if
+[multilingual support](#linguistic-support-with-lucene-index) has been 
configured.


The last sentence is not true. You can restrict by language even without 
enabling multilingual support, as long as langField has been set.



+Further, if the `lang:xx` is used then the `property` URI must be supplied
+in order for searches to work.


Not true. The default property should be used if no property was specified.



+When working with `rdf:langString`s It may be tempting to write:
+
+?s text:query "protégé"@fr
+
+However, the above will silently fail to return results since the
+`query string` must be a simple `xsd:string` not an `rdf:langString`.


This could be considered a bug - at least it shouldn't fail silently.


+Even if the default _property_ is `skos:prefLabel` it is necessary
+to use the above form rather than omitting the `property` argument
+when restricting the Lucene search to a specific `lang:xx`; otherwise,
+again there will be no results.


Again, not true. I just tested this query against YSO:
 ?s text:query ("cat" "lang:en")
and it gave a single result, as expected.
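
A minimal sketch of the two forms under discussion, assuming an index whose default field maps to `skos:prefLabel` and whose entity map sets `text:langField`. The prefixes and the search term are illustrative only and are not taken from YSO:

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?s
WHERE {
  # Language restriction against the default property (no property argument).
  { ?s text:query ("cat" "lang:en") }
  UNION
  # The same restriction with the property named explicitly.
  { ?s text:query (skos:prefLabel "cat" "lang:en") }
}
```

If the default property is indeed `skos:prefLabel`, both branches would be expected to match the same subjects.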


+For a non-default `Field` with no language restriction, the patterns:
+
+?s text:query (rdfs:label "protégé")
+
+or
+
+?s text:query "rdfsLabel:protégé"
+
+may be used (see [below](#entity-map-definition) for how RDF _property_ names
+are mapped to Lucene `Field` names). 


I wouldn't recommend using a query form like "rdfsLabel:protégé" in the 
documentation at all. It violates the layered architecture of jena-text 
- the query should not be targeting named fields. If you want to target 
rdfs:label, use the first form.



However, as mentioned earlier,
+
+?s text:query ("rdfsLabel:protégé" "lang:fr")
+
+will result in an error owing to the way in which the jena-text composes the
+query string to Lucene in the presence of the `"lang:fr"` argument.


Don't do that then. Remove this section. (see previous comment)


+However, it is important to note that the apparently equivalent form:
+
+(?s ?sc ?lit) text:query "rdfsLabel:protégé"
+
+will fail to produce a binding for `?lit` even though `?s` and `?sc` are
+bound as expected.


Again, don't do that. Use (rdfs:label "protégé") instead and let 
jena-text handle the translation from property to Lucene field.
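
For illustration, a sketch of the property-based form that also asks for the score and the matched literal. It assumes `rdfs:label` is configured in the entity map; the variable names are arbitrary:

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?sc ?lit
WHERE {
  # Naming the property lets jena-text map it to the Lucene field,
  # so the score (?sc) and the matching literal (?lit) can both be bound.
  (?s ?sc ?lit) text:query (rdfs:label "protégé")
}
```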



+So if the _literal_ matches are needed you **must use** the query arguments 
that
+list the _property_ explicitly, except in the simple case of a query against
+the default `Field`/_property_.


Exactly. And those are the only supported query forms anyway.


+ Queries across multiple `Field`s
+
+It has been mentioned earlier that the text index uses the
+[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description);
+however, there are important constraints on how the Lucene query language is 
used within jena-text.
+This is owing to the fact that jena-text composes the query string that is 
sent to Lucene so that
+features such as `lang:xx` may be implemented. Other aspects of using the 

Re: CMS diff: Jena Full Text Search

2017-11-20 Thread ajs6f
I went to review this diff and rediscovered (to my chagrin) that I really know 
very little about Jena's text indexing.

Osma (or anyone else who knows text indexing better than do I, which wouldn't 
take much)-- could you review this? It's got some great useful detail about how 
the indexing works and can be used.

ajs6f

> On Nov 20, 2017, at 1:51 AM, Chris Tomlinson wrote:
> 
> Clone URL (Committers only):
> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
> 
> Chris Tomlinson
> 
> Index: trunk/content/documentation/query/text-query.mdtext
> ===
> --- trunk/content/documentation/query/text-query.mdtext   (revision 1815762)
> +++ trunk/content/documentation/query/text-query.mdtext   (working copy)
> @@ -1,5 +1,7 @@
> Title: Jena Full Text Search
> 
> +Title: Jena Full Text Search
> +
> This extension to ARQ combines SPARQL and full text search via
> [Lucene](https://lucene.apache.org) 6.4.1 or
> [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
> @@ -64,7 +66,20 @@
> ## Table of Contents
> 
> -   [Architecture](#architecture)
> +-   [External content](#external-content)
> +-   [External applications](#external-applications)
> +-   [Document structure](#document-structure)
> -   [Query with SPARQL](#query-with-sparql)
> +-   [Syntax](#syntax)
> +-   [Input arguments](#input-arguments)
> +-   [Output arguments](#output-arguments)
> +-   [Query strings](#query-strings)
> +-   [Simple queries](#simple-queries)
> +-   [Queries with language tags](#queries-with-language-tags)
> +-   [Queries that retrieve literals](#queries-that-retrieve-literals)
> +-   [Queries across multiple `Field`s](#queries-across-multiple-fields)
> +-   [Queries within a `Field`](#queries-within-a-field)
> +-   [Good practice](#good-practice)
> -   [Configuration](#configuration)
> -   [Text Dataset Assembler](#text-dataset-assembler)
> -   [Configuring an analyzer](#configuring-an-analyzer)
> @@ -134,6 +149,69 @@
> By using Elasticsearch, other applications can share the text index with
> SPARQL search.
> 
> +### Document structure
> +
> +As mentioned above, text indexing of a triple involves associating a Lucene
> +document with the triple. How is this done?
> +
> +Lucene documents are composed of `Field`s. Indexing and searching are 
> performed 
> +over the contents of these `Field`s. For an RDF triple to be indexed in 
> Lucene the 
> +_property_ of the triple must be 
> +[configured in the entity map of a TextIndex](#entity-map-definition).
> +This associates a Lucene analyzer with the _`property`_ which will be used
> +for indexing and search. The _`property`_ becomes the _searchable_ Lucene 
> +`Field` in the resulting document.
> +
> +A Lucene index includes a _default_ `Field`, which is specified in the 
> configuration, 
> +that is the field to search if not otherwise named in the query. In 
> jena-text 
> +this field is configured via the `text:defaultField` property which is then 
> mapped 
> +to a specific RDF property via `text:predicate` (see [entity 
> map](#entity-map-definition) 
> +below).
> +
> +There are several additional `Field`s that will be included in the
> +document that is passed to the Lucene `IndexWriter` depending on the
> +configuration options that are used. These additional fields are used to
> +manage the interface between Jena and Lucene and are not generally 
> +searchable per se.
> +
> +The most important of these additional `Field`s is the `text:entityField`.
> +This configuration property defines the name of the `Field` that will contain
> +the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. 
> This property does
> +not have a default and must be specified for most uses of `jena-text`. This
> +`Field` is often given the name, `uri`, in examples. It is via this `Field`
> +that `?s` is bound in a typical use such as:
> +
> +select ?s
> +where {
> +?s text:query "some text"
> +}
> +
> +Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
> +and so on are discussed below.
> +
> +Given the triple:
> +
> +ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
> +
> +The following illustrates a Lucene document that Jena will create and
> +request Lucene to index:
> +
> +Document<
> +stored, indexed, indexOptions=DOCS  
> +indexed, omitNorms, indexOptions=DOCS 
>  
> +stored, indexed, tokenized  
> +stored, indexed, omitNorms, indexOptions=DOCS  
> +stored, indexed, tokenized  
> +stored, indexed, omitNorms, indexOptions=DOCS 
>  
> +stored, indexed, tokenized  
> +stored, indexed, omitNorms, indexOptions=DOCS  
> +stored, indexed, tokenized  
> +

CMS diff: Jena Full Text Search

2017-11-19 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1815762)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
 Title: Jena Full Text Search
 
+Title: Jena Full Text Search
+
 This extension to ARQ combines SPARQL and full text search via
 [Lucene](https://lucene.apache.org) 6.4.1 or
 [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -64,7 +66,20 @@
 ## Table of Contents
 
 -   [Architecture](#architecture)
+-   [External content](#external-content)
+-   [External applications](#external-applications)
+-   [Document structure](#document-structure)
 -   [Query with SPARQL](#query-with-sparql)
+-   [Syntax](#syntax)
+-   [Input arguments](#input-arguments)
+-   [Output arguments](#output-arguments)
+-   [Query strings](#query-strings)
+-   [Simple queries](#simple-queries)
+-   [Queries with language tags](#queries-with-language-tags)
+-   [Queries that retrieve literals](#queries-that-retrieve-literals)
+-   [Queries across multiple `Field`s](#queries-across-multiple-fields)
+-   [Queries within a `Field`](#queries-within-a-field)
+-   [Good practice](#good-practice)
 -   [Configuration](#configuration)
 -   [Text Dataset Assembler](#text-dataset-assembler)
 -   [Configuring an analyzer](#configuring-an-analyzer)
@@ -134,6 +149,69 @@
 By using Elasticsearch, other applications can share the text index with
 SPARQL search.
 
+### Document structure
+
+As mentioned above, text indexing of a triple involves associating a Lucene
+document with the triple. How is this done?
+
+Lucene documents are composed of `Field`s. Indexing and searching are performed
+over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the
+_property_ of the triple must be
+[configured in the entity map of a TextIndex](#entity-map-definition).
+This associates a Lucene analyzer with the _`property`_ which will be used
+for indexing and search. The _`property`_ becomes the _searchable_ Lucene 
+`Field` in the resulting document.
+
+A Lucene index includes a _default_ `Field`, which is specified in the configuration,
+that is the field to search if not otherwise named in the query. In jena-text
+this field is configured via the `text:defaultField` property which is then mapped
+to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition)
+below).
+
+There are several additional `Field`s that will be included in the
+document that is passed to the Lucene `IndexWriter` depending on the
+configuration options that are used. These additional fields are used to
+manage the interface between Jena and Lucene and are not generally 
+searchable per se.
+
+The most important of these additional `Field`s is the `text:entityField`.
+This configuration property defines the name of the `Field` that will contain
+the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. This property does
+not have a default and must be specified for most uses of `jena-text`. This
+`Field` is often given the name, `uri`, in examples. It is via this `Field`
+that `?s` is bound in a typical use such as:
+
+select ?s
+where {
+?s text:query "some text"
+}
+
+Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
+and so on are discussed below.
+
+Given the triple:
+
+ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
+
+The following illustrates a Lucene document that Jena will create and
+request Lucene to index:
+
+Document<
+stored, indexed, indexOptions=DOCS  
+indexed, omitNorms, indexOptions=DOCS 
 
+stored, indexed, tokenized  
+stored, indexed, omitNorms, indexOptions=DOCS  
+stored, indexed, tokenized  
+stored, indexed, omitNorms, indexOptions=DOCS 
 
+stored, indexed, tokenized  
+stored, indexed, omitNorms, indexOptions=DOCS  
+stored, indexed, tokenized  
+stored, indexed, omitNorms, indexOptions=DOCS 

+>
+
+It may be instructive to refer back to this example when considering the various
+points below.
+
 ## Query with SPARQL
 
 The URI of the text extension property function is
@@ -143,63 +221,248 @@
 
 ...   text:query ...
 
+### Syntax
 
 The following forms are all legal:
 
-?s text:query 'word'   # query
-?s text:query (rdfs:label 'word')  # query specific property if multiple
-?s text:query ('word' 10)  # with limit on results
-(?s ?score) text:query 'word'  # query capturing also the score
-