[jira] [Created] (JENA-1645) Poor performance with full text search (Lucene)

2018-12-03 Thread Vasyl Danyliuk (JIRA)
Vasyl Danyliuk created JENA-1645:


 Summary: Poor performance with full text search (Lucene)
 Key: JENA-1645
 URL: https://issues.apache.org/jira/browse/JENA-1645
 Project: Apache Jena
  Issue Type: Question
  Components: Jena
Affects Versions: Jena 3.9.0
Reporter: Vasyl Danyliuk


Situation: half a million documents (emails, actually) indexed by Lucene; I am 
searching for emails by sender/receiver and some text.

If the text filter is placed at the start of the SPARQL query, it executes once, 
but for very common words there are a lot of results (100 000+), which leads to 
poor performance; limiting the result count may end up with missed results.

If the text search is placed as the last condition, it executes once for each 
already found subject. That's completely OK, but the text search completely 
ignores the subject URI.

I found two methods in the TextQueryPF class: variableSubject(...) for the first 
case, and concreteSubject(...) for the second one.

The question is: why can't the subject URI be used as a constraint in the text 
search?





Re: Toward Jena 3.10.0

2018-12-03 Thread Greg Albiston

Hi Marco,

1. As mentioned this shouldn't be too difficult to support.

2. Yes, the indexing, or rather caching, is in-memory, but it is 
on-demand. There shouldn't be any delay at start-up beyond what Jena 
needs to do. The cost comes during query execution. The key invariant 
data produced for solutions is retained for a short period of time (but 
can be configured to be retained until termination). Some regularly 
re-used info is always kept until termination (e.g. any spatial 
reference system transformation that has been requested).


The main benefit of this is de-serialising geometry literals. The 
spatial relations arguments are between a pair of geometry literals, one 
of which is likely to be the same in the next solution, so keeping hold 
of both means that in a lot of cases the de-serialisation can be avoided for 
one (and possibly both if still retained from a previous set of solutions).
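
To illustrate the caching idea described above, here is a minimal sketch in Python. It is not the geosparql-jena code: the parser is a toy that only understands "POINT(x y)", the names parse_point and same_location are hypothetical, and a size-bounded LRU cache is swapped in for the time-based retention described in this message.

```python
from functools import lru_cache

parse_calls = 0  # counts real parses, for illustration only


@lru_cache(maxsize=1024)
def parse_point(wkt: str) -> tuple:
    """Toy stand-in for geometry de-serialisation: handles only 'POINT(x y)'."""
    global parse_calls
    parse_calls += 1
    inner = wkt.strip()[len("POINT("):-1]
    x, y = inner.split()
    return float(x), float(y)


def same_location(wkt_a: str, wkt_b: str) -> bool:
    """Toy spatial relation between a pair of geometry literals."""
    return parse_point(wkt_a) == parse_point(wkt_b)


# One argument stays the same across solutions, so it is parsed only once.
fixed = "POINT(0 0)"
others = ("POINT(0 0)", "POINT(1 2)", "POINT(1 2)")
results = [same_location(fixed, other) for other in others]
```

Six literal arguments are passed in total, but only the distinct lexical forms are actually parsed; the rest are served from the cache.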


The aim was to only do work that's needed, not do repeat work and to be 
generally quick (i.e. rely on JTS to be optimised for quick solutions 
between the geometry pairs and Jena to optimise queries). There are 24 
spatial relations and about half a dozen other functions so 
pre-computing every combination gets big quickly and produces data that 
users might not want/use.


A rough check of most of the spatial relations only requires a bounding box 
intersection or type check, so negative results can be quickly 
discarded.  I looked into caching and storing to file, but there just 
wasn't the benefit in my use case. It took longer to load up and then 
execute than to just execute from fresh and cache. Also, the spatial 
indexes implemented by JTS aren't designed/suited for the spatial 
relations. If there is a use-case that gets more benefit from 
pre-computing or storing between programme execution then I'm sure it 
can be adapted for, but in the context of GeoSPARQL this approach was 
effective.
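
The rough bounding box check mentioned above can be sketched as follows. This is an illustrative Python sketch under simple assumptions (geometries given as lists of (x, y) tuples, planar coordinates); the function names are hypothetical and it does not reproduce what JTS or geosparql-jena actually do.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def bounding_box(points: Iterable[Point]) -> Box:
    """Smallest axis-aligned box containing all the points."""
    pts = list(points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a: Box, b: Box) -> bool:
    """True if the two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def may_be_related(geom_a: Iterable[Point], geom_b: Iterable[Point]) -> bool:
    """Cheap pre-check for relations that need the geometries to share space.

    False means the exact (expensive) test can be skipped; True means it
    still has to be run.
    """
    return boxes_intersect(bounding_box(geom_a), bounding_box(geom_b))


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
far_away = [(5, 5), (6, 6)]
overlapping = [(1, 1), (3, 3)]
```

A negative answer from may_be_related is final for such relations, which is why negative results are cheap to discard; a positive answer only means the full geometry test is still required.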


3. If you could send me the dataset that causes these errors then I'll 
happily have a look into it.


4. The "geo:" prefix is the one used throughout the GeoSPARQL 
documentation, so has been used for consistency when needed. The code 
doesn't have a dependency on the "geo:" prefix, so there is no 
requirement on the user. It would probably cause more confusion to those 
following GeoSPARQL examples to not use the "geo:" prefix when necessary.


Thanks,

Greg

On 03/12/2018 15:46, Marco Neumann wrote:

Hi Greg, ok let's do it in the dev list first.

1. indeed the picking up of lat/long is a common if not the most common use
case for building a spatial index. last but not least to perform a
proximity search on 2D point geometries. (I know that the ogc recommends a
transformation path with a sparql query to turn lat / long into a WKT
geometry datatypes maybe we could provide this as a convenient option with
the release)

2. as far as I can see the spatial index in geosparql-jena is memory based.
it creates additional load time during server startup. Am I missing
something here, is there a file base spatial index as well?

3. error handling is disruptive. since we are hitting the spatial index
first during query execution I am seeing a number of unpleasant side
effects with syntactically correct sparql but semantically incorrect
spatial queries. e.g.

PREFIX geo: 
PREFIX geof: 

SELECT ?well
WHERE {
?well   ?geometry .
   FILTER(geof:sfWithin(?geometry,"POLYGON((-77 38,-77 0,0 38,0 0,0
0))"^^geo:wktLiteral))
} LIMIT 10

4. The re-use of the geo: prefix really isn't your problem I know but it
will create confusion. Wouldn't geosparql: be a better prefix for this? Is
the OGC now married to this prefix? It used to be
http://www.w3.org/2003/01/geo/wgs84_pos#

and there is more to come...

again thank you for working on this with your team Greg, much appreciated.








On Mon, Dec 3, 2018 at 2:15 PM Greg Albiston  wrote:


Hi Marco,

I've had a look at the documentation for Jena Spatial and it would seem
the main data change is the use of the Lat/Lon pairs.
This doesn't comply with the GeoSPARQL standard so support for this
would be a Jena extension.

This could be accommodated by a property function to convert to a WKT
Point literal with WGS84/CRS84 spatial reference.
Users would then be able to use the result in query for any of the
GeoSPARQL functions.

Alternatively, the spatial relations could all have an extra property
function defined, provide the conversion and hand over to the GeoSPARQL
equivalent property function. This wouldn't take long to integrate as
individual spatial relation property functions are very minimal.

The other item that jumps out is the Jena spatial property functions.

spatial:nearby, spatial:withinCircle, spatial:withinBox and
spatial:intersectBox all seem to be variations of Simple Features
spatial relations that are covered by GeoSPARQL. These property
functions can be 

Re: Toward Jena 3.10.0

2018-12-03 Thread Marco Neumann
Hi Greg, ok let's do it in the dev list first.

1. indeed, picking up lat/long is a common, if not the most common, use
case for building a spatial index, not least to perform a
proximity search on 2D point geometries. (I know that the OGC recommends a
transformation path with a SPARQL query to turn lat/long into WKT
geometry datatypes; maybe we could provide this as a convenient option with
the release.)

2. as far as I can see the spatial index in geosparql-jena is memory based.
it creates additional load time during server startup. Am I missing
something here, is there a file-based spatial index as well?

3. error handling is disruptive. since we are hitting the spatial index
first during query execution I am seeing a number of unpleasant side
effects with syntactically correct sparql but semantically incorrect
spatial queries. e.g.

PREFIX geo: 
PREFIX geof: 

SELECT ?well
WHERE {
   ?well   ?geometry .
  FILTER(geof:sfWithin(?geometry,"POLYGON((-77 38,-77 0,0 38,0 0,0
0))"^^geo:wktLiteral))
} LIMIT 10
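
One visible problem with the literal above is that the ring does not end on its starting coordinate, which WKT polygons require. A minimal Python sketch of such a check follows; the function name is hypothetical and this is only an illustration of validating the literal before it reaches the spatial index, not how geosparql-jena handles it.

```python
def ring_is_closed(ring_text: str) -> bool:
    """Check that a WKT ring 'x y,x y,...' ends where it starts.

    A linear ring needs at least four coordinates, the last one
    repeating the first.
    """
    coords = [
        tuple(float(v) for v in pair.split())
        for pair in ring_text.split(",")
    ]
    return len(coords) >= 4 and coords[0] == coords[-1]


# Ring taken from the query above, and a closed variant for comparison.
ring_from_query = "-77 38,-77 0,0 38,0 0,0 0"
closed_ring = "-77 38,-77 0,0 0,0 38,-77 38"
```

Here ring_is_closed(ring_from_query) is false, while the closed variant passes.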

4. The re-use of the geo: prefix really isn't your problem I know but it
will create confusion. Wouldn't geosparql: be a better prefix for this? Is
the OGC now married to this prefix? It used to be
http://www.w3.org/2003/01/geo/wgs84_pos#

and there is more to come...

again thank you for working on this with your team Greg, much appreciated.








On Mon, Dec 3, 2018 at 2:15 PM Greg Albiston  wrote:

> Hi Marco,
>
> I've had a look at the doucmentation for Jena Spatial and it would seem
> the main data change is the use of the Lat/Lon pairs.
> This doesn't comply with the GeoSPARQL standard so support for this
> would be a Jena extension.
>
> This could be accomodated by a property function to convert to a WKT
> Point literal with WGS84/CRS84 spatial reference.
> Users would then be able to use the result in query for any of the
> GeoSPARQL functions.
>
> Alternatively, the spatial relations could all have an extra property
> function defined, provide the conversion and hand over to the GeoSPARQL
> equivalent property function. This wouldn't take long to integrate as
> individual spatial relation property functions are very minimal.
>
> The other item that jumps out is the Jena spatial property functions.
>
> spatial:nearby, spatial:withinCircle, spatial:withinBox and
> spatial:interesectBox all seem to be variations of Simple Features
> spatial relations that are covered by GeoSPARQL. These property
> functions can be incorpated for backward compatability but it's whether
> these should just be offered as the current Lat/Lon pairs or expanded to
> accept geometry literals (i.e. WKT, GML etc.)? The latter option
> shouldn't be hard to provide for the same reason as above.
>
> spatial:north, spatial:south, spatial:west and spatial:east are not in
> GeoSPARQL. Again its a question of whether these should be provided more
> generally for WKT, GML geometry literals? There might need to be a bit
> of extra work handling both geographic and planar spatial reference
> systems, as Jean Spatial is only doing a spatial reference system.
>
> I don't think it would be too difficult to support the existing Jena
> Spatial functionality, at least based on the webpage
> (https://jena.apache.org/documentation/query/spatial-query.html), as an
> extension to what is provided by GeoSPARQL.
>
> Is there anything else that you were concerned about?
>
> Thanks,
>
> Greg
>
>
> On 03/12/2018 10:53, Marco Neumann wrote:
> > so I've had a look at this and while I think geosparql-jena is a very
> > welcomed contribution to the jena project I don't think we should rush
> with
> > the retirement of  jena-spatial at this point as Greg's approach will
> > require users to make changes to their existing data.
> >
> > I will engage Greg on us...@jena.apache.org again to clarify a few
> things
> > and hopefully get more people involved in this conversation around
> spatial,
> > geosparql and jena.
> >
> >
> >
> > On Fri, Nov 30, 2018 at 1:23 PM Marco Neumann 
> > wrote:
> >
> >> how quickly can you hook geosparql into the release?
> >>
> >> this would make lucene spatial obsolete in the next release.  has Greg
> >> released performance benchmarks for his implementation? as I said I will
> >> take a look at it over the weekend when time permits.
> >>
> >> On Fri, Nov 30, 2018 at 11:02 AM Andy Seaborne  wrote:
> >>
> >>> We could retire jena-spatial immediately after 3.10.0 - given the
> Lucene
> >>> change that might be smoother, one release with updated dependencies.
> >>>
> >>> If that is the way forward, I think it is (mildly) better to take it
> out
> >>> of the Fuseki/Full build in 3.10.0.
> >>>
> >>>   Andy
> >>>
> >>> On 29/11/2018 17:00, Marco Neumann wrote:
>  I will have to look into that I guess since I am frequent user of
> >>> spatial
>  data.
> 
>  why not go to 

Re: Toward Jena 3.10.0

2018-12-03 Thread Greg Albiston

Hi Marco,

I've had a look at the documentation for Jena Spatial and it would seem 
the main data change is the use of the Lat/Lon pairs.
This doesn't comply with the GeoSPARQL standard so support for this 
would be a Jena extension.


This could be accommodated by a property function to convert to a WKT 
Point literal with WGS84/CRS84 spatial reference.
Users would then be able to use the result in query for any of the 
GeoSPARQL functions.
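
As a rough illustration of the conversion described above, here is a short Python sketch. The function names are hypothetical and this is not a Jena property function; it only shows the lexical form such a conversion would produce. Note that CRS84 uses longitude/latitude axis order, so the pair is swapped in the output.

```python
CRS84_URI = "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
WKT_DATATYPE = "http://www.opengis.net/ont/geosparql#wktLiteral"


def lat_lon_to_wkt_point(lat: float, lon: float) -> str:
    """Build the lexical form of a WKT Point literal in CRS84 (lon lat order)."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return f"<{CRS84_URI}> POINT({lon} {lat})"


def as_typed_literal(lexical: str) -> str:
    """Wrap the lexical form as a typed literal in SPARQL/Turtle syntax."""
    return f'"{lexical}"^^<{WKT_DATATYPE}>'


london = lat_lon_to_wkt_point(51.5, -0.12)
```

The resulting string could then be passed, as a wktLiteral, to any of the GeoSPARQL functions in a query.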


Alternatively, the spatial relations could all have an extra property 
function defined, provide the conversion and hand over to the GeoSPARQL 
equivalent property function. This wouldn't take long to integrate as 
individual spatial relation property functions are very minimal.


The other item that jumps out is the Jena spatial property functions.

spatial:nearby, spatial:withinCircle, spatial:withinBox and 
spatial:intersectBox all seem to be variations of Simple Features 
spatial relations that are covered by GeoSPARQL. These property 
functions can be incorporated for backward compatibility, but the question is 
whether these should just be offered for the current Lat/Lon pairs or expanded 
to accept geometry literals (e.g. WKT, GML etc.). The latter option 
shouldn't be hard to provide for the same reason as above.


spatial:north, spatial:south, spatial:west and spatial:east are not in 
GeoSPARQL. Again it's a question of whether these should be provided more 
generally for WKT, GML geometry literals. There might need to be a bit 
of extra work handling both geographic and planar spatial reference 
systems, as Jena Spatial only handles a single spatial reference system.


I don't think it would be too difficult to support the existing Jena 
Spatial functionality, at least based on the webpage 
(https://jena.apache.org/documentation/query/spatial-query.html), as an 
extension to what is provided by GeoSPARQL.


Is there anything else that you were concerned about?

Thanks,

Greg


On 03/12/2018 10:53, Marco Neumann wrote:

so I've had a look at this and while I think geosparql-jena is a very
welcomed contribution to the jena project I don't think we should rush with
the retirement of  jena-spatial at this point as Greg's approach will
require users to make changes to their existing data.

I will engage Greg on us...@jena.apache.org again to clarify a few things
and hopefully get more people involved in this conversation around spatial,
geosparql and jena.



On Fri, Nov 30, 2018 at 1:23 PM Marco Neumann 
wrote:


how quickly can you hook geosparql into the release?

this would make lucene spatial obsolete in the next release.  has Greg
released performance benchmarks for his implementation? as I said I will
take a look at it over the weekend when time permits.

On Fri, Nov 30, 2018 at 11:02 AM Andy Seaborne  wrote:


We could retire jena-spatial immediately after 3.10.0 - given the Lucene
change that might be smoother, one release with updated dependencies.

If that is the way forward, I think it is (mildly) better to take it out
of the Fuseki/Full build in 3.10.0.

  Andy

On 29/11/2018 17:00, Marco Neumann wrote:

I will have to look into that I guess since I am a frequent user of spatial
data.

why not go to 7.5? was there an incompatibility?

On Thu 29. Nov 2018 at 16:53, Andy Seaborne  wrote:


Jena 3.10.0 would be around the end of the year. I'd like to make use of
Greg's GeoSPARQL project the "headline" item for the release and to
retire jena-spatial in 3.10.0 as an indication of this.

Because retirement is a new process for the project, I'm sending this
first 3.10.0 message quite early to give us discussion time.

== Retirements

We have talked about this before but not actually done anything. See
separate thread for discussion on retirement process and for the first
modules:

jena-spatial
jena-fuseki1
jena-csv

== Headlines

JENA-664 : GeoSPARQL support

I'd like to make use of Greg's GeoSPARQL project the "headline" item for
the release and to retire jena-spatial in 3.10.0 as an indication of this.

JENA-1621 : Lucene upgrade to 7.4
  May need to reload Lucene indexes
(e.g. if the Lucene index was originally created with Lucene v5.x, prior to
Jena 3.3.0). See the Lucene upgrade tool:
https://lucene.apache.org/solr/guide/7_4/indexupgrader-tool.html

JENA-1623 : Fuseki security
JENA-1627 : HTTP support
https://issues.apache.org/jira/browse/JENA-1623


http://jena.staging.apache.org/documentation/fuseki2/data-access-control

== JIRA:

31 currently.

https://s.apache.org/jena-3.10.0-jira

== Updates

Only plugins. JENA-1624

surefire : 2.21.0 -> 2.22.1 (+ SUREFIRE-1588)
compiler : 3.7.0 -> 3.8.0
shade: 3.1.0 -> 3.2.0

  Andy



--


---
Marco Neumann
KONA




Re: Toward Jena 3.10.0

2018-12-03 Thread Marco Neumann
so I've had a look at this and while I think geosparql-jena is a very
welcomed contribution to the jena project I don't think we should rush with
the retirement of  jena-spatial at this point as Greg's approach will
require users to make changes to their existing data.

I will engage Greg on us...@jena.apache.org again to clarify a few things
and hopefully get more people involved in this conversation around spatial,
geosparql and jena.



On Fri, Nov 30, 2018 at 1:23 PM Marco Neumann 
wrote:

> how quickly can you hook geosparql into the release?
>
> this would make lucene spatial obsolete in the next release.  has Greg
> released performance benchmarks for his implementation? as I said I will
> take a look at it over the weekend when time permits.
>
> On Fri, Nov 30, 2018 at 11:02 AM Andy Seaborne  wrote:
>
>> We could retire jena-spatial immediately after 3.10.0 - given the Lucene
>> change that might be smoother, one release with updated dependencies.
>>
>> If that is the way forward, I think it is (mildly) better to take it out
>> of the Fuseki/Full build in 3.10.0.
>>
>>  Andy
>>
>> On 29/11/2018 17:00, Marco Neumann wrote:
>> > I will have to look into that I guess since I am frequent user of
>> spatial
>> > data.
>> >
>> > why not go to 7.5? was there an incompatibility?
>> >
>> > On Thu 29. Nov 2018 at 16:53, Andy Seaborne  wrote:
>> >
>> >> Jena 3.1.0 would be around the end of the year. I'd like to make use of
>> >> Greg's GeoSPARQL project the "headline" item for the release and to
>> >> retire jena-spatial in 3.10.0 as an indication of this.
>> >>
>> >> Because retirement is a new process for the project, I'm sending this
>> >> first 3.10.0 message quite early to give us discussion time.
>> >>
>> >> == Retirements
>> >>
>> >> We have talked about this before but not actually done anything. See
>> >> separate thread for discussion on retirement process and for the first
>> >> modules:
>> >>
>> >> jena-spatial
>> >> jena-fuseki1
>> >> jena-csv
>> >>
>> >> == Headlines
>> >>
>> >> JENA-664 : GeoSPARQL support
>> >>
>> >> I'd like to make use of Greg's GeoSPARQL project the "headline" item
>> for
>> >> the release and to retire jena-spatial in 3.10.0 as an indication of
>> this.
>> >>
>> >> JENA-1621 : Lucene upgrade to 7.4
>> >>  May need to reload lucene indexes.
>> >> (e.g. the lucene index was create originally with Lucene v5.x (prior
>> >> Jena 3.3.0). See Lucene upgrade tool.
>> >> https://lucene.apache.org/solr/guide/7_4/indexupgrader-tool.html
>> >>
>> >> JENA-1623 : Fuseki security
>> >> JENA-1627 : HTTP support
>> >> https://issues.apache.org/jira/browse/JENA-1623
>> >>
>> http://jena.staging.apache.org/documentation/fuseki2/data-access-control
>> >>
>> >> == JIRA:
>> >>
>> >> 31 currently.
>> >>
>> >> https://s.apache.org/jena-3.10.0-jira
>> >>
>> >> == Updates
>> >>
>> >> Only plugins. JENA-1624
>> >>
>> >> surefire : 2.21.0 -> 2.22.1 (+ SUREFIRE-1588)
>> >> compiler : 3.7.0 -> 3.8.0
>> >> shade: 3.1.0 -> 3.2.0
>> >>
>> >>  Andy
>> >>
>>
>
>
> --
>
>
> ---
> Marco Neumann
> KONA
>
>

-- 


---
Marco Neumann
KONA