Re: GeoSPARQL process

2019-04-15 Thread ajs6f
Thanks, Greg, this is very detailed. Once the new module is in and settled and 
we have a release or two to learn from, I will take a closer look at the usage 
of this code to understand how it differs from the kind of caching that occurs 
elsewhere in Jena.

ajs6f




Re: GeoSPARQL process

2019-04-15 Thread Andy Seaborne

Hi Greg,

Neither of those (jdom2, rdf-tables) is a problem or needs anything done
before we can release Jena with GeoSPARQL in it. They can be changed, or
not, later.


For timing: everyone is busy!

We could release 3.11.0 ASAP (it's 4 months since 3.10.0) and 
immediately start on 3.12.0. I have some time to help with a 3.12 ... 
hoping to get it all done during May.


Or we could just accept a delay to 3.11.0.

It is the usual tension between perfect and timely with volunteer time!


What needs to happen for the geosparql contribution:

1/ The code should be under java package org.apache.jena
I suggested:
  io.github.galbiston.geosparql_jena
 => org.apache.jena.geosparql
  io.github.galbiston.geosparql_fuseki
 => org.apache.jena.fuseki.geosparql

2/ Modules:

  jena-geosparql
  jena-fuseki/jena-fuseki-geospatial


3/ A "pull request" from Greg. That makes it clear it is being 
contributed.


Then the project can add:

4/ A NOTICE file for the combined Fuseki jars. It goes in the code tree at
src/main/resources/META-INF and ends up in the shaded jar. I can help
with that.


5/ POM files ... because the build is Maven.
(Were the ones I put on the gist OK?)


It is not necessary for release to do every piece of tidying up like 
dependency management of versions in the top pom.xml.


Andy



Re: GeoSPARQL process

2019-04-14 Thread Greg Albiston

Hi,

There are a lot of permutations that a GeoSPARQL query could take which
can generate different values that may or may not be useful later on.
The general strategy is to keep what is generated for a while and, if it
isn't used, then drop it. I don't think any of the Cache implementations
offer this or a suitable alternative.

The expiring-map removes entries that haven't been reused after a period
of time. The duration to retain, rate of checking and maximum size can
all be set. It is used for three purposes:

- The Geometry Wrapper object resulting from de-serialising the Geometry
Literals.
- The transformed Geometry Wrapper object from changing the spatial
reference system.
- The result of a spatial relation between two Geometry Literals to
avoid re-testing when Query Re-writing is applied.

Most of the GeoSPARQL functions are between two Geometry Literals, so
one could be needed in the next iteration of the query and the other
could be needed later.

The first purpose offers the biggest impact on performance as there is
additional de-serialising of the Geometry Literal while Jena is
processing the query. Complex shapes, e.g. polygons, can be very costly
to extract.

The second purpose offers most benefit when complex shapes need
transforming. These transformations may be needed again during this
query but not the next. For example, the dataset is in SRS A. Query 1 is a
comparison with a set of values in SRS B. Query 2 is then a comparison
with a set of values in SRS C. The transformations cached for Query 1 are
useless to Query 2 and may never be needed again.

The third purpose is due to GeoSPARQL allowing query re-writing where
the Geometry Literal isn't specified and Features and Geometries are used
instead, so a single query could test the same spatial relation up to
four times depending on bindings.
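
To make the third purpose concrete, here is a minimal Java sketch of
memoising a spatial relation test between two Geometry Literals. It is an
illustration only, not code from the GeoSPARQL module: the class
RelationMemoSketch, its Key type and the use of the literals' lexical
strings as cache keys are all assumptions, and the relation test itself
is passed in as a stand-in predicate.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiPredicate;

/** Hypothetical sketch: remembers the result of a relation test per ordered pair of literals. */
final class RelationMemoSketch {

    /** Cache key: relation name plus the two literals, in order (relations need not be symmetric). */
    private static final class Key {
        final String relation;
        final String left;
        final String right;

        Key(String relation, String left, String right) {
            this.relation = relation;
            this.left = left;
            this.right = right;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key k = (Key) o;
            return relation.equals(k.relation) && left.equals(k.left) && right.equals(k.right);
        }

        @Override
        public int hashCode() {
            return Objects.hash(relation, left, right);
        }
    }

    private final Map<Key, Boolean> results = new ConcurrentHashMap<>();
    private final AtomicInteger evaluations = new AtomicInteger();

    /** Runs the (assumed expensive) test only the first time a given key is seen. */
    boolean test(String relation, String left, String right,
                 BiPredicate<String, String> expensiveTest) {
        return results.computeIfAbsent(new Key(relation, left, right), k -> {
            evaluations.incrementAndGet();
            return expensiveTest.test(left, right);
        });
    }

    /** Number of times the underlying test was actually executed. */
    int evaluations() {
        return evaluations.get();
    }
}
```

With a cache like this, a re-written query that reaches the same pair of
literals through several bindings pays for the relation test once.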

The expiring-map is allowed to fill up while the query is processing and
then drops entries that aren't reused (in batches) or once the query
completes. Once it is full, new entries are quickly rejected, but space
is freed up later as entries that are not re-used expire. A user with a
small dataset can cache everything, while a user with a large dataset can
choose to constrain it to get some benefit from caching without consuming
vast chunks of memory.
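
The behaviour described above (bounded size, cheap rejection when full,
batch removal of entries that were not reused) can be sketched roughly as
follows. This is not the expiring-map library's API: the class
ExpiringCacheSketch, its method names and the injectable clock are
assumptions made for illustration, and the size check is deliberately
simplistic rather than strictly thread-safe.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a size-bounded map whose unused entries expire. */
final class ExpiringCacheSketch<K, V> {

    /** A cached value plus the time it was last stored or read. */
    private static final class Entry<V> {
        final V value;
        volatile long lastUsedMillis;

        Entry(V value, long now) {
            this.value = value;
            this.lastUsedMillis = now;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final int maxSize;
    private final long retainMillis;
    private final LongSupplier clock;

    ExpiringCacheSketch(int maxSize, long retainMillis, LongSupplier clock) {
        this.maxSize = maxSize;
        this.retainMillis = retainMillis;
        this.clock = clock;
    }

    /** Stores the value unless the cache is full; a full cache rejects new keys without evicting. */
    boolean put(K key, V value) {
        if (entries.size() >= maxSize && !entries.containsKey(key)) {
            return false;
        }
        entries.put(key, new Entry<>(value, clock.getAsLong()));
        return true;
    }

    /** Returns the cached value and marks it as recently used, or null if absent. */
    V get(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        entry.lastUsedMillis = clock.getAsLong();
        return entry.value;
    }

    /** Drops, in one batch, every entry not used within the retention period. */
    int removeExpired() {
        long cutoff = clock.getAsLong() - retainMillis;
        int before = entries.size();
        entries.values().removeIf(e -> e.lastUsedMillis < cutoff);
        return before - entries.size();
    }

    int size() {
        return entries.size();
    }
}
```

In a real deployment removeExpired would presumably be called on a timer
(the "rate of checking" setting), with System::currentTimeMillis as the
clock; the clock is injected here only so the behaviour can be exercised
deterministically.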

I tried using the Apache Commons Collections 4 LRUMap and it made performance
worse once it was filled (at a guess due to "one out, one in" and
constant searching). I only found one Java implementation of a time-based
cache. It seemed excessive to have the whole dependency for one
class and it wasn't as flexible as required.
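
For comparison, the "one out, one in" behaviour of an LRU cache can be
seen in a small sketch. It uses java.util.LinkedHashMap in access order
as a stand-in, not the Commons Collections LRUMap that was actually
tried, and the class name LruSketch is hypothetical; it shows only the
eviction policy and says nothing about the performance observed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU sketch: once full, every new key evicts the least recently used one. */
final class LruSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int capacity;

    LruSketch(int capacity) {
        // accessOrder = true: iteration order runs from least to most recently accessed.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the eldest entry.
        return size() > capacity;
    }
}
```

Unlike the expiring approach, a full LRU cache never rejects an insert,
so a long stream of values that are used only once keeps displacing
entries that might have been reused.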

Hopefully this clarifies why the expiring-map approach was adopted.

Thanks,

Greg



Re: GeoSPARQL process

2019-04-14 Thread Greg Albiston

Hi,

- rdf-tables: This could be taken out if problematic. It is a CSV/TSV to
RDF converter to provide another route to load geospatial data and was
useful on another project. Given that jena-csv has been deprecated,
there might not be the demand for its inclusion.

- jdom2: This is only used for GML reading/writing. Could look into
replacing with any XML library already used by Jena. Recently found that
Apache SIS offers a GML parser so will investigate whether this can be
used (would offer more flexibility and maintenance with the GML versions).

Thanks,

Greg



Re: GeoSPARQL process

2019-04-10 Thread Andy Seaborne




On 09/04/2019 17:02, Andy Seaborne wrote:

> Here are the new dependencies:
>
> [INFO] |  +- org.apache.sis.core:sis-referencing:jar:0.8:compile
> [INFO] |  |  +- javax.measure:unit-api:jar:1.0:compile
> [INFO] |  |  \- org.opengis:geoapi:jar:3.0.1:compile
>
> via the org.apache.sis
>
> org.opengis:geoapi
>    https://github.com/opengeospatial/geoapi
>    A form of BSD license.
>
> javax.measure:unit-api
>    https://github.com/unitsofmeasurement/unit-api
>    BSD 3-clause.
>
> [INFO] |  +- org.locationtech.jts:jts-core:jar:1.16.1:compile
>
> Eclipse Distribution License 1.0

EDL 1.0 is cat-A

Treat like BSD - NOTICE entry when repackage needed.

Link to http://www.eclipse.org/org/documents/edl-v10.php
is acceptable. (generally, links instead of a copy are now considered
acceptable).

> [INFO] |  +- org.jdom:jdom2:jar:2.0.6:compile
>
> Modified BSD - it does not appear to be the problematic, old BSD
> 4-clause. Seems like 3-clause with clause 3 is split in two.
>
> Needs more eyes on it.

https://issues.apache.org/jira/browse/LEGAL-204

It is the BSD 2-clause license with two extra clauses about name usage.

NOTICE entry when repackage needed.
https://github.com/hunterhacker/jdom/blob/master/LICENSE.txt

> [INFO] |  \- io.github.galbiston:expiring-map:jar:1.0.2:compile
> [INFO] +- io.github.galbiston:rdf-tables:jar:1.0.4:compile
>
> AL2 :-)
>
> [INFO] |  +- com.opencsv:opencsv:jar:3.9:runtime
>
> https://sourceforge.net/p/opencsv/source/ci/master/tree/LICENSE
> AL2
>
> [INFO] +- com.beust:jcommander:jar:1.72:compile
>
> https://github.com/cbeust/jcommander
> AL2

     Andy


On 08/04/2019 17:29, Andy Seaborne wrote:
 > Added a POM file for jena-fuseki-geosparql to the same gist:
 >
 > https://gist.github.com/afs/c6c291812bbc96fe55ac64ecdd1edfe4
 >
 > Had to do some exclusions on rdf-tables.
 >
 >  Andy
 >


Re: GeoSPARQL process

2019-04-10 Thread ajs6f
Just out of curiosity, Greg, what is the functionality offered by Expiring Map 
that isn't offered by Jena's already-extant oaj.atlas.lib.Cache 
implementations? Is it the ability to manually trigger expirations?

ajs6f

> On Apr 9, 2019, at 12:02 PM, Andy Seaborne  wrote:
> 
> [INFO] |  \- io.github.galbiston:expiring-map:jar:1.0.2:compile



GeoSPARQL process

2019-04-09 Thread Andy Seaborne

Here are the new dependencies:

[INFO] |  +- org.apache.sis.core:sis-referencing:jar:0.8:compile
[INFO] |  |  +- javax.measure:unit-api:jar:1.0:compile
[INFO] |  |  \- org.opengis:geoapi:jar:3.0.1:compile

via the org.apache.sis

org.opengis:geoapi
  https://github.com/opengeospatial/geoapi
  A form of BSD license.

javax.measure:unit-api
  https://github.com/unitsofmeasurement/unit-api
  BSD 3-clause.

[INFO] |  +- org.locationtech.jts:jts-core:jar:1.16.1:compile

Eclipse Distribution License 1.0

[INFO] |  +- org.jdom:jdom2:jar:2.0.6:compile

Modified BSD - it does not appear to be the problematic, old BSD 
4-clause. Seems like 3-clause with clause 3 is split in two.


Needs more eyes on it.

[INFO] |  \- io.github.galbiston:expiring-map:jar:1.0.2:compile
[INFO] +- io.github.galbiston:rdf-tables:jar:1.0.4:compile

AL2 :-)

[INFO] |  +- com.opencsv:opencsv:jar:3.9:runtime

https://sourceforge.net/p/opencsv/source/ci/master/tree/LICENSE
AL2

[INFO] +- com.beust:jcommander:jar:1.72:compile

https://github.com/cbeust/jcommander
AL2

Andy


On 08/04/2019 17:29, Andy Seaborne wrote:
> Added a POM file for jena-fuseki-geosparql to the same gist:
>
> https://gist.github.com/afs/c6c291812bbc96fe55ac64ecdd1edfe4
>
> Had to do some exclusions on rdf-tables.
>
>  Andy
>