[jira] [Updated] (JENA-2311) query rewrite index does too expensive caching on geo literals

Jira Mon, 14 Mar 2022 01:43:05 -0700


     [ 
https://issues.apache.org/jira/browse/JENA-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lorenz Bühmann updated JENA-2311:
---------------------------------
    Description: 
Using a GeoSPARQL query with a geospatial property function, e.g.


{code:java}
SELECT * {
:x geo:hasGeometry ?geo1 .
?s2 geo:hasGeometry ?geo2 .
?geo1 geo:sfContains ?geo2
}
{code}


leads to heavy memory consumption on larger datasets - and we're not talking 
about big data at all. Imagine a given polygon and checking millions of 
geometries for containment in that polygon.

In the {{QueryRewriteIndex}} class, a key is generated for caching, but this 
is horribly expensive: the string representation of the geometries is 
requested millions of times, creating millions of byte arrays and possibly 
leading to an OOM exception - we got one with 8GB assigned.
The key generation, for reference:

{code:java}
String key = subjectGeometryLiteral.getLiteralLexicalForm() + KEY_SEPARATOR + 
predicate.getURI() + KEY_SEPARATOR + 
objectGeometryLiteral.getLiteralLexicalForm();
{code}

My suggestion is to use a separate {{Node -> Integer}} (or {{Long}}?) Guava 
cache and use the numeric IDs instead of the lexical forms to generate the 
cache key. Or any other more efficient data structure; I'm not even sure a 
String is necessary.

We tried a fix which works for us and keeps the memory consumption stable:

{code:java}
// Maps each Node to a small integer ID, so cache keys need not embed lexical forms.
private LoadingCache<Node, Integer> nodeIDCache;
private AtomicInteger cacheCounter;

...
cacheCounter = new AtomicInteger(0);
nodeIDCache = CacheBuilder.newBuilder()
        .expireAfterWrite(MAP_EXPIRY_INTERVAL_DEFAULT, TimeUnit.MILLISECONDS)
        .build(new CacheLoader<>() {
            @Override
            public Integer load(Node key) {
                // First sighting of this node: hand out the next ID.
                return cacheCounter.incrementAndGet();
            }
        });
{code}
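
To illustrate how such IDs could replace the concatenated String key, here is a minimal, self-contained sketch. It is not the patch above and not Jena code: the names {{NodeIdKeySketch}}, {{TripleKey}}, {{idOf}} and {{keyFor}} are hypothetical, a plain {{ConcurrentHashMap}} stands in for the Guava {{LoadingCache}} (so there is no expiry), and the node type is a generic parameter so that plain Strings can stand in for {{Node}}.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: builds compact cache keys from per-node integer IDs. */
final class NodeIdKeySketch<N> {

    /** Immutable three-int key; stands in for the concatenated String key. */
    static final class TripleKey {
        private final int subjectId;
        private final int predicateId;
        private final int objectId;

        TripleKey(int subjectId, int predicateId, int objectId) {
            this.subjectId = subjectId;
            this.predicateId = predicateId;
            this.objectId = objectId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof TripleKey)) return false;
            TripleKey other = (TripleKey) o;
            return subjectId == other.subjectId
                    && predicateId == other.predicateId
                    && objectId == other.objectId;
        }

        @Override
        public int hashCode() {
            int h = subjectId;
            h = 31 * h + predicateId;
            h = 31 * h + objectId;
            return h;
        }
    }

    // Assumption: stand-in for the Guava LoadingCache<Node, Integer>; entries never expire here.
    private final Map<N, Integer> nodeIds = new ConcurrentHashMap<>();
    private final AtomicInteger counter = new AtomicInteger(0);

    /** Returns the ID of a node, assigning the next free ID on first sight. */
    int idOf(N node) {
        return nodeIds.computeIfAbsent(node, n -> counter.incrementAndGet());
    }

    /** Builds a key from three IDs instead of concatenating lexical forms. */
    TripleKey keyFor(N subject, N predicate, N object) {
        return new TripleKey(idOf(subject), idOf(predicate), idOf(object));
    }
}
```

With a key like this, each cache entry holds three ints rather than a copy of two WKT strings plus the predicate URI. Whether the ID lookup itself is cheap depends on how the node type implements {{hashCode}} and {{equals}}, which this sketch does not address.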





  was:
Using a GeoSPARQL query with a geospatial property function, e.g.


{code:java}
SELECT * {
:x geo:hasGeometry ?geo1 .
?s2 geo:hasGeometry ?geo2 .
?geo1 geo:sfContains ?geo2
}
{code}


leads to heavy memory consumption on larger datasets - and we're not talking 
about big data at all. Imagine a given polygon and checking millions of 
geometries for containment in that polygon.

In the {{QueryRewriteIndex}} class, a key is generated for caching, but this 
is horribly expensive: the string representation of the geometries is 
requested millions of times, creating millions of byte arrays and possibly 
leading to an OOM exception - we got one with 8GB assigned.
The key generation, for reference:

{code:java}
String key = subjectGeometryLiteral.getLiteralLexicalForm() + KEY_SEPARATOR + 
predicate.getURI() + KEY_SEPARATOR + 
objectGeometryLiteral.getLiteralLexicalForm();
{code}

My suggestion is to use a separate {{Node -> Integer}} (or {{Long}}) Guava 
cache and use the numeric IDs instead of the lexical forms to generate the 
cache key. Or any other more efficient data structure; I'm not even sure a 
String is necessary.





> query rewrite index does too expensive caching on geo literals
> --------------------------------------------------------------
>
>                 Key: JENA-2311
>                 URL: https://issues.apache.org/jira/browse/JENA-2311
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: GeoSPARQL
>    Affects Versions: Jena 4.4.0
>            Reporter: Lorenz Bühmann
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
