[
https://issues.apache.org/jira/browse/JENA-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lorenz Bühmann updated JENA-2311:
---------------------------------
Description:
Using a GeoSPARQL query with a geospatial property function, e.g.
{code:java}
SELECT * {
  :x geo:hasGeometry ?geo1 .
  ?s2 geo:hasGeometry ?geo2 .
  ?geo1 geo:sfContains ?geo2
}
{code}
leads to heavy memory consumption on larger datasets, and we're not talking
about big data at all. Imagine taking a single polygon and checking millions
of geometries for containment in it.
In the {{QueryRewriteIndex}} class a key is generated for caching, but this is
very expensive: the string representation of each geometry literal is
requested millions of times, which creates millions of byte arrays and can
end in an OOM exception. We hit it with 8 GB assigned.
The key generation for reference:
{code:java}
String key = subjectGeometryLiteral.getLiteralLexicalForm() + KEY_SEPARATOR +
predicate.getURI() + KEY_SEPARATOR +
objectGeometryLiteral.getLiteralLexicalForm();
{code}
My suggestion is to use a separate {{Node -> Integer}} (or {{Long}}?) Guava
cache and build the cache key from those numeric IDs instead. Any other more
efficient data structure would do as well; I'm not even sure a String is
necessary.
We tried a fix that works for us and keeps memory consumption stable:
{code:java}
private LoadingCache<Node, Integer> nodeIDCache;
private AtomicInteger cacheCounter;
...
cacheCounter = new AtomicInteger(0);
CacheBuilder<Object, Object> builder = CacheBuilder.newBuilder();
if (maxSize > 0) {
    builder = builder.maximumSize(maxSize);
}
if (expiryInterval > 0) {
    builder = builder.expireAfterWrite(expiryInterval,
            TimeUnit.MILLISECONDS);
}
nodeIDCache = builder.build(
        new CacheLoader<>() {
            public Integer load(Node key) {
                return cacheCounter.incrementAndGet();
            }
        });
{code}
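To illustrate how such IDs could replace the lexical forms in the key, here is
a minimal JDK-only sketch. It is not the actual patch: the class name
{{NodeIdKeySketch}}, the methods {{idFor}} and {{keyFor}}, and the separator
value "@" are hypothetical, plain {{Object}} stands in for Jena's {{Node}},
and a {{ConcurrentHashMap}} replaces the Guava cache, so there is no size
limit or expiry here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: builds short cache keys from per-node integer IDs
// instead of concatenating the (potentially huge) geometry lexical forms.
final class NodeIdKeySketch {
    // Assumed separator; the real KEY_SEPARATOR value is not shown in the issue.
    private static final String KEY_SEPARATOR = "@";

    // Stand-in for the Guava LoadingCache<Node, Integer>; no eviction here.
    private final Map<Object, Integer> nodeIds = new ConcurrentHashMap<>();
    private final AtomicInteger counter = new AtomicInteger(0);

    // Returns the ID already assigned to this node, or assigns the next one.
    int idFor(Object node) {
        return nodeIds.computeIfAbsent(node, n -> counter.incrementAndGet());
    }

    // The key contains three small integers, so its size no longer depends
    // on the length of the geometry literals.
    String keyFor(Object subject, Object predicate, Object object) {
        return idFor(subject) + KEY_SEPARATOR
                + idFor(predicate) + KEY_SEPARATOR
                + idFor(object);
    }
}
```

One caveat with any evicting variant of this idea: if an entry is dropped from
the ID cache and the same node is seen again, it receives a fresh ID, so keys
built from the old ID simply stop matching and the cached result is
recomputed.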
was:
Using a GeoSPARQL query with a geospatial property function, e.g.
{code:java}
SELECT * {
  :x geo:hasGeometry ?geo1 .
  ?s2 geo:hasGeometry ?geo2 .
  ?geo1 geo:sfContains ?geo2
}
{code}
leads to heavy memory consumption on larger datasets, and we're not talking
about big data at all. Imagine taking a single polygon and checking millions
of geometries for containment in it.
In the {{QueryRewriteIndex}} class a key is generated for caching, but this is
very expensive: the string representation of each geometry literal is
requested millions of times, which creates millions of byte arrays and can
end in an OOM exception. We hit it with 8 GB assigned.
The key generation for reference:
{code:java}
String key = subjectGeometryLiteral.getLiteralLexicalForm() + KEY_SEPARATOR +
predicate.getURI() + KEY_SEPARATOR +
objectGeometryLiteral.getLiteralLexicalForm();
{code}
My suggestion is to use a separate {{Node -> Integer}} (or {{Long}}?) Guava
cache and build the cache key from those numeric IDs instead. Any other more
efficient data structure would do as well; I'm not even sure a String is
necessary.
We tried a fix that works for us and keeps memory consumption stable:
{code:java}
private LoadingCache<Node, Integer> nodeIDCache;
private AtomicInteger cacheCounter;
...
cacheCounter = new AtomicInteger(0);
nodeIDCache = CacheBuilder.newBuilder()
        .expireAfterWrite(MAP_EXPIRY_INTERVAL_DEFAULT,
                TimeUnit.MILLISECONDS)
        .build(
                new CacheLoader<>() {
                    public Integer load(Node key) {
                        return cacheCounter.incrementAndGet();
                    }
                });
{code}
> query rewrite index does too expensive caching on geo literals
> --------------------------------------------------------------
>
> Key: JENA-2311
> URL: https://issues.apache.org/jira/browse/JENA-2311
> Project: Apache Jena
> Issue Type: Improvement
> Components: GeoSPARQL
> Affects Versions: Jena 4.4.0
> Reporter: Lorenz Bühmann
> Priority: Major
>
> Using a GeoSPARQL query with a geospatial property function, e.g.
> {code:java}
> SELECT * {
>   :x geo:hasGeometry ?geo1 .
>   ?s2 geo:hasGeometry ?geo2 .
>   ?geo1 geo:sfContains ?geo2
> }
> {code}
> leads to heavy memory consumption on larger datasets, and we're not talking
> about big data at all. Imagine taking a single polygon and checking millions
> of geometries for containment in it.
> In the {{QueryRewriteIndex}} class a key is generated for caching, but this is
> very expensive: the string representation of each geometry literal is
> requested millions of times, which creates millions of byte arrays and can
> end in an OOM exception. We hit it with 8 GB assigned.
> The key generation for reference:
> {code:java}
> String key = subjectGeometryLiteral.getLiteralLexicalForm() + KEY_SEPARATOR +
> predicate.getURI() + KEY_SEPARATOR +
> objectGeometryLiteral.getLiteralLexicalForm();
> {code}
> My suggestion is to use a separate {{Node -> Integer}} (or {{Long}}?) Guava
> cache and build the cache key from those numeric IDs instead. Any other more
> efficient data structure would do as well; I'm not even sure a String is
> necessary.
> We tried a fix that works for us and keeps memory consumption stable:
> {code:java}
> private LoadingCache<Node, Integer> nodeIDCache;
> private AtomicInteger cacheCounter;
> ...
> cacheCounter = new AtomicInteger(0);
> CacheBuilder<Object, Object> builder = CacheBuilder.newBuilder();
> if (maxSize > 0) {
>     builder = builder.maximumSize(maxSize);
> }
> if (expiryInterval > 0) {
>     builder = builder.expireAfterWrite(expiryInterval,
>             TimeUnit.MILLISECONDS);
> }
> nodeIDCache = builder.build(
>         new CacheLoader<>() {
>             public Integer load(Node key) {
>                 return cacheCounter.incrementAndGet();
>             }
>         });
> {code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)