[ https://issues.apache.org/jira/browse/JENA-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lorenz Bühmann updated JENA-2311:
---------------------------------
Description:

Using a GeoSPARQL query with a geospatial property function, e.g.
{code:java}
SELECT * {
  :x geo:hasGeometry ?geo1 .
  ?s2 geo:hasGeometry ?geo2 .
  ?geo1 geo:sfContains ?geo2
}
{code}
leads to heavy memory consumption for larger datasets - and we're not talking about big data at all. Imagine being given a polygon and checking millions of geometries for containment in that polygon.

In the {{QueryRewriteIndex}} class, a key is generated for caching, but this is horribly expensive: the string representation of the geometries is computed millions of times, creating millions of byte arrays and potentially leading to an OOM exception - we hit one with 8 GB assigned.

The key generation, for reference:
{code:java}
String key = subjectGeometryLiteral.getLiteralLexicalForm() + KEY_SEPARATOR
        + predicate.getURI() + KEY_SEPARATOR
        + objectGeometryLiteral.getLiteralLexicalForm();
{code}
My suggestion is to use a separate {{Node -> Integer}} (or {{Long}}?) Guava cache and use those numeric values to generate the cache key instead. Or any other more efficient data structure - it is not even clear that a String key is necessary.

We tried a fix which works for us and keeps the memory consumption stable:
{code:java}
private LoadingCache<Node, Integer> nodeIDCache;
private AtomicInteger cacheCounter;
...
cacheCounter = new AtomicInteger(0);
CacheBuilder<Object, Object> builder = CacheBuilder.newBuilder();
if (maxSize > 0) {
    builder = builder.maximumSize(maxSize);
}
if (expiryInterval > 0) {
    builder = builder.expireAfterWrite(expiryInterval, TimeUnit.MILLISECONDS);
}
nodeIDCache = builder.build(
        new CacheLoader<>() {
            public Integer load(Node key) {
                return cacheCounter.incrementAndGet();
            }
        });
{code}

was:

Using a GeoSPARQL query with a geospatial property function, e.g.
{code:java}
SELECT * {
  :x geo:hasGeometry ?geo1 .
  ?s2 geo:hasGeometry ?geo2 .
  ?geo1 geo:sfContains ?geo2
}
{code}
leads to heavy memory consumption for larger datasets - and we're not talking about big data at all. Imagine being given a polygon and checking millions of geometries for containment in that polygon.

In the {{QueryRewriteIndex}} class, a key is generated for caching, but this is horribly expensive: the string representation of the geometries is computed millions of times, creating millions of byte arrays and potentially leading to an OOM exception - we hit one with 8 GB assigned.

The key generation, for reference:
{code:java}
String key = subjectGeometryLiteral.getLiteralLexicalForm() + KEY_SEPARATOR
        + predicate.getURI() + KEY_SEPARATOR
        + objectGeometryLiteral.getLiteralLexicalForm();
{code}
My suggestion is to use a separate {{Node -> Integer}} (or {{Long}}?) Guava cache and use those numeric values to generate the cache key instead. Or any other more efficient data structure - it is not even clear that a String key is necessary.

We tried a fix which works for us and keeps the memory consumption stable:
{code:java}
private LoadingCache<Node, Integer> nodeIDCache;
private AtomicInteger cacheCounter;
...
cacheCounter = new AtomicInteger(0);
nodeIDCache = CacheBuilder.newBuilder()
        .expireAfterWrite(MAP_EXPIRY_INTERVAL_DEFAULT, TimeUnit.MILLISECONDS)
        .build(
                new CacheLoader<>() {
                    public Integer load(Node key) {
                        return cacheCounter.incrementAndGet();
                    }
                });
{code}


> query rewrite index does too expensive caching on geo literals
> --------------------------------------------------------------
>
>                 Key: JENA-2311
>                 URL: https://issues.apache.org/jira/browse/JENA-2311
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: GeoSPARQL
>    Affects Versions: Jena 4.4.0
>            Reporter: Lorenz Bühmann
>            Priority: Major
>
> Using a GeoSPARQL query with a geospatial property function, e.g.
> {code:java}
> SELECT * {
>   :x geo:hasGeometry ?geo1 .
>   ?s2 geo:hasGeometry ?geo2 .
>   ?geo1 geo:sfContains ?geo2
> }
> {code}
> leads to heavy memory consumption for larger datasets - and we're not talking
> about big data at all. Imagine being given a polygon and checking millions of
> geometries for containment in that polygon.
> In the {{QueryRewriteIndex}} class, a key is generated for caching, but this
> is horribly expensive: the string representation of the geometries is
> computed millions of times, creating millions of byte arrays and potentially
> leading to an OOM exception - we hit one with 8 GB assigned.
> The key generation, for reference:
> {code:java}
> String key = subjectGeometryLiteral.getLiteralLexicalForm() + KEY_SEPARATOR
>         + predicate.getURI() + KEY_SEPARATOR
>         + objectGeometryLiteral.getLiteralLexicalForm();
> {code}
> My suggestion is to use a separate {{Node -> Integer}} (or {{Long}}?) Guava
> cache and use those numeric values to generate the cache key instead. Or any
> other more efficient data structure - it is not even clear that a String key
> is necessary.
> We tried a fix which works for us and keeps the memory consumption stable:
> {code:java}
> private LoadingCache<Node, Integer> nodeIDCache;
> private AtomicInteger cacheCounter;
> ...
> cacheCounter = new AtomicInteger(0);
> CacheBuilder<Object, Object> builder = CacheBuilder.newBuilder();
> if (maxSize > 0) {
>     builder = builder.maximumSize(maxSize);
> }
> if (expiryInterval > 0) {
>     builder = builder.expireAfterWrite(expiryInterval, TimeUnit.MILLISECONDS);
> }
> nodeIDCache = builder.build(
>         new CacheLoader<>() {
>             public Integer load(Node key) {
>                 return cacheCounter.incrementAndGet();
>             }
>         });
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
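For context, the node-to-ID keying proposed in the issue can be sketched with plain JDK collections - here a {{ConcurrentHashMap}} with {{computeIfAbsent}} stands in for the Guava {{LoadingCache}}, and the class and method names are illustrative, not Jena's actual API. The point is that each node's (possibly huge) lexical form is touched at most once, and every subsequent cache key is built from three small integers:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: assign each node (represented here as an opaque
// Object) a small integer ID the first time it is seen, then build rewrite
// cache keys from the IDs instead of concatenating full lexical forms.
public class NodeIdKeyDemo {
    private final ConcurrentMap<Object, Integer> nodeIdCache = new ConcurrentHashMap<>();
    private final AtomicInteger counter = new AtomicInteger(0);

    int idOf(Object node) {
        // computeIfAbsent plays the role of CacheLoader.load in the Guava version
        return nodeIdCache.computeIfAbsent(node, n -> counter.incrementAndGet());
    }

    // Compact key: three small ints instead of two (potentially huge) WKT strings
    String key(Object subjectGeom, Object predicate, Object objectGeom) {
        return idOf(subjectGeom) + "@" + idOf(predicate) + "@" + idOf(objectGeom);
    }

    public static void main(String[] args) {
        NodeIdKeyDemo demo = new NodeIdKeyDemo();
        String wkt = "POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))"; // stands in for a geometry literal
        String k1 = demo.key(wkt, "geo:sfContains", "POINT(1 1)");
        String k2 = demo.key(wkt, "geo:sfContains", "POINT(1 1)");
        System.out.println(k1.equals(k2)); // same nodes -> same short key: true
        System.out.println(k1.length());   // key length is independent of the WKT size
    }
}
{code}

Unlike the Guava {{LoadingCache}} in the proposed fix, this plain map never evicts; the issue's {{maximumSize}}/{{expireAfterWrite}} configuration is what keeps the ID map itself bounded in a long-running server.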