On 31/01/13 14:58, Simon Helsen wrote:
> Hi guys,
>
> I have a generic question about having large strings as objects in
> triples. It is not entirely clear to me what the ramifications are if TDB
> indexes triples with very large objects (typically of some string type).
> We currently have an internal discussion about this because it seems that
> in the past we essentially blocked triples with a very large string object
> from ending up in TDB in the first place (right now, the artificial limit
> is a string length of 1024). In most cases, clients would not put any fancy
> filters on such large strings in their SPARQL query, but they would still
> want to retrieve the large string object. Still, even in this use case, it
> is not clear how this would negatively affect performance, both in
> terms of memory and CPU.
TDB has no internal limit on the length of a literal's lexical form.
Literal length does not affect indexing (indexes are on NodeId, fixed
length 8 bytes). It does affect loading (more bytes!) and total system
resources (beware of OOME). If 1K+ literals form a significant part of
the database, it will be slower as all of those costs accumulate.
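
To illustrate why literal length does not change index size, here is a
toy sketch in Python. It is not TDB's code: the ToyNodeTable class, the
index_entry helper and the example.org terms are all made up for
illustration, and the only thing taken from the above is that an index
entry is built from fixed-length 8-byte ids.

```python
import struct


class ToyNodeTable:
    """Toy model (an assumption, not TDB's implementation): maps each
    distinct RDF term to a fixed-width 8-byte id."""

    def __init__(self):
        self._ids = {}     # term -> integer id
        self._terms = []   # integer id -> term

    def node_id(self, term: str) -> bytes:
        # Allocate a new id the first time a term is seen.
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        # Pack as an unsigned 64-bit big-endian integer: always 8 bytes.
        return struct.pack(">Q", self._ids[term])

    def term(self, node_id: bytes) -> str:
        return self._terms[struct.unpack(">Q", node_id)[0]]


table = ToyNodeTable()


def index_entry(s: str, p: str, o: str) -> bytes:
    """One index entry is three ids, whatever the length of the terms."""
    return table.node_id(s) + table.node_id(p) + table.node_id(o)


short_literal = '"ok"'
long_literal = '"' + "x" * 100_000 + '"'

entry_short = index_entry("<http://example.org/s1>", "<http://example.org/p>", short_literal)
entry_long = index_entry("<http://example.org/s2>", "<http://example.org/p>", long_literal)
```

Both entries come out the same size; the 100,000-character string is
only paid for once, in the table that maps ids back to terms, which is
where the loading and memory costs show up.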
But the "not in most cases" case can be very bad: searching large
literals by regex is expensive, as I bet the query may be "find all
such that regex", which has to test every candidate literal.
Consider using an additional index, LARQ style.
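
As a rough sketch of the difference, in Python rather than SPARQL, and
not using LARQ's API: the regex_scan and index_lookup functions and the
sample literals are invented for illustration. The scan touches every
literal on every query; the word index is built once and then answers
word lookups directly.

```python
import re
from collections import defaultdict

# Hypothetical sample data: key -> large literal (kept short here).
literals = {
    "doc1": "TDB stores large literals in the node table",
    "doc2": "A regex filter has to touch every candidate literal",
    "doc3": "An inverted index maps words to the literals containing them",
}


def regex_scan(pattern: str) -> list:
    """'Find all such that regex': tests the pattern against every literal."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sorted(key for key, text in literals.items() if rx.search(text))


# Additional index, built once: word -> set of keys whose literal contains it.
word_index = defaultdict(set)
for key, text in literals.items():
    for word in re.findall(r"\w+", text.lower()):
        word_index[word].add(key)


def index_lookup(word: str) -> list:
    """Whole-word lookup: one dictionary access, no scan over the literals."""
    return sorted(word_index.get(word.lower(), ()))
```

The trade-off is that the index only answers the kinds of question it
was built for (whole words here), while the regex can express anything
but pays for it on every literal.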
If the clients really are just storing large literals in RDF, not
searching, then consider putting in an indirection and keeping them in
a key-value blob store.
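
A minimal sketch of that indirection, in Python with in-memory stand-ins.
Everything named here is hypothetical: the urn:example:blob: prefix, the
add_with_indirection and resolve helpers, and the use of a SHA-256 hash
as the key are assumptions for illustration, not anything TDB provides.

```python
import hashlib

blob_store = {}   # stand-in for an external key-value blob store
triples = set()   # stand-in for the triple store

BLOB_PREFIX = "urn:example:blob:"  # hypothetical URI scheme for blob keys


def add_with_indirection(s: str, p: str, text: str, threshold: int = 1024) -> str:
    """Store small strings inline; store large ones in the blob store and
    put only a short key in the triple. Returns the object actually stored."""
    if len(text) <= threshold:
        triples.add((s, p, text))
        return text
    # Content-derived key: the same text always maps to the same blob.
    key = BLOB_PREFIX + hashlib.sha256(text.encode("utf-8")).hexdigest()
    blob_store[key] = text
    triples.add((s, p, key))
    return key


def resolve(obj: str) -> str:
    """Follow the indirection if the object is a blob key, else return it as is."""
    if obj.startswith(BLOB_PREFIX):
        return blob_store[obj]
    return obj
```

Clients then make one extra lookup to fetch the text, and the RDF store
only ever sees short, fixed-size objects. The cost is that the text is
no longer visible to SPARQL filters at all, which is fine only if
nobody searches it.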
You will need to try it to know the exact impact in your usage. 1024 is
not really that large.
Andy