Hi,
One of our customers has an application that has highly localized graph
patterns, so they partition their data into named graphs and primarily perform
graph-level read and replace operations. They are trying to make a transition
from RDF4J and Neptune to Rya and have seen some performance issues. I stood up
a local Rya instance in my Ubuntu VM and got some measurements:
1. Loaded 11K datasets averaging about 120 triples each (total 1.4 million
triples)
2. Post-insertion named graph fetch - 3.9 seconds. (RDF4J time was less than
a second)
3. Compacted all the tables, average fetch of a graph - 1.9 seconds
4. Rya stores the graph name in the column family, so a full fetch of a
named graph is a range-less scan with a specified column family. To remove Rya
from the equation, I wrote a small test program that did an equivalent column
family scan. Average time - 1.9 seconds, so it appears Rya overhead is
negligible. I tried variations: a single range scanner, then a batch scanner
with a single range specified, then just the column family - same results
5. Furthermore, the query did not speed up with repetition, i.e. no index
warming effect
6. Modified my graph fetch query from
construct { ?s ?p ?o } where { graph <http://my/graph> { ?s ?p ?o }}
to
construct { ?s ?p ?o } where { graph <http://my/graph> { ?s a ?type; ?p ?o }}
(which produced the exact same RDF output)
This would execute as a range scan on the po table (using the rdf:type
predicate prefix), followed by a guided batch scan on the spo table on the
found subjects.
Total execution time = 0.85 seconds. After repetition = 0.46 seconds as the
indices warmed
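To make the plan in step 6 concrete, here is a minimal sketch in plain Java that uses a TreeMap as a stand-in for Accumulo's sorted key space. It is not Rya or Accumulo code: the class name, the SEP separator, the key layouts and the helper methods are all assumptions made for illustration, and the graph name is kept as the map value to play the role of the column family.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

public class TwoStepFetchSketch {
    // Hypothetical separator between key parts; not Rya's actual row encoding.
    static final String SEP = "\u0000";
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Range scan over all keys that start with the prefix
    // (assumes no key contains the character \uffff).
    static SortedMap<String, String> prefixScan(TreeMap<String, String> table, String prefix) {
        return table.subMap(prefix, prefix + '\uffff');
    }

    // Index one triple in both tables; the value holds the graph name.
    static void add(TreeMap<String, String> po, TreeMap<String, String> spo,
                    String g, String s, String p, String o) {
        po.put(p + SEP + o + SEP + s, g);
        spo.put(s + SEP + p + SEP + o, g);
    }

    static List<String> fetchGraph(TreeMap<String, String> po, TreeMap<String, String> spo,
                                   String graph) {
        // Step 1: range scan on the po table using the rdf:type predicate prefix,
        // keeping the subjects whose entry belongs to the requested graph.
        TreeSet<String> subjects = new TreeSet<>();
        for (Map.Entry<String, String> e : prefixScan(po, RDF_TYPE + SEP).entrySet()) {
            if (e.getValue().equals(graph)) {
                String key = e.getKey();
                subjects.add(key.substring(key.lastIndexOf(SEP) + 1));
            }
        }
        // Step 2: one guided range scan on the spo table per subject found.
        List<String> triples = new ArrayList<>();
        for (String s : subjects) {
            for (Map.Entry<String, String> e : prefixScan(spo, s + SEP).entrySet()) {
                if (e.getValue().equals(graph)) {
                    triples.add(e.getKey().replace(SEP, " "));
                }
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        TreeMap<String, String> po = new TreeMap<>();
        TreeMap<String, String> spo = new TreeMap<>();
        add(po, spo, "http://my/graph", "a", RDF_TYPE, "T");
        add(po, spo, "http://my/graph", "a", "name", "x");
        add(po, spo, "http://my/graph", "b", "name", "y"); // subject without rdf:type
        add(po, spo, "http://other/graph", "c", RDF_TYPE, "T");
        for (String t : fetchGraph(po, spo, "http://my/graph")) {
            System.out.println(t);
        }
    }
}
```

Note that subject b is not returned, because it has no rdf:type statement in the graph. The rewritten query is therefore equivalent to the plain graph fetch only when every subject in the graph is typed, which happened to hold for the test data here.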
So what I see is that Accumulo handles a range scan much better than a column
family scan, so much so that even running two scans and a join is still faster.
It seems that if we wanted to get decent performance on graph fetches, we would
have to generate a `gspo` table or something similar.
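As a rough illustration of the `gspo` idea, the sketch below again uses a plain Java TreeMap in place of an Accumulo table. The class name, the SEP separator and the key layouts are hypothetical, and modeling the column family scan as a filter over every entry is an assumption about why it is slow, not something measured. With the graph name leading the key, the fetch becomes a single prefix range that touches only the entries of that graph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GspoSketch {
    // Hypothetical separator between key parts; not Rya's actual row encoding.
    static final String SEP = "\u0000";

    // Number of entries touched by the most recent fetch.
    static int visited;

    // Index one triple in both layouts (one graph per triple in this sketch).
    static void add(TreeMap<String, String> spo, TreeMap<String, String> gspo,
                    String g, String s, String p, String o) {
        spo.put(s + SEP + p + SEP + o, g);             // graph kept beside the key
        gspo.put(g + SEP + s + SEP + p + SEP + o, ""); // graph leads the key
    }

    // spo layout: walk every entry and keep those whose graph matches.
    static List<String> fetchByFamilyFilter(TreeMap<String, String> spo, String graph) {
        visited = 0;
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : spo.entrySet()) {
            visited++;
            if (e.getValue().equals(graph)) {
                out.add(e.getKey().replace(SEP, " "));
            }
        }
        return out;
    }

    // gspo layout: a single prefix range on the graph name
    // (assumes no key contains the character \uffff).
    static List<String> fetchByGraphPrefix(TreeMap<String, String> gspo, String graph) {
        visited = 0;
        String prefix = graph + SEP;
        List<String> out = new ArrayList<>();
        for (String key : gspo.subMap(prefix, prefix + '\uffff').keySet()) {
            visited++;
            out.add(key.substring(prefix.length()).replace(SEP, " "));
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> spo = new TreeMap<>();
        TreeMap<String, String> gspo = new TreeMap<>();
        add(spo, gspo, "http://my/graph", "s1", "p", "o1");
        add(spo, gspo, "http://my/graph", "s2", "p", "o2");
        add(spo, gspo, "http://my/graph2", "s3", "p", "o3");
        add(spo, gspo, "http://other/graph", "s4", "p", "o4");

        List<String> filtered = fetchByFamilyFilter(spo, "http://my/graph");
        System.out.println(filtered + " after visiting " + visited + " entries");
        List<String> ranged = fetchByGraphPrefix(gspo, "http://my/graph");
        System.out.println(ranged + " after visiting " + visited + " entries");
    }
}
```

The trade-off is the usual one for an extra index: every write has to maintain one more table, in exchange for graph fetches that no longer depend on the total size of the store.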
Any ideas of another approach to improve the performance of this type of query?
PS. Here is my test code,
import org.apache.accumulo.core.client.*;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;
import java.util.Collections;
import java.util.Map;

public class ScanPerfTest {
    public static void main(String[] args) {
        String instanceName = "accumulo";
        String zooServers = "localhost";
        Instance inst = new ZooKeeperInstance(instanceName, zooServers);
        try {
            Connector con = inst.getConnector("rya", new PasswordToken("rya"));
            Scanner s = con.createScanner("sa_ts_spo", new Authorizations());
            try {
                // s.setRange(new Range(
                //         new Key(new Text(new byte[]{})),
                //         new Key(new Text(new byte[]{(byte) 0xff}))));
                s.fetchColumnFamily(new Text("http://my/graph"));
                long start = System.currentTimeMillis();
                int triples = 0;
                for (Map.Entry<Key, Value> e : s) {
                    // System.out.println(e.getKey().getRow().toString());
                    triples++;
                }
                System.out.println("Read " + triples + " triples in " +
                        (System.currentTimeMillis() - start) + "ms");
            } finally {
                s.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Boris Pelakh
Ontologist, Developer, Software Architect
[email protected]
+1-321-243-3804