Named graph performance issue

Boris Pelakh Wed, 08 May 2019 09:01:46 -0700

Hi,

One of our customers has an application that has highly localized graph 
patterns, so they partition their data into named graphs and primarily perform 
graph-level read and replace operations. They are trying to make a transition 
from RDF4J and Neptune to Rya and have seen some performance issues. I stood up 
a local Rya instance in my Ubuntu VM and got some measurements:



  1.  Loaded 11K datasets averaging about 120 triples each (total 1.4 million 
triples)
  2.  Post insertion named graph fetch - 3.9 seconds. (RDF4J time was less than 
a second)
  3.  Compacted all the tables, average fetch of a graph - 1.9 seconds
  4.  Rya stores the graph name in the column family, so a full fetch of a 
named graph is a range-less scan with a specified column family. To remove Rya 
from the equation, I wrote a small test program that did an equivalent column 
family scan. Average time - 1.9 seconds, so it appears Rya overhead is 
negligible. I tried variations: a single range scanner, a batch scanner with a 
single range specified, and just the column family - same results
  5.  Furthermore, the query did not speed up with repetition, i.e. no index 
warming effect
  6.  Modified my graph fetch query from
construct { ?s ?p ?o } where { graph <http://my/graph> { ?s ?p ?o }}
to
construct { ?s ?p ?o } where { graph <http://my/graph> { ?s a ?type; ?p ?o }}
(which produced the exact same RDF output)
This would execute as a range scan on the po table (using the rdf:type 
predicate prefix), followed by a guided batch scan on the spo table on the 
found subjects.
Total execution time = 0.85 seconds. After repetition = 0.46 seconds as the 
indices warmed
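
To make the two-step plan in item 6 concrete, here is a minimal sketch that uses in-memory sorted maps as stand-ins for the po and spo tables. The class name, the key delimiter, and the simplification that each triple lives in exactly one graph are assumptions made for illustration; this is not Rya's actual row encoding or query planner. Like the rewritten query, it only returns subjects that have an rdf:type in the graph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

public class TwoStepGraphFetch {
    // Illustrative delimiter between key parts (an assumption, not Rya's encoding).
    static final char SEP = '\u0000';
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Sorted row key -> graph name (the graph plays the role of the column family).
    final SortedMap<String, String> spo = new TreeMap<>();
    final SortedMap<String, String> po = new TreeMap<>();

    void add(String g, String s, String p, String o) {
        spo.put(s + SEP + p + SEP + o, g);
        po.put(p + SEP + o + SEP + s, g);
    }

    /** View of all entries whose key starts with the prefix (a sorted range lookup). */
    static SortedMap<String, String> prefixRange(SortedMap<String, String> table, String prefix) {
        return table.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    List<String> fetchGraph(String graph) {
        // Step 1: range scan on po over the rdf:type prefix, filtered by graph.
        TreeSet<String> subjects = new TreeSet<>();
        for (Map.Entry<String, String> e : prefixRange(po, RDF_TYPE + SEP).entrySet()) {
            if (e.getValue().equals(graph)) {
                String key = e.getKey();
                subjects.add(key.substring(key.lastIndexOf(SEP) + 1));
            }
        }
        // Step 2: one short range on spo per subject found, again filtered by graph.
        List<String> triples = new ArrayList<>();
        for (String s : subjects) {
            for (Map.Entry<String, String> e : prefixRange(spo, s + SEP).entrySet()) {
                if (e.getValue().equals(graph)) {
                    triples.add(e.getKey().replace(SEP, ' '));
                }
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        TwoStepGraphFetch store = new TwoStepGraphFetch();
        store.add("http://my/graph", "urn:a", RDF_TYPE, "urn:Person");
        store.add("http://my/graph", "urn:a", "urn:name", "Alice");
        store.add("http://other/graph", "urn:b", RDF_TYPE, "urn:Person");
        store.add("http://other/graph", "urn:b", "urn:name", "Bob");
        for (String t : store.fetchGraph("http://my/graph")) {
            System.out.println(t);
        }
    }
}
```

Both steps touch only the key ranges they need, whereas a range-less column family scan has to walk every key in the table and discard the ones from other graphs.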

So, what I see is that Accumulo is much better at range scans than at column 
family scans, so much so that even running 2 scans and a join is still faster. 
It seems that if we wanted to get decent performance on graph fetches, we would 
have to generate a `gspo` table or something similar.
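
As a rough illustration of the `gspo` idea, the sketch below puts the graph name first in the row key so that fetching a graph becomes a single prefix range. The class, the delimiter, and the sorted map standing in for an Accumulo table are all hypothetical; no such table exists in Rya as far as this message states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class GspoSketch {
    // Illustrative delimiter between key parts (an assumption, not Rya's encoding).
    static final char SEP = '\u0000';

    /** Hypothetical gspo row key: graph, subject, predicate, object. */
    static String gspoKey(String g, String s, String p, String o) {
        return g + SEP + s + SEP + p + SEP + o;
    }

    /** Range scan: only keys that start with graph + SEP are visited. */
    static List<String> fetchGraph(SortedMap<String, String> table, String graph) {
        String start = graph + SEP;
        String end = graph + (char) (SEP + 1); // first possible key after the prefix
        return new ArrayList<>(table.subMap(start, end).keySet());
    }

    public static void main(String[] args) {
        SortedMap<String, String> table = new TreeMap<>();
        table.put(gspoKey("http://my/graph", "urn:a", "urn:name", "Alice"), "");
        table.put(gspoKey("http://my/graph2", "urn:b", "urn:name", "Bob"), "");
        table.put(gspoKey("http://other/graph", "urn:c", "urn:name", "Carol"), "");
        for (String key : fetchGraph(table, "http://my/graph")) {
            System.out.println(key.replace(SEP, ' '));
        }
    }
}
```

The delimiter after the graph name keeps `http://my/graph2` out of the range for `http://my/graph`. The cost of such a table would be one more copy of every triple at load time.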

Any ideas of another approach to improve the performance of this type of query?

PS. Here is my test code:

import org.apache.accumulo.core.client.*;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

import java.util.Collections;
import java.util.Map;

public class ScanPerfTest {

    public static void main(String[] args) {
        String instanceName = "accumulo";
        String zooServers = "localhost";
        Instance inst = new ZooKeeperInstance(instanceName, zooServers);

        try {
            Connector con = inst.getConnector("rya", new PasswordToken("rya"));
            Scanner s = con.createScanner("sa_ts_spo", new Authorizations());
            try {
//                s.setRange(new Range(
//                        new Key(new Text(new byte[]{})),
//                        new Key(new Text(new byte[]{(byte) 0xff}))));
                s.fetchColumnFamily(new Text("http://my/graph"));
                long start = System.currentTimeMillis();
                int triples = 0;
                for (Map.Entry<Key, Value> e : s) {
                    // System.out.println(e.getKey().getRow().toString());
                    triples++;
                }
                System.out.println("Read " + triples + " triples in "
                        + (System.currentTimeMillis() - start) + "ms");
            } finally {
                s.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


Boris Pelakh
Ontologist, Developer, Software Architect
[email protected]
+1-321-243-3804