Thanks for the idea, I will give it a shot. It's not really the intended use 
of locality groups, since the named graph list is dynamic and large, rather 
than the statically partitioned list you would have with traditional column 
families. I am going to try partitioning based on a short hash of the graph 
name, see if I can get a decent distribution, and then measure how it affects 
performance.
I guess if that works, then newly created graphs (which would initially slot 
into the default locality group) can be reassigned periodically as a batch job 
and cleaned up during compaction.
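
For concreteness, here is roughly the shape of that batch job. This is just a 
sketch: the bucket count, the group names, and the assignGroups method are 
placeholders of mine, not anything Rya provides.

import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

import java.util.*;

public class LocalityGroupAssigner {

    // Assumed bucket count; small enough to keep the group list manageable.
    private static final int NUM_GROUPS = 16;

    // Bucket each graph's column family into one of NUM_GROUPS locality
    // groups via a short hash of the graph name, then apply the mapping.
    static void assignGroups(Connector con, String table,
                             Collection<String> graphNames) throws Exception {
        Map<String, Set<Text>> groups = new HashMap<>();
        for (String graph : graphNames) {
            int bucket = Math.floorMod(graph.hashCode(), NUM_GROUPS);
            groups.computeIfAbsent("g" + bucket, k -> new HashSet<>())
                  .add(new Text(graph));
        }
        con.tableOperations().setLocalityGroups(table, groups);
        // Compaction is what rewrites existing files into the new groups,
        // which lines up with the cleanup step above.
        con.tableOperations().compact(table, null, null, true, false);
    }
}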

What was your motivation for switching to a Mongo-based storage schema? Does 
it have advantages over Accumulo with regard to scalability or features? Is 
there documentation of the storage schema in the code base?

Boris Pelakh
Ontologist, Developer, Software Architect
[email protected]
+1-321-243-3804


-----Original Message-----
From: Puja Valiyil <[email protected]> 
Sent: Wednesday, May 8, 2019 12:32 PM
To: [email protected]
Subject: Re: Named graph performance issue

Hi Boris,
Did you try configuring Accumulo to use locality groups?  I think that groups 
column family values in the same files, which may help in your case.  Sorry if 
I'm completely off base here; I've been in MongoDB land for so long I may have 
lost touch with how the Accumulo version of Rya works.
Thanks,
Puja

Sent from my iPhone

> On May 8, 2019, at 12:01 PM, Boris Pelakh <[email protected]> 
> wrote:
> 
> Hi,
>  
> One of our customers has an application with highly localized graph 
> patterns, so they partition their data into named graphs and primarily 
> perform graph-level read and replace operations. They are trying to 
> transition from RDF4J and Neptune to Rya and have seen some performance 
> issues. I stood up a local Rya instance in my Ubuntu VM and got some 
> measurements:
>  
> - Loaded 11K datasets averaging about 120 triples each (1.4 million 
>   triples total).
> - Post-insertion named graph fetch: 3.9 seconds (RDF4J time was less than 
>   a second).
> - Compacted all the tables; average fetch of a graph: 1.9 seconds.
> - Rya stores the graph name in the column family, so a full fetch of a 
>   named graph is a range-less scan with a specified column family.
> - Removed Rya from the equation: wrote a small test program that did an 
>   equivalent column family scan. Average time: 1.9 seconds, so it appears 
>   the Rya overhead is negligible.
> - Tried variations using a single range scanner, then a batch scanner with 
>   a single range specified, then just the column family: same results. 
>   Furthermore, the query did not speed up with repetition, i.e. no index 
>   warming effect.
> - Modified my graph fetch query from
>       construct { ?s ?p ?o } where { graph <http://my/graph> { ?s ?p ?o }}
>   to
>       construct { ?s ?p ?o } where { graph <http://my/graph> { ?s a ?type; ?p ?o }}
>   (which produced exactly the same RDF output). This would execute as a 
>   range scan on the po table (using the rdf:type predicate prefix), 
>   followed by a guided batch scan on the spo table for the found subjects. 
>   Total execution time: 0.85 seconds; 0.46 seconds after repetition, as 
>   the indices warmed.
>  
> So, what I see is that Accumulo handles a range scan much better than a 
> column family scan, so much so that even running two scans and a join is 
> still faster. It seems that if we wanted decent performance on graph 
> fetches, we would have to generate a `gspo` table or something similar.
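> Purely illustrative, to make the gspo idea concrete (the table name and 
> key layout here are hypothetical, nothing Rya has today): with the row 
> built as graph \0 subject \0 predicate \0 object, a whole-graph fetch 
> becomes one contiguous prefix scan instead of a column family scan.
> 
>     // Needs org.apache.accumulo.core.data.Mutation plus the client
>     // BatchWriter imports in addition to those in the test program below.
>     String graph = "http://my/graph", subj = "s", pred = "p", obj = "o";
>     String row = graph + "\u0000" + subj + "\u0000" + pred + "\u0000" + obj;
>     BatchWriter bw = con.createBatchWriter("sa_ts_gspo", new BatchWriterConfig());
>     Mutation m = new Mutation(row);
>     m.put(new Text(""), new Text(""), new Value(new byte[0]));
>     bw.addMutation(m);
>     bw.close();
>     // Whole-graph fetch: one range scan over the graph-name prefix.
>     Scanner sc = con.createScanner("sa_ts_gspo", new Authorizations());
>     sc.setRange(Range.prefix(graph + "\u0000"));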
>  
> Any ideas of another approach to improve the performance of this type of 
> query?
>  
> PS. Here is my test code,
> import org.apache.accumulo.core.client.*;
> import org.apache.accumulo.core.client.security.tokens.PasswordToken;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Range;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.security.Authorizations;
> import org.apache.hadoop.io.Text;
>  
> import java.util.Collections;
> import java.util.Map;
>  
> public class ScanPerfTest {
> 
>     public static void main(String[] args) {
>         String instanceName = "accumulo";
>         String zooServers = "localhost";
>         Instance inst = new ZooKeeperInstance(instanceName, zooServers);
> 
>         try {
>             Connector con = inst.getConnector("rya", new PasswordToken("rya"));
>             Scanner s = con.createScanner("sa_ts_spo", new Authorizations());
>             try {
>                 // Range variation (commented out; gave the same results):
> //                s.setRange(new Range(
> //                        new Key(new Text(new byte[]{})),
> //                        new Key(new Text(new byte[]{(byte) 0xff}))));
>                 // Restrict the scan to the named graph's column family.
>                 s.fetchColumnFamily(new Text("http://my/graph"));
>                 long start = System.currentTimeMillis();
>                 int triples = 0;
>                 // Iterate over every entry; only the count is needed.
>                 for (Map.Entry<Key, Value> e : s) {
>                     // System.out.println(e.getKey().getRow().toString());
>                     triples++;
>                 }
>                 System.out.println("Read " + triples + " triples in "
>                         + (System.currentTimeMillis() - start) + "ms");
>             } finally {
>                 s.close();
>             }
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
> }
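> 
> The batch scanner variation mentioned above looked roughly like this (the 
> thread count of 4 is arbitrary; everything else mirrors the plain Scanner):
> 
>     BatchScanner bs = con.createBatchScanner("sa_ts_spo", new Authorizations(), 4);
>     try {
>         // One all-encompassing range plus the column family restriction.
>         bs.setRanges(Collections.singleton(new Range()));
>         bs.fetchColumnFamily(new Text("http://my/graph"));
>         long start = System.currentTimeMillis();
>         int triples = 0;
>         for (Map.Entry<Key, Value> e : bs) {
>             triples++;
>         }
>         System.out.println("Read " + triples + " triples in "
>                 + (System.currentTimeMillis() - start) + "ms");
>     } finally {
>         bs.close();
>     }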
>  
>  
> Boris Pelakh
> Ontologist, Developer, Software Architect 
> [email protected]
> +1-321-243-3804
> 
>  
