Further investigation into join performance suggested that queries are better served by setting the configuration parameter “setUseJoinSelectivity” on org.apache.rya.accumulo.AccumuloRdfConfigurationBuilder to true. To use this setting I needed to set up the selectivity table (as per the documentation here: https://github.com/apache/incubator-rya/blob/master/dao/accumulo.rya/src/main/java/org/apache/rya/accumulo/AbstractAccumuloRdfConfigurationBuilder.java). The selectivity table in turn requires a prospects table to have been generated, which I managed using the guidelines supplied here: https://github.com/apache/incubator-rya/blob/master/extras/rya.manual/src/site/markdown/eval.md
For 9.8 million entries in each triple table, I generated a prospects table with approximately 22.5 million entries and a selectivity table with 93.2 million entries. With the above in place, when I run a query the following is logged just ahead of the query plan being printed: “Entering join optimizer!”. This is reported by org.apache.rya.rdftriplestore.evaluation.QueryJoinSelectOptimizer and is perhaps where the magic happens. Sadly, the response times of my queries do not reflect this; on the contrary, the queries are taking far longer to run. For example, a query that used to run in c. 6 minutes has yet to complete after c. 50 minutes.

I am happy to supply any further details or information to help identify why things are not performing as one might expect. Please let me know what would help in this process.

Thanks for your support,
Anthony

On Thu, 2019-04-18 at 19:06 +0100, Anthony Schiller wrote:
> Hi, I'm working through an evaluation of Rya for my team here at
> Exfo. As part of this we are running some benchmarks to compare Rya
> to other graph stores - https://merck.github.io/Halyard/
> and https://www.stardog.com/.
>
> When it comes to querying performance we have a range of queries
> that vary in complexity. For some, Rya seems to return the correct
> result quickly, but for others it takes far longer than we might
> have expected, far longer than Halyard and Stardog. It is certainly
> possible we are not making best use of the tools that Rya offers to
> demonstrate the performance it is capable of providing for these
> queries. There are certainly a lot of variables at play, and we
> could well be making heavy work of some aspects of servicing these
> queries.
>
> One concern I have is the level of traffic running between my test
> node (NOT part of the Accumulo cluster), from which I run a query
> through the Rya Sail, and the (Accumulo) data nodes. I infer this
> from the following log output being busily reported on the data
> nodes:
>
> 2019-04-18 16:03:45,419 [tserver.TabletServer] DEBUG: MultiScanSess
> 192.168.X.YYY:44646 0 entries in 0.00 secs (lookup_time:0.00 secs
> tablets:1 ranges:1,000)
>
> where 192.168.X.YYY is my test node, and from what I observe using
> iftop (http://www.ex-parrot.com/~pdw/iftop/).
>
> I thought that the query would largely be serviced locally on the
> data nodes, with some result merging happening as a final step as
> results are returned from the data nodes. I was anticipating the
> queries would return quickly for the 9.8 million quad-statements we
> have loaded into the data nodes, with it being possible to perform
> full scans of this volume of data in short order. Understandably,
> though, if the data is being pushed from the data nodes to my test
> node for query evaluation, this overhead will impact performance.
>
> I attach the following:
> 1) Ambari blueprint cluster configurations
> 2) Screenshots from the Accumulo web-ui for some insight into how
>    this data resides across 4 data nodes
> 3) A query plan that takes c. 7 minutes to return results.
>
> Please let me know of other information that would help to
> investigate this further.
>
> Many thanks for your direction and help in advance,
> Anthony
