Re: improving query performance

Anthony Schiller Thu, 25 Apr 2019 09:36:05 -0700

Additional investigation into the performance of joins revealed that
queries are better served by setting the configuration parameter
“setUseJoinSelectivity” on
org.apache.rya.accumulo.AccumuloRdfConfigurationBuilder to true. To
work with this configuration I needed to set up the selectivity table
(as per the documentation here
https://github.com/apache/incubator-rya/blob/master/dao/accumulo.rya/src/main/java/org/apache/rya/accumulo/AbstractAccumuloRdfConfigurationBuilder.java
). The selectivity table in turn requires a prospects table to have
been generated, which I managed using the guidelines supplied here
https://github.com/apache/incubator-rya/blob/master/extras/rya.manual/src/site/markdown/eval.md.

For 9.8 million entries to each triple table, I generated a prospects
table with approximately 22.5 million entries and a selectivity table
with 93.2 million entries.

With the above in place, when I run a query the message “Entering join
optimizer!” is logged just before the query plan is printed. It is
reported by
org.apache.rya.rdftriplestore.evaluation.QueryJoinSelectOptimizer, which
is perhaps where the magic happens. Sadly, the response times of
my queries do not reflect this; on the contrary, the queries are
taking far longer to run. For example, a query that used to run in c. 6
minutes has yet to complete after c. 50 minutes.

I am happy to supply any further details or information to help identify
why things are not performing as one might expect. Please let me know
what would help in this process.

Thanks for your support,
Anthony

On Thu, 2019-04-18 at 19:06 +0100, Anthony Schiller wrote:
> Hi, I'm working through an evaluation of Rya for my team here at
> Exfo. As part of this we are running some benchmarks to compare Rya
> to other graph stores: https://merck.github.io/Halyard/
> and https://www.stardog.com/.
>
> When it comes to query performance we have a range of queries that
> vary in complexity. For some, Rya seems to return the correct result
> quickly, but for others it takes far longer than we might have
> expected, and far longer than Halyard and Stardog. It is certainly
> possible we are not making best use of the tools that Rya offers to
> demonstrate the performance it is capable of providing for these
> queries. There are certainly a lot of variables at play, and we could
> well be making heavy work of some aspects of servicing these queries.
> One concern I have is the level of traffic running between my test
> node (NOT part of the Accumulo cluster), from which I run a query
> through the Rya Sail, and the (Accumulo) data nodes. I infer this
> from the following log output being frequently reported on the data
> nodes:
>
> 2019-04-18 16:03:45,419 [tserver.TabletServer] DEBUG: MultiScanSess
> 192.168.X.YYY:44646 0 entries in 0.00 secs (lookup_time:0.00 secs
> tablets:1 ranges:1,000)
>
> where 192.168.X.YYY is my test node, and from the traffic I observe
> using iftop (http://www.ex-parrot.com/~pdw/iftop/).
>
> I thought that the query would largely be serviced locally on the
> data nodes, with some result merging happening as a final step as
> results are returned from the data nodes. I was anticipating that the
> queries would return quickly for the 9.8 million quad-statements we
> have loaded into the data nodes, since it should be possible to
> perform full scans of this volume of data in short order.
> Understandably, though, if the data is being pushed from the data
> nodes to my test node for query evaluation, this overhead will impact
> performance.
>
> I attach the following:
> 1) Ambari blueprint cluster configurations
> 2) Screenshots from the Accumulo web-ui for some insight into how
> this data resides across 4 data nodes
> 3) A query plan that takes c. 7 minutes to return results.
>
> Please let me know of other information that would help to
> investigate this further.
>
> Many thanks for your direction and help in advance,
> Anthony
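
One way to put numbers on the traffic concern raised in the quoted message is to tally the MultiScanSess debug lines per client address. The following is a minimal Python sketch under stated assumptions: the regular expression is derived only from the single sample line quoted above (which would be one physical line in the tablet server log, not wrapped as in the email), the layout may differ between Accumulo versions, and the names `MULTISCAN`, `to_int` and `summarize` are hypothetical helpers rather than part of Rya or Accumulo.

```python
import re
from collections import defaultdict

# Layout assumed from the one sample line quoted in the thread; adjust the
# pattern if the tablet server in use logs these sessions differently.
MULTISCAN = re.compile(
    r"MultiScanSess\s+(?P<client>\S+?):\d+\s+"
    r"(?P<entries>[\d,]+) entries in [\d.]+ secs\s+"
    r"\(lookup_time:[\d.]+ secs\s+"
    r"tablets:(?P<tablets>[\d,]+)\s+ranges:(?P<ranges>[\d,]+)\)"
)


def to_int(text):
    """Convert a count such as '1,000' to the integer 1000."""
    return int(text.replace(",", ""))


def summarize(lines):
    """Tally scan sessions, empty sessions, entries and ranges per client."""
    totals = defaultdict(
        lambda: {"scans": 0, "empty": 0, "entries": 0, "ranges": 0}
    )
    for line in lines:
        match = MULTISCAN.search(line)
        if match is None:
            continue  # not a MultiScanSess line
        entries = to_int(match["entries"])
        stats = totals[match["client"]]
        stats["scans"] += 1
        if entries == 0:
            stats["empty"] += 1
        stats["entries"] += entries
        stats["ranges"] += to_int(match["ranges"])
    return dict(totals)


if __name__ == "__main__":
    import sys

    # Usage: python tally_multiscan.py < tserver_debug.log
    for client, stats in summarize(sys.stdin).items():
        print(client, stats)
```

If most sessions from the test node carry around 1,000 ranges and return few or no entries, that may indicate that the joins are being driven from the client as batches of lookups, which would be consistent with the traffic seen in iftop. This is an interpretation to verify, not a confirmed diagnosis.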
