On 03/06/2011 02:33 PM, Ignacio Serantes wrote: > A better visibility check directly on the resource. This also requires a > nao:userVisible property on all resources which is done by a new class. > > > Great!, precisely this is the big problem I found with your queries and > you resolved adding a flag. Please, check the following simple queries > with one string tests. Strings are selected to obtain result sets with > different size. > > Test #01 [ API_H1, (dorama)]: 93 results in 0.06073117 > seconds > Test #02 [ API_H2, (dorama)]: 93 results in 1.84762883 > seconds > Test #03 [ API_H3, (dorama)]: 93 results in 2.01492691 seconds > Test #04 [ API_H4, (dorama)]: 93 results in 1.91813993 seconds > Test #05 [ API_H5, (dorama)]: 93 results in 1.60597801 seconds > Test #06 [ API, (dorama)]: 93 results in 2.23234606 seconds > Test #07 [ API_H1, (ha ji won)]: 2 results in 0.01688004 > seconds > Test #08 [ API_H2, (ha ji won)]: 2 results in 5.00310612 seconds > Test #09 [ API_H3, (ha ji won)]: 2 results in 4.11244702 seconds > Test #10 [ API_H4, (ha ji won)]: 2 results in 3.95332003 seconds > Test #11 [ API_H5, (ha ji won)]: 2 results in 3.90592694 seconds > Test #12 [ API, (ha ji won)]: 2 results in 11.18835807 seconds > Test #13 [ API_H1, (music)]: 5434 results in 1.39107108 seconds > Test #14 [ API_H2, (music)]: 5420 results in 4.74685478 seconds > Test #15 [ API_H3, (music)]: 5420 results in 5.27211499 seconds > Test #16 [ API_H4, (music)]: 5420 results in 5.14297509 seconds > Test #17 [ API_H5, (music)]: 5420 results in 4.99993420 seconds > Test #18 [ API, (music)]: 5420 results in 9.88414216 seconds > > The query is the same but I applied minor changes: > API_H1: is the same query without visibility inner join so easy to > wonder where is the main performance problem. > API_H2: is the query using a subquery and not an inner join. Query is > equivalent (A U B) X C = (A X C) U (B X C). Because subquery is the same > could be optimized by query optimizer and, because intermediate joins > result sets are small union is fast. This is easy to see with a simple > SELECT DISTINCT * in both queries and comparing result sets. > API_H3: is like H1 without using optional to obtain columns. > API_H4: is API_H3 with a different filter construction method. > API_H5: is API_H3 with another different filter construction method. > API: is the query created by the API so no additional explanation is > needed. It is the last one to give it cache advantage over other queries. > Note: API_H1 results number differs in "music" case because this query > is not equivalent to the other five.
As a first hacky query optimization you might try to use ?r a ?y . ?y nao:userVisible ?visiblity . filter(regex(?visibility),'true')) . looks weird but should do wonders because almost all resources are visible. The optimization I was talking about is inference on client level. (As I mentioned I am not doing much on the DB level yet.) It creates a nao:userVisible value for each resource. Thus, one can test the visibility directly on the resource. That combined with the filter above is a big performance improvement. Obviously it is hacky and a view as you mention seems a much better solution. Actually there is no dedicated index for nao:userVisible yet. The DB is totally generic. We are only beginning with the optimization. Your input is very very helpful. Cheers, Sebastian > This is an initial test and with my current db knowledge caution is > mandatory but here are some initial conclusions: > 1) As in SQL, subqueries increment performance > over indiscriminated inner joins in large results sets. I'm not sure if > this is general to all tripletstore dbms or only to Virtuoso that is a > rdbms with added tripletstore functionally. Seems like search API must > build queries use less joins and more subqueries. > 2) Optional seems haven't performance impact extracting column values > but queries are less easy to read and you must write more characters :). > 3) API_H3, API_H4 and API_H5 has similar times so without a profiling > tool and more test is difficult to wonder what is the best. Probably > query optimizer is doing it's job and, in fact, there is no differences > at all. > > In "sparql_test.spql" attach there are the queries. > > I go back to stored procedures because this problem could be solved > using Virtuoso. You have the need to filter for an value always so you > construct a relation any time you need to filter for this value. The > problem with this approach is that any time you do a query you must > build again this relation and this is time consuming. And the problem is > more and more serious when data grows. > > In my db there are 23.766 results to userVisible = True and doing all > queries with a join with a table like this is not a good idea and we > can't trust that query optimizers do all your job without any > help. There are some solutions to this problem at db level: > > 1) The (probably) ideal one, create a view in the dbms using your query > and use always this to filter your data. If query is very simple and > dbms is good this is automaintained and fast because view don't exists > and is only a different representation of data in your db. > > 2) If the first approach don't works, view is slow and general db > performance is degrading, you must create a table to store this data and > use stored procedures and triggers to maintain this table. A view from > first case and a physical table is the same at query language level so, > if we tried the first case, most of the work is done and your queries > need only minor changes or even none. > > 3) The easy one, add a property with and index to do the filter. You > told that this is your approach and works but you must be cautious > because now is visibility, later is user restriction, later devices > restriction and so on, and you can't solve all your filter problems with > properties as you can't solve all your query problems with joins. > Increase your size register and add so many indexes could have a > performance penalty when your data grows but note that this differs from > one dbms to other so there is no general rules. > > You can solve this problem in application layer too but, if the problem > is db related must be solved at db level if if possible. Obviously this > is a relaxed rule. > > Because, sadly, there is no perfect solution you must try > different approaches and use the solution that better suits to any case > and, here, profiling tools are your friends. About this, I will try to > activate Virtuoso web interface to activate profiling but I can't. I > download and compile Virtuoso and with my instance web interface is > available but I can't imagine how activate it in Nepomuk's Virtuoso > instance. I can copying the db and use it in my Virtuoso instance but > this is uncomfortably if I need doing changes in data using Nepomuk. > > I found many errors in soprano-virtuoso db log. I created a db from > scratch and this errors persists: no method of name existsNode, > getClobVal, getNumVal, getSchemaURL etc... I'm informing you because I > don't know if this errors are relevant or not. > > I'm sorry if I'm not much help but I'm doing my best. I never worked > before with Sparql and tripletstore but I'm learning. > > > Cheers, > Sebastian > _______________________________________________ > Nepomuk mailing list > [email protected] <mailto:[email protected]> > https://mail.kde.org/mailman/listinfo/nepomuk > > > > > -- > Cheers, > Ignacio > > > > > _______________________________________________ > Nepomuk mailing list > [email protected] > https://mail.kde.org/mailman/listinfo/nepomuk _______________________________________________ Nepomuk mailing list [email protected] https://mail.kde.org/mailman/listinfo/nepomuk
