Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-09 Thread Simon Slavin

On 9 Nov 2014, at 1:49pm, Tristan Van Berkom  wrote:

> This year in particular I've been faced with my first queries of modest
> complexity, see for example this (temporary) paste:
> 
>http://www.fpaste.org/148918/41545194/
> 
> This is the beginnings of a query which will try to fetch an available
> employee for a given 'task', ordered eventually by their relevance,
> first set 'is it their job ?' second set 'do they have the licenses and
> formations to actually perform the task, even if it's not their job' ?
> and so on and so forth.
> 
> I've found that when trying to bend my brain around these problems of
> modest complexity, they are only understandable if I visualize them in
> terms of 'sets of rows' and how they join to each other.

I haven't looked into your specific example in depth, but you should also be 
aware of another solution to this problem: Make a good but simple SQL query 
that gets you some way to your data, then parse the resulting rows using your 
programming language.  This can lead to a far simpler program which is faster 
to write, easier to debug, and more flexible than what can be done entirely 
inside SQL.

Just because one can write complicated iterating queries with sub-SELECTs in 
SQL doesn't means one should.  There comes a point where it's batter to do the 
work inside whatever programming language you're using.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-09 Thread Tristan Van Berkom
On Sun, 2014-11-09 at 15:04 +0200, RSmith wrote:
> On 2014/11/09 14:11, Tristan Van Berkom wrote:
> >> A good SQL rule of thumb: if you can think of a way, so can the DBMS. "... 
> >> no opportunity to make a good guess" is not true. In 
> >> some sense, SQLite has had 10 years to make a good guess, and often does. 
> >> A nested select need not be materialized as a "table", 
> >> opaque or otherwise. It can be converted, or "flattened" under some 
> >> circumstances. SQLite's query planner isn't the most 
> >> sophisticated; no one suggests otherwise. It does not follow that every 
> >> JOIN will outperform every EXISTS or vice versa. 
> > Indeed this is a large misconception on my part, on the queries which
> > I *have* profiled, it did turn out that JOINs were more effective than
> > solutions which involve nesting select statements.
> 
> This seems like the appropriate place to interject with a Statistics / 
> Sample-size comment, but I think the point is obvious.
> 
> > Anyway, I appreciate the input and will try to accept that I should not
> > be in control of how the query is run - I was under the impression that
> > SQL engines can perform better when given more context about how the
> > query should run (i.e. being more explicit with JOINs), but I do agree
> > that, at least ideally, the planner should be able to make a better
> > guess as to how to plot a query with a more relaxed/vague statement,
> > than with a more explicit one.
> 
> Woah... this is not at all what James tried to say and it might just be that 
> your choice of words is unfortunate and you did mean it 
> correctly, but just to be sure, allow me to elaborate somewhat:
> 
> The query should in no way be "Vague" or "Relaxed", it should be precise, 
> succint, exact and fully correct making the intended 
> result impossible to misinterpret.
[...]

No you're right, it's a poor choice of words on my part, I don't mean
vague in terms of what exact set of rows I want to extract, but rather
vague in terms of 'what will happen when I issue this query'.

Which I understand, and do agree, is something that ideally the query
author should not have to trouble himself with.

This year in particular I've been faced with my first queries of modest
complexity, see for example this (temporary) paste:

http://www.fpaste.org/148918/41545194/

This is the beginnings of a query which will try to fetch an available
employee for a given 'task', ordered eventually by their relevance,
first set 'is it their job ?' second set 'do they have the licenses and
formations to actually perform the task, even if it's not their job' ?
and so on and so forth.

I've found that when trying to bend my brain around these problems of
modest complexity, they are only understandable if I visualize them in
terms of 'sets of rows' and how they join to each other.

You'll note that I have nested the UNIONs in a select specifically
because there are common JOINs I want to perform on the collective
result, thus reducing the redundancy of the query (here I am again,
trying to out-optimize the query planner).

Anyway, I think you've all collectively succeeded in pointing out the
error of my ways ;-)

I will try to formulate more straight forward queries which only ask
for the succinct set of data that I want to pull, rather than the 
queries I've been writing which are probably more constrictive to
the planner for possibly no justifiable reason.

Cheers,
-Tristan


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-09 Thread RSmith


On 2014/11/09 14:11, Tristan Van Berkom wrote:
A good SQL rule of thumb: if you can think of a way, so can the DBMS. "... no opportunity to make a good guess" is not true. In 
some sense, SQLite has had 10 years to make a good guess, and often does. A nested select need not be materialized as a "table", 
opaque or otherwise. It can be converted, or "flattened" under some circumstances. SQLite's query planner isn't the most 
sophisticated; no one suggests otherwise. It does not follow that every JOIN will outperform every EXISTS or vice versa. 

Indeed this is a large misconception on my part, on the queries which
I *have* profiled, it did turn out that JOINs were more effective than
solutions which involve nesting select statements.


This seems like the appropriate place to interject with a Statistics / 
Sample-size comment, but I think the point is obvious.


Anyway, I appreciate the input and will try to accept that I should not
be in control of how the query is run - I was under the impression that
SQL engines can perform better when given more context about how the
query should run (i.e. being more explicit with JOINs), but I do agree
that, at least ideally, the planner should be able to make a better
guess as to how to plot a query with a more relaxed/vague statement,
than with a more explicit one.


Woah... this is not at all what James tried to say and it might just be that your choice of words is unfortunate and you did mean it 
correctly, but just to be sure, allow me to elaborate somewhat:


The query should in no way be "Vague" or "Relaxed", it should be precise, succint, exact and fully correct making the intended 
result impossible to misinterpret. What it should not be is convoluted with additional hot air that attempts to better explain to 
the QP how to do the query... that is where we trust the QP to do it's job as long as the question is not Vague and indeed very 
precise and concise.


Sometimes the Query planner is faced with situations where an algebraically correct result can be achieved via different ways 
(picking/using an Index (if any) is a good case-in-point) and sometimes the best one is not obvious from the query or table size. In 
these cases, providing hints to it - such as running Analyze (Stat3 or 4 tables) and providing direct hints (such as 'Likely' and 
'Likelihood') - is a good way for a programmer (or "Query Author" I should say) to /help/ the QP decide on what is best - but these 
should be used sparingly and only where the improvement is substantial and the optimisation not premature.


Another consideration is that the QP gets updated almost every other month... you have no way of knowing whether your convoluted 
query (which may shave off a few milliseconds now) will still be the fastest next month when the QP is a bit smarter. The long-term 
best is to formulate the most precise and concise query possible and let the Engine do the work.



Cheers,
Ryan.


(PS: I don't mean to second-guess James, and the only reason I feel any comfort saying what he meant is that we just had an 
interesting debate about this elsewhere and I have come to appreciate the view above somewhat more).





___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-09 Thread Tristan Van Berkom
On Sat, 2014-11-08 at 14:27 -0500, James K. Lowden wrote:
> On Sun, 09 Nov 2014 00:45:16 +0900
> Tristan Van Berkom  wrote:
> 
> > While I do understand SQL as a functional language, most functional
> > programming I've done still has rather explicit syntax/rules, so I get
> > particularly uncomfortable with writing vague statements, such as
> > JOIN tableA, tableB WHERE ... without being exactly specific on the
> > heart/nature of the join which is happening.
> 
> Oh, I think you'll agree SQL has explicit syntax.  I think what you
> mean by "vague" is "nondeterministic with respect to the physical data
> structures and algorithms".  About that you're right; it's considered a
> feature.  :-)  
> 
> The idea is that the engine interpreting your SQL might not be
> absolutely as fast as the most optimal solution for your hardware and
> data at any one point in time.  But it will be nearly so, much more
> adaptable as hardware and data change, and thereby not trouble your
> application with issues outside its problem domain.  And of course
> there are other advantages besides, as you know.  
> 
> Partly it's a matter of trust.  You trust the OS to schedule your job
> fairly, to supply virtual memory, to deliver your TCP packet.  You
> trust SQLite to deliver on its ACID contract, and to produce logically
> correct result from queries.  Why not trust it to find the shortest
> path to your data?

Yeah, I'm only starting to catch on though :)

> 
> > Also what I've found in my limited experience is that nesting SELECT
> > statements, at least in SQLite, completely throws off the planner,
> > as in it has no opportunity to make a good guess and has to deal
> > with the result set of a nested SELECT as an opaque table, 
> 
> A good SQL rule of thumb: if you can think of a way, so can the
> DBMS.  "... no opportunity to make a good guess" is not true.  In some
> sense, SQLite has had 10 years to make a good guess, and often does.  
> 
> A nested select need not be materialized as a "table", opaque or
> otherwise.  It can be converted, or "flattened" under some
> circumstances.  SQLite's query planner isn't the most sophisticated; no
> one suggests otherwise.  It does not follow that every JOIN will
> outperform every EXISTS or vice versa.  

Indeed this is a large misconception on my part, on the queries which
I *have* profiled, it did turn out that JOINs were more effective than
solutions which involve nesting select statements.

> > is generally not an indexed table (or perhaps it is, but I wouldnt
> > know because those indexes don't seem to be declarative in any way).
> 
> I don't know what you're mean by indexes that "don't seem to be
> declarative".  

This was in context of the paragraph, i.e. it's conceivable that it
may make sense to build a temporary index on the result of a nested
select statement while accumulating results, and indeed, depending
on the outer query EXPLAIN tells me it builds a temporary B-TREE on
the inner query results... however there is no way to declare (afaics)
what kind of temporary index is to be built on the result set of
a nested inner select statement.

That's all that I meant by that.

> > So indeed, I am not comfortable with 'leaving it up to chance',
> > and if there is a way to get higher specificity, I try to achieve
> > that.
> 
> If I may be so bold, beware of your assumptions.  The fastest plan with
> 100 rows may perform very poorly with 100 million rows.  Be careful
> what you optimize for.  :-)  
> 
> My fans on this list (both of them) will be surprised, though, that I
> partly agree with you.  In terms of technology, there's very little
> middle ground between fopen(3) and SQL's "I give query you give data"
> contract.  SQL experts, no matter the platform, spend a fair amount of
> time coercing the system into using an efficient query plan.  They
> cannot say "apply criteria X to index Y and join to table T"; they must
> work by indirection, creating indexes and using "query hints" (or,
> sometimes, other query formations) until the planner does the "right
> thing", however defined.  I sometimes wish for a system that let me
> express the query algebraically and the order of operation explicitly,
> but afaik no such system exists except partially and grudgingly within
> SQL.  

Indeed, and here's an example of a similar struggle I had, where I 
discovered the secret '+' trick, on this very list:

  https://www.mail-archive.com/sqlite-users%40sqlite.org/msg80881.html

Anyway, I appreciate the input and will try to accept that I should not
be in control of how the query is run - I was under the impression that
SQL engines can perform better when given more context about how the
query should run (i.e. being more explicit with JOINs), but I do agree
that, at least ideally, the planner should be able to make a better
guess as to how to plot a query with a more relaxed/vague statement,
than with a more explicit one.

Thanks for the discussion :)

Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-08 Thread James K. Lowden
On Sun, 09 Nov 2014 00:45:16 +0900
Tristan Van Berkom  wrote:

> While I do understand SQL as a functional language, most functional
> programming I've done still has rather explicit syntax/rules, so I get
> particularly uncomfortable with writing vague statements, such as
> JOIN tableA, tableB WHERE ... without being exactly specific on the
> heart/nature of the join which is happening.

Oh, I think you'll agree SQL has explicit syntax.  I think what you
mean by "vague" is "nondeterministic with respect to the physical data
structures and algorithms".  About that you're right; it's considered a
feature.  :-)  

The idea is that the engine interpreting your SQL might not be
absolutely as fast as the most optimal solution for your hardware and
data at any one point in time.  But it will be nearly so, much more
adaptable as hardware and data change, and thereby not trouble your
application with issues outside its problem domain.  And of course
there are other advantages besides, as you know.  

Partly it's a matter of trust.  You trust the OS to schedule your job
fairly, to supply virtual memory, to deliver your TCP packet.  You
trust SQLite to deliver on its ACID contract, and to produce logically
correct result from queries.  Why not trust it to find the shortest
path to your data?  

> Also what I've found in my limited experience is that nesting SELECT
> statements, at least in SQLite, completely throws off the planner,
> as in it has no opportunity to make a good guess and has to deal
> with the result set of a nested SELECT as an opaque table, 

A good SQL rule of thumb: if you can think of a way, so can the
DBMS.  "... no opportunity to make a good guess" is not true.  In some
sense, SQLite has had 10 years to make a good guess, and often does.  

A nested select need not be materialized as a "table", opaque or
otherwise.  It can be converted, or "flattened" under some
circumstances.  SQLite's query planner isn't the most sophisticated; no
one suggests otherwise.  It does not follow that every JOIN will
outperform every EXISTS or vice versa.  

> is generally not an indexed table (or perhaps it is, but I wouldnt
> know because those indexes don't seem to be declarative in any way).

I don't know what you're mean by indexes that "don't seem to be
declarative".  

> So indeed, I am not comfortable with 'leaving it up to chance',
> and if there is a way to get higher specificity, I try to achieve
> that.

If I may be so bold, beware of your assumptions.  The fastest plan with
100 rows may perform very poorly with 100 million rows.  Be careful
what you optimize for.  :-)  

My fans on this list (both of them) will be surprised, though, that I
partly agree with you.  In terms of technology, there's very little
middle ground between fopen(3) and SQL's "I give query you give data"
contract.  SQL experts, no matter the platform, spend a fair amount of
time coercing the system into using an efficient query plan.  They
cannot say "apply criteria X to index Y and join to table T"; they must
work by indirection, creating indexes and using "query hints" (or,
sometimes, other query formations) until the planner does the "right
thing", however defined.  I sometimes wish for a system that let me
express the query algebraically and the order of operation explicitly,
but afaik no such system exists except partially and grudgingly within
SQL.  

--jkl
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-08 Thread Tristan Van Berkom
On Sat, 2014-11-08 at 09:46 -0700, Keith Medcalf wrote:
> On Saturday, 8 November, 2014 06:56, Tristan Van Berkom 
>  said:
> 
> >On Sat, 2014-11-08 at 06:23 -0700, Keith Medcalf wrote:
> >> How about the direct approach:
> >>
> >> SELECT uid
> >>   FROM resource
> >>  WHERE uid NOT IN (SELECT resource_uid
> >>  FROM event_participant, event
> >> WHERE event_participant.event_uid = event.uid
> >>   AND event.shift_uid = :shift_uid
> >>   AND event.date = :event_date)
> >>
> >> or perhaps
> >>
> >> SELECT uid
> >>   FROM resource
> >>  WHERE NOT EXISTS (SELECT 1
> >>  FROM event_participant, event
> >> WHERE event_participant.event_uid = event.uid
> >>   AND resource_uid = resource.uid
> >>   AND event.shift_uid = :shift_uid
> >>   AND event.date = :event_date)
> >>
> >> Is not the "right way to do it" the one that obtains the result
> >required, not the one that uses a "checkbox" implementation?
> >
> >I've presented 2 queries which both obtain the required result, and I
> >would have to say both of them have varying levels of correctness.
> >
> >If obtaining the correct result was all that was required, then I could
> >simply issue multiple queries and balloon my memory with result UIDs
> >in between each of them, all for the sake of better readability (but
> >then, I'm not sure why I would need a nice tool like SQLite to do it
> >in the first place).
> >
> >So I would have to say, the "right way to do it" is the most efficient
> >way, the one which provides SQLite with the best indications of how
> >to plot an efficient query plan.
> >
> >The heuristics of the table is generally that resources are rather
> >finite, while events grow over time (possibly 10 events each with
> >10 participants for 3 shifts per day, meaning after a year that
> >table will be large, and we'll still want to find the available
> >resources hopefully in under 30ms, ideally 10ms).
> >
> >So yeah, speed at query time (not insert/update/delete time), is very
> >important.
> 
> While either of the above are direct translations, if you wish to minimize 
> materialization of intermediate data then the second would be preferable 
> since it uses only inner joins.  However, the first may be far more efficient 
> depending on the shape of your data and the number of rows, how much memory 
> is available for SQLite to use (Bytes, KBytes, MBytes), and the speed of your 
> I/O devices.
> 
> Given that selecting all the resources based on your stated heuristics of 
> growth, having proper indexes, and the constraints given in the query, I 
> would say that the first NOT IN will be more efficient since it is only 
> building a list of a (couple) hundred or so exclusions, and NOT IN is very 
> efficient.  The second form will use a few thousand bytes less memory but 
> perform significantly more I/O.
> 
> Theoretically, of course, the execution plan of a query which obtains a 
> specified result should not depend on the phrasing of the query.  They should 
> all result in the same execution plan and the same result.  However, this is 
> not true in practice because of limitations in optimization.  Some optimizers 
> need a bit of help in figuring out how to generate an optimum plan, and some 
> others will generate overly complicated and inefficient plans simply because 
> they prefer to use newly added or nifty features only available in that 
> particular database engine rather than a simpler, faster, more direct 
> solution.  
> 
> I don't know why an outer join would be preferable in any case -- you are 
> going to use extra CPU and I/O to perform a join on which a significant 
> proportion of the results will be discarded -- or depend on the optimizer to 
> notice this and "optimize" away the outer join thus ending up with a "more 
> direct" phrasing which could have been written in the first place (and is far 
> more understandable for future maintainers).
> 
> Many optimizers can only optimize within specific constraints and specifying 
> a cute complicated query will result in the most efficient cute and 
> complicated solution it can muster (ie, a local optimum, not a global 
> optimum).
> 
> Your statement 1 implements NOT IN indirectly by doing a LEFT JOIN then 
> discarding the intersection set.  This requires an extra join and discard 
> over a more direct NOT IN phrasing.  It is also more difficult to understand. 
>  It could be optimized into the NOT IN or NOT EXISTS form by the optimizer if 
> you are lucky (presently it will not).
> 
> Your statement 2 implement the correlated subquery NOT EXISTS by an indirect 
> expression, again using a LEFT JOIN then discarding the intersection results. 
>  Same comments regarding extra operations and optimizations that cannot 
> (presently) be done apply.
> 
> If you prefer the ON 

Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-08 Thread Darko Volaric
There's nothing vague about select statements, they're logical formulas
involving the data in your database and as exact as any other programming
language, albeit in a very different domain. Relational databases are based
on first order predicate logic and have operations that are are rigorously
defined.

You're obviously thinking about the limitations of the database's optimizer
and the opaque way it does its work, but just as a real time or embedded
programmer may have qualms about (the uncertainty of) the code produced by
a compiler and prefer hand rolled assembler, you want more control and
certainty over what the database does. The problem in both cases is that
the programmer assumes they can do better than the machine, which is highly
questionable given the current state of the art in both cases.

You can give the optimiser tips about how to go about its work and that can
help improve efficiency, but you are better off trusting the optimiser
rather than trying to fight it. If there is an actual problem with the
efficiency of a well written select, it's almost always rooted in the
design of the database schema, especially to much or too little
de-normalization.

On Sat, Nov 8, 2014 at 7:45 AM, Tristan Van Berkom  wrote:

> On Sat, 2014-11-08 at 10:23 -0500, James K. Lowden wrote:
> > On Sat, 08 Nov 2014 22:55:46 +0900
> > Tristan Van Berkom  wrote:
> >
> > > So I would have to say, the "right way to do it" is the most efficient
> > > way, the one which provides SQLite with the best indications of how
> > > to plot an efficient query plan.
> >
> > Keith is suggesting that the right way to do it is neither "any way that
> > works" nor necessarily "whatever is fastest" but "the clearest
> > formulation of the query".  Clarity has the salutary property of being
> > most likely to be correct (because understood by the human) and stands
> > a better than fair chance of being executed efficiently (because it
> > translates easily to a good query plan).
> >
> > Most of time -- not every time, but most of the time -- indexes
> > and table design matter much more to efficient execution than query
> > syntax. When a clearly expressed query is not executed efficiently in
> > the presence of useful indexes, and especially when a slightly different
> > one does, that's usually considered a defect of the query planner.
>
> I see what you're saying, as I mentioned in the initial email I do
> consider myself to be a relative newbie, and the majority of work
> I've been doing with SQL (until this year) has been working on
> optimizing existing schemas/queries for embedded use.
>
> While I do understand SQL as a functional language, most functional
> programming I've done still has rather explicit syntax/rules, so I get
> particularly uncomfortable with writing vague statements, such as
> JOIN tableA, tableB WHERE ... without being exactly specific on the
> heart/nature of the join which is happening.
>
> Also what I've found in my limited experience is that nesting SELECT
> statements, at least in SQLite, completely throws off the planner,
> as in it has no opportunity to make a good guess and has to deal
> with the result set of a nested SELECT as an opaque table, which
> is generally not an indexed table (or perhaps it is, but I wouldnt
> know because those indexes don't seem to be declarative in any way).
>
> So indeed, I am not comfortable with 'leaving it up to chance',
> and if there is a way to get higher specificity, I try to achieve that.
>
> In any case, I do have queries which work well at this point, but
> posted this question to the list in the hope I could find the right
> specificity for the given query - it's not a very big deal, I'll
> probably just stick with what works and try to fix it in the places
> where profiling reveals that I'm doing something wrong.
>
> Cheers,
> -Tristan
>
>
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-08 Thread Keith Medcalf

On Saturday, 8 November, 2014 06:56, Tristan Van Berkom 
 said:

>On Sat, 2014-11-08 at 06:23 -0700, Keith Medcalf wrote:
>> How about the direct approach:
>>
>> SELECT uid
>>   FROM resource
>>  WHERE uid NOT IN (SELECT resource_uid
>>  FROM event_participant, event
>> WHERE event_participant.event_uid = event.uid
>>   AND event.shift_uid = :shift_uid
>>   AND event.date = :event_date)
>>
>> or perhaps
>>
>> SELECT uid
>>   FROM resource
>>  WHERE NOT EXISTS (SELECT 1
>>  FROM event_participant, event
>> WHERE event_participant.event_uid = event.uid
>>   AND resource_uid = resource.uid
>>   AND event.shift_uid = :shift_uid
>>   AND event.date = :event_date)
>>
>> Is not the "right way to do it" the one that obtains the result
>required, not the one that uses a "checkbox" implementation?
>
>I've presented 2 queries which both obtain the required result, and I
>would have to say both of them have varying levels of correctness.
>
>If obtaining the correct result was all that was required, then I could
>simply issue multiple queries and balloon my memory with result UIDs
>in between each of them, all for the sake of better readability (but
>then, I'm not sure why I would need a nice tool like SQLite to do it
>in the first place).
>
>So I would have to say, the "right way to do it" is the most efficient
>way, the one which provides SQLite with the best indications of how
>to plot an efficient query plan.
>
>The heuristics of the table is generally that resources are rather
>finite, while events grow over time (possibly 10 events each with
>10 participants for 3 shifts per day, meaning after a year that
>table will be large, and we'll still want to find the available
>resources hopefully in under 30ms, ideally 10ms).
>
>So yeah, speed at query time (not insert/update/delete time), is very
>important.

While either of the above are direct translations, if you wish to minimize 
materialization of intermediate data then the second would be preferable since 
it uses only inner joins.  However, the first may be far more efficient 
depending on the shape of your data and the number of rows, how much memory is 
available for SQLite to use (Bytes, KBytes, MBytes), and the speed of your I/O 
devices.

Given that selecting all the resources based on your stated heuristics of 
growth, having proper indexes, and the constraints given in the query, I would 
say that the first NOT IN will be more efficient since it is only building a 
list of a (couple) hundred or so exclusions, and NOT IN is very efficient.  The 
second form will use a few thousand bytes less memory but perform significantly 
more I/O.

Theoretically, of course, the execution plan of a query which obtains a 
specified result should not depend on the phrasing of the query.  They should 
all result in the same execution plan and the same result.  However, this is 
not true in practice because of limitations in optimization.  Some optimizers 
need a bit of help in figuring out how to generate an optimum plan, and some 
others will generate overly complicated and inefficient plans simply because 
they prefer to use newly added or nifty features only available in that 
particular database engine rather than a simpler, faster, more direct solution. 
 

I don't know why an outer join would be preferable in any case -- you are going 
to use extra CPU and I/O to perform a join on which a significant proportion of 
the results will be discarded -- or depend on the optimizer to notice this and 
"optimize" away the outer join thus ending up with a "more direct" phrasing 
which could have been written in the first place (and is far more 
understandable for future maintainers).

Many optimizers can only optimize within specific constraints and specifying a 
cute complicated query will result in the most efficient cute and complicated 
solution it can muster (ie, a local optimum, not a global optimum).

Your statement 1 implements NOT IN indirectly by doing a LEFT JOIN then 
discarding the intersection set.  This requires an extra join and discard over 
a more direct NOT IN phrasing.  It is also more difficult to understand.  It 
could be optimized into the NOT IN or NOT EXISTS form by the optimizer if you 
are lucky (presently it will not).

Your statement 2 implement the correlated subquery NOT EXISTS by an indirect 
expression, again using a LEFT JOIN then discarding the intersection results.  
Same comments regarding extra operations and optimizations that cannot 
(presently) be done apply.

If you prefer the ON syntax, then you may of course use it (though, except in 
the case of a LEFT OUTER JOIN, it is merely syntactic sugar for a list of 
tables to join and where conditions.

SELECT uid
  FROM resource
 WHERE uid NOT IN (SELECT resource_uid

Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-08 Thread Tristan Van Berkom
On Sat, 2014-11-08 at 10:23 -0500, James K. Lowden wrote:
> On Sat, 08 Nov 2014 22:55:46 +0900
> Tristan Van Berkom  wrote:
> 
> > So I would have to say, the "right way to do it" is the most efficient
> > way, the one which provides SQLite with the best indications of how
> > to plot an efficient query plan.
> 
> Keith is suggesting that the right way to do it is neither "any way that
> works" nor necessarily "whatever is fastest" but "the clearest
> formulation of the query".  Clarity has the salutary property of being
> most likely to be correct (because understood by the human) and stands
> a better than fair chance of being executed efficiently (because it
> translates easily to a good query plan).  
> 
> Most of time -- not every time, but most of the time -- indexes
> and table design matter much more to efficient execution than query
> syntax. When a clearly expressed query is not executed efficiently in
> the presence of useful indexes, and especially when a slightly different
> one does, that's usually considered a defect of the query planner.

I see what you're saying, as I mentioned in the initial email I do
consider myself to be a relative newbie, and the majority of work
I've been doing with SQL (until this year) has been working on
optimizing existing schemas/queries for embedded use.

While I do understand SQL as a functional language, most functional
programming I've done still has rather explicit syntax/rules, so I get
particularly uncomfortable with writing vague statements, such as
JOIN tableA, tableB WHERE ... without being exactly specific on the
heart/nature of the join which is happening.

Also what I've found in my limited experience is that nesting SELECT
statements, at least in SQLite, completely throws off the planner,
as in it has no opportunity to make a good guess and has to deal
with the result set of a nested SELECT as an opaque table, which
is generally not an indexed table (or perhaps it is, but I wouldnt
know because those indexes don't seem to be declarative in any way).

So indeed, I am not comfortable with 'leaving it up to chance',
and if there is a way to get higher specificity, I try to achieve that.

In any case, I do have queries which work well at this point, but
posted this question to the list in the hope I could find the right
specificity for the given query - it's not a very big deal, I'll
probably just stick with what works and try to fix it in the places
where profiling reveals that I'm doing something wrong.

Cheers,
-Tristan


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-08 Thread James K. Lowden
On Sat, 08 Nov 2014 22:55:46 +0900
Tristan Van Berkom  wrote:

> So I would have to say, the "right way to do it" is the most efficient
> way, the one which provides SQLite with the best indications of how
> to plot an efficient query plan.

Keith is suggesting that the right way to do it is neither "any way that
works" nor necessarily "whatever is fastest" but "the clearest
formulation of the query".  Clarity has the salutary property of being
most likely to be correct (because understood by the human) and stands
a better than fair chance of being executed efficiently (because it
translates easily to a good query plan).  

Most of time -- not every time, but most of the time -- indexes
and table design matter much more to efficient execution than query
syntax. When a clearly expressed query is not executed efficiently in
the presence of useful indexes, and especially when a slightly different
one does, that's usually considered a defect of the query planner.  

--jkl
 
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-08 Thread Tristan Van Berkom
On Sat, 2014-11-08 at 06:23 -0700, Keith Medcalf wrote:
> How about the direct approach:
> 
> SELECT uid
>   FROM resource
>  WHERE uid NOT IN (SELECT resource_uid
>  FROM event_participant, event
> WHERE event_participant.event_uid = event.uid
>   AND event.shift_uid = :shift_uid
>   AND event.date = :event_date)
> 
> or perhaps 
> 
> SELECT uid
>   FROM resource
>  WHERE NOT EXISTS (SELECT 1
>  FROM event_participant, event
> WHERE event_participant.event_uid = event.uid
>   AND resource_uid = resource.uid
>   AND event.shift_uid = :shift_uid
>   AND event.date = :event_date)
> 
> Is not the "right way to do it" the one that obtains the result required, not 
> the one that uses a "checkbox" implementation?

I've presented 2 queries which both obtain the required result, and I
would have to say both of them have varying levels of correctness.

If obtaining the correct result was all that was required, then I could
simply issue multiple queries and balloon my memory with result UIDs
in between each of them, all for the sake of better readability (but
then, I'm not sure why I would need a nice tool like SQLite to do it
in the first place).

So I would have to say, the "right way to do it" is the most efficient
way, the one which provides SQLite with the best indications of how
to plot an efficient query plan.

The heuristics of the table is generally that resources are rather
finite, while events grow over time (possibly 10 events each with
10 participants for 3 shifts per day, meaning after a year that
table will be large, and we'll still want to find the available
resources hopefully in under 30ms, ideally 10ms).

So yeah, speed at query time (not insert/update/delete time), is very
important.

Best Regards,
-Tristan


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?

2014-11-08 Thread Keith Medcalf

How about the direct approach:

SELECT uid
  FROM resource
 WHERE uid NOT IN (SELECT resource_uid
 FROM event_participant, event
WHERE event_participant.event_uid = event.uid
  AND event.shift_uid = :shift_uid
  AND event.date = :event_date)

or perhaps 

SELECT uid
  FROM resource
 WHERE NOT EXISTS (SELECT 1
 FROM event_participant, event
WHERE event_participant.event_uid = event.uid
  AND resource_uid = resource.uid
  AND event.shift_uid = :shift_uid
  AND event.date = :event_date)

Is not the "right way to do it" the one that obtains the result required, not 
the one that uses a "checkbox" implementation?  Using the "checkbox" approach, 
someone might be of the opinion that the "right way to do it" is entirely 
within a hand-coded StartIO channel program implemented for a 3390 disk 
controller, and that is unlikely to be implementable on a bitty-box.


---
Theory is when you know everything but nothing works.  Practice is when 
everything works but no one knows why.  Sometimes theory and practice are 
combined:  nothing works and no one knows why.

>-Original Message-
>From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-
>boun...@sqlite.org] On Behalf Of Tristan Van Berkom
>Sent: Saturday, 8 November, 2014 03:17
>To: sqlite-users@sqlite.org
>Subject: [sqlite] What is the best alternative to this RIGHT OUTER JOIN ?
>
>Hi all,
>
>Today I've stumbled on a situation where I think I really need to use a
>RIGHT OUTER JOIN, and looking at all the examples on the internet I
>could find so far, I'm not finding a way to simulate it properly using
>LEFT OUTER JOINs.
>
>So I thought, before I commit to an inefficient alternative I would
>check to see if someone on this list has an idea of how I can form an
>efficient query for this.
>
>I'll start by highlighting some of the schema which is relevant to the
>query:
>
>CREATE TABLE shift (
>  uid INTEGER PRIMARY KEY AUTOINCREMENT,
>
>  ... name and start time/duration of a shift ...
>);
>
>CREATE TABLE resource (
>  uid INTEGER PRIMARY KEY AUTOINCREMENT,
>
>  ... details about the resource ...
>);
>
>CREATE TABLE event (
>  uidINTEGER  PRIMARY KEY AUTOINCREMENT,
>  date   INTEGER  NOT NULL, /* unix timestamp */
>  shift_uid  INTEGER  NOT NULL REFERENCES shift (uid),
>
>  ... other details about the event ...
>);
>
>CREATE TABLE IF NOT EXISTS event_participant (
>  id  INTEGER CHECK (id > 0),
>  event_uid   INTEGER REFERENCES event (uid) ON DELETE CASCADE,
>  resource_uidINTEGER REFERENCES resource (uid),
>
>  ... other details about this participant ...
>
>  PRIMARY KEY (event_uid, id)
>);
>
>
>Now my goal here, is to pull out every resource UID which is _not_
>assigned to any event on a given date and shift UID.
>
>My first attempt that gets the job done looks like this:
>
>~  Statement 1 ~
>
>SELECT resource.uid
>
>/* All resources... */
>FROM   resource
>
>/* Subtract from this set all the booked resources */
>LEFT JOIN (
>
>/* All resources which are booked on the given date / shift_uid */
>SELECT participant.resource_uid AS uid
>FROM event
>JOIN event_participant
>  AS participant
>  ON (participant.event_uid = event.uid)
>WHERE (event.date  = 1411185600 AND
>   event.shift_uid = 1 AND
>   participant.resource_uid IS NOT NULL)
>
>) AS booked_resource ON (booked_resource.uid = resource.uid)
>
>WHERE booked_resource.uid IS NULL;
>
>~ EXPLAIN SAYS ~
>1|0|0|SEARCH TABLE event USING INDEX event_shift_uid_idx (shift_uid=?)
>(~2 rows)
>1|1|1|SEARCH TABLE event_participant AS participant USING INDEX
>event_participant_pk_idx (event_uid=?) (~5 rows)
>0|0|0|SCAN TABLE resource (~100 rows)
>0|1|1|SEARCH SUBQUERY 1 AS booked_resource USING AUTOMATIC COVERING
>INDEX (uid=?) (~2 rows)
>
>
>Now, this is not all that bad, except that I try to avoid nesting select
>statements in cases like this as it prevents the query planner from
>doing anything intelligent.
>
>
>So I made another attempt which discards the nested statement:
>
>~  Statement 2 ~
>
>SELECT resource.uid
>
>/* All Resources */
>FROM resource
>
>/* Get the cartesian product of