[sqlite] Warnings for non-deterministic queries?

2015-11-28 Thread Simon Slavin

On 28 Nov 2015, at 4:19pm, Keith Medcalf  wrote:

> The long and the short of it is that the result is exactly correct (and 
> *always* entirely predictable).

The /result/ is exactly correct and entirely predictable.  If you consider the 
result to be just the thing you asked for.

How you get there is exactly correct and entirely predictable if and only if 
you've read and understood the source code to SQLite and understand exactly the 
content of your database file.  And the whole point of having someone else 
write you a DBMS is give you answers without you having to do that.

If you want something that you can entirely predict you're going to have to 
write and debug it yourself.

Simon.


[sqlite] Warnings for non-deterministic queries?

2015-11-28 Thread R Smith


On 2015/11/27 9:58 PM, Simon Slavin wrote:
> On 27 Nov 2015, at 6:30pm, R Smith  wrote:
>
>> Let me explain better, let's assume the query contains MAX(x)... The result 
>> will be from whichever row contains the max, and if the x was not in the 
>> aggregate function, it will be from some row (which may or may not be the 
>> one with the max value).
>> In both cases, you have NO chance of knowing which row that would be until 
>> the query finishes, so ALWAYS assume an arbitrary row (or 
>> Always-non-deterministic, to use the OP's words).
> You missed the general principle: there may be any number of rows with the 
> max value.  For example, 8,1,6,3,8,4,8,7,6,8 .  So even if you're using MAX 
> you still can't predict which row is supplying the other values.  So as you 
> (R Smith) says, always assume an arbitrary row.

Indeed you are correct - pardon me not making it clearer - the point was 
to demonstrate using an example that would "feel" deterministic and show 
that even in that case the arbitrary-pick should be expected or catered 
for. Your example further highlights this (and is probably the more 
common case).



On 2015/11/27 11:08 PM, T?r?k Edwin wrote:
> On 11/27/2015 08:30 PM, R Smith wrote:
>> On 2015/11/27 4:44 PM, Adam Devita wrote:
>>> I think the OP meant to write:
>>> "If the expression is an aggregate expression, it is evaluated across
>>> all rows in the group. Otherwise, it is evaluated against a single
>>> arbitrarily chosen row from within the group. "
>>>
>>> Is there a way I could programmatically determine that a query is
>>> going to use an arbitrarily chosen row from within the group at query
>>> prepare time?
> Thanks, non-deterministic was probably the wrong term to use.
> I wanted to find situations where a query's result depends on an 
> implementation detail of SQLite, and the behaviour is not fully specified by 
> the query itself,
> i.e. it could change from one version to the next, or even with same version 
> by slight changes to the DB internal structures.
> So far I know of two possible situations like this: the 'arbitrary row 
> choice' in the question above, and order of results in an unordered select.
>
> For the unordered select there is 'PRAGMA reverse_unordered_selects' that I 
> could use to find bugs, and I wanted to know if there is an equivalent 
> functionality for the arbitrary row choice.
> However as shown below this is probably the wrong question to ask, I should 
> avoid using columns that are not part of an aggregate function/group by in 
> the first place (if using group by).
>

That's a fair request and probably achievable, but a simple habit of not 
expecting specific row-picks should do.

Also, the row-order problem (and accompanying pragma) is not quite in 
the same class as the arbitrary-row-picked problem.

The simplest way to put this (that I can think of) is that with the 
unordered-select pragma, both the outputs BEFORE enabling it and AFTER 
enabling it is determinable and precise (at least within SQLite) for the 
chosen query plan. You can further be guaranteed that the outputs (where 
unordered) will be rearranged "the other way round" whenever two or more 
dissimilar rows are produced.

To enable some pragma that will perhaps pick a different arbitrary row 
depends on a lot of factors and cannot possibly be predetermined. There 
is no "other-way-round" that the picks can go and no guarantee the data 
will have other pick-able options, or indeed how many. i.e. your app 
might pass with flying colours now, and still later break when 
different/more data is encountered or the query optimizer picks a 
different query plan in future.

It should not be that hard to find all aggregated queries and check if 
the user (or code using the result) will depend on a column that isn't 
contained in an aggregate function. Even in a rather big system.

Best of luck!
Ryan




[sqlite] Warnings for non-deterministic queries?

2015-11-28 Thread Keith Medcalf

> Indeed you are correct - pardon me not making it clearer - the point was
> to demonstrate using an example that would "feel" deterministic and show
> that even in that case the arbitrary-pick should be expected or catered
> for. Your example further highlights this (and is probably the more
> common case).

The long and the short of it is that the result is exactly correct (and 
*always* entirely predictable).  There is no "arbitrary-pick".  If you think it 
is arbitrary or cannot explain the results obtained, the problem is that the 
question asked was ill-conceived.  You should never ask a question to which you 
do not already know the answer (or, in this case, compute the answer so as to 
know whether it is correct or not).






[sqlite] Warnings for non-deterministic queries?

2015-11-27 Thread Török Edwin
On 11/27/2015 08:30 PM, R Smith wrote:
> 
> 
> On 2015/11/27 4:44 PM, Adam Devita wrote:
>> I think the OP meant to write:
>> "If the expression is an aggregate expression, it is evaluated across
>> all rows in the group. Otherwise, it is evaluated against a single
>> arbitrarily chosen row from within the group. "
>>
>> Is there a way I could programmatically determine that a query is
>> going to use an arbitrarily chosen row from within the group at query
>> prepare time?

Thanks, non-deterministic was probably the wrong term to use.
I wanted to find situations where a query's result depends on an implementation 
detail of SQLite, and the behaviour is not fully specified by the query itself,
i.e. it could change from one version to the next, or even with same version by 
slight changes to the DB internal structures.
So far I know of two possible situations like this: the 'arbitrary row choice' 
in the question above, and order of results in an unordered select.

For the unordered select there is 'PRAGMA reverse_unordered_selects' that I 
could use to find bugs, and I wanted to know if there is an equivalent 
functionality for the arbitrary row choice.
However as shown below this is probably the wrong question to ask, I should 
avoid using columns that are not part of an aggregate function/group by in the 
first place (if using group by).

> 
> I think you are correct about what the OP meant. Which makes it easier to 
> answer:
> 
> There is no need to programmatically check whether the row selected will be 
> arbitrary or not. If the reference is inside an aggregate function, then it 
> is evaluated across all rows (i.e the result is some unknown), and if it 
> isn't, then an arbitrary row is picked and the result is equally unknown 
> before the query completes.
> 
> Programmatically you can just always assume the result will be "some" value 
> in the database.
> 
> Let me explain better, let's assume the query contains MAX(x)... The result 
> will be from whichever row contains the max, and if the x was not in the 
> aggregate function, it will be from some row (which may or may not be the one 
> with the max value).
> In both cases, you have NO chance of knowing which row that would be until 
> the query finishes, so ALWAYS assume an arbitrary row (or 
> Always-non-deterministic, to use the OP's words).
> 
> Other SQL engines enforces a more strict (and I have to say: more correct) 
> semantic where every term NOT in the GROUP BY clause MUST be contained in an 
> aggregate function.

Thanks, this is actually what I was looking for. Can I enforce/detect when a 
query adheres to this more strict semantics in SQLite?

> Either way, the returned result will never be pre-determinable by an onlooker 
> function (which isn't inspecting the data via other means) and as such there 
> is no difference between being contained in an aggregate function or not - 
> the resulting value will never be pre-determinable - and in the case of 
> SQLite (where not contained in an aggregate function), the result may even 
> differ from a previous run (according to documentation, though my experience 
> is that it always returns the exact same result, so it maintains determinism, 
> but you shouldn't bank on it.)
> 
> Bottom line: Just put the darn query fields in aggregate functions.

Good advice, using SUM() for all columns in this case would've avoided my 
original problem (even where only one row is expected because a primary key for 
same table was in group by).

> If you are making an SQLite wrapper, I would even advise enforcing this 
> behaviour even though SQLite can technically make do without it.

I'm not writing a wrapper, just adding some checks to my application from 
time-to-time that print more details when it is possible to determine what went 
wrong via an SQLite API.
For example if sqlite3_close fails with SQLITE_BUSY I loop through 
sqlite3_next_stmt and print the queries that were not finalized properly.

-- 
Edwin T?r?k | Co-founder and Lead Developer

Skylable open-source object storage: reliable, fast, secure
http://www.skylable.com


[sqlite] Warnings for non-deterministic queries?

2015-11-27 Thread Simon Slavin

On 27 Nov 2015, at 9:08pm, T?r?k Edwin  wrote:

> Thanks, non-deterministic was probably the wrong term to use.
> I wanted to find situations where a query's result depends on an 
> implementation detail of SQLite, and the behaviour is not fully specified by 
> the query itself,
> i.e. it could change from one version to the next, or even with same version 
> by slight changes to the DB internal structures.
> So far I know of two possible situations like this: the 'arbitrary row 
> choice' in the question above, and order of results in an unordered select.

Okay.  That's a sensible thing to want.  The bad news is that there are many of 
them.

For instance, in your post you mention "and order of results in an unordered 
select".  However, even in an ordered SELECT the order of results can change 
because two rows may have the same value.  The order that SQLite returns those 
rows can depend on the order in which the rows were INSERTed or UPDATEd, and on 
which indexes are available.

Another category of unpredictable things involve JOINs where the programmer of 
the JOIN assumed that only one row will satisfy the 'ON' clause.  Again 
depending on indexes and data changes a row in one table might be matched with 
one row for one command but another for another command.

Other unpredictable things are things involving random numbers (i.e. the random 
core function or external functions) and anything involving multiple threads or 
processes.

And just as you write, all the above behaviours can change in different 
versions of SQLite so even if you do detailed detective work using the current 
version it might all be obsolete next week.

Simon.


[sqlite] Warnings for non-deterministic queries?

2015-11-27 Thread Bernardo Sulzbach
On Fri, Nov 27, 2015 at 8:29 PM, Simon Slavin  wrote:
>
> And just as you write, all the above behaviours can change in different 
> versions of SQLite so even if you do detailed detective work using the 
> current version it might all be obsolete next week.
>
> Simon.
>

Exactly. One of the biggest advantages of not formally documenting
what will be selected (even if it at some point in time this was
"deterministic") is that the implementation can be more flexibly
manipulated (the developers didn't sign any contract).

A **warning** about this seems (to me) excessive. In the end of the
day, you shouldn't be relying on these arbitrary results anyway,
databases are dynamic (most of the tables are, at least) and you
should be prepared to handle a different query result from time to
time.


-- 
Bernardo Sulzbach


[sqlite] Warnings for non-deterministic queries?

2015-11-27 Thread R Smith


On 2015/11/27 4:44 PM, Adam Devita wrote:
> I think the OP meant to write:
> "If the expression is an aggregate expression, it is evaluated across
> all rows in the group. Otherwise, it is evaluated against a single
> arbitrarily chosen row from within the group. "
>
> Is there a way I could programmatically determine that a query is
> going to use an arbitrarily chosen row from within the group at query
> prepare time?

I think you are correct about what the OP meant. Which makes it easier 
to answer:

There is no need to programmatically check whether the row selected will 
be arbitrary or not. If the reference is inside an aggregate function, 
then it is evaluated across all rows (i.e the result is some unknown), 
and if it isn't, then an arbitrary row is picked and the result is 
equally unknown before the query completes.

Programmatically you can just always assume the result will be "some" 
value in the database.

Let me explain better, let's assume the query contains MAX(x)... The 
result will be from whichever row contains the max, and if the x was not 
in the aggregate function, it will be from some row (which may or may 
not be the one with the max value).
In both cases, you have NO chance of knowing which row that would be 
until the query finishes, so ALWAYS assume an arbitrary row (or 
Always-non-deterministic, to use the OP's words).

Other SQL engines enforces a more strict (and I have to say: more 
correct) semantic where every term NOT in the GROUP BY clause MUST be 
contained in an aggregate function. Either way, the returned result will 
never be pre-determinable by an onlooker function (which isn't 
inspecting the data via other means) and as such there is no difference 
between being contained in an aggregate function or not - the resulting 
value will never be pre-determinable - and in the case of SQLite (where 
not contained in an aggregate function), the result may even differ from 
a previous run (according to documentation, though my experience is that 
it always returns the exact same result, so it maintains determinism, 
but you shouldn't bank on it.)

Bottom line: Just put the darn query fields in aggregate functions. If 
you are making an SQLite wrapper, I would even advise enforcing this 
behaviour even though SQLite can technically make do without it.




[sqlite] Warnings for non-deterministic queries?

2015-11-27 Thread Simon Slavin

On 27 Nov 2015, at 6:30pm, R Smith  wrote:

> Let me explain better, let's assume the query contains MAX(x)... The result 
> will be from whichever row contains the max, and if the x was not in the 
> aggregate function, it will be from some row (which may or may not be the one 
> with the max value).
> In both cases, you have NO chance of knowing which row that would be until 
> the query finishes, so ALWAYS assume an arbitrary row (or 
> Always-non-deterministic, to use the OP's words).

You missed the general principle: there may be any number of rows with the max 
value.  For example, 8,1,6,3,8,4,8,7,6,8 .  So even if you're using MAX you 
still can't predict which row is supplying the other values.  So as you (R 
Smith) says, always assume an arbitrary row.

Simon.


[sqlite] Warnings for non-deterministic queries?

2015-11-27 Thread Darren Duncan
On 2015-11-27 5:46 AM, Keith Medcalf wrote:
>> Is there a way I could programatically determine that a query is non-
>> deterministic at query prepare time?
>
> What do you mean, non-deterministic?  The result is deterministic in all 
> cases.
>
> It may be complicated and/or difficult for you to compute, but it is always 
> deterministic.  The result is generated by running an unchanging algorithm on 
> unchanging data.  If there is no random inputs and the computer hardware is 
> not broken, then the results are entirely determined by the algorithm 
> executed and the state of the data upon which it is operating.

While what you say is true, deterministic if same algorithm and same data, I 
think there's a higher standard for determinism.  The concept of "arbitrary 
row" 
presumably is based on certain implementation details like the structure of an 
index or other hidden metadata, which can change even if there are no 
user-visible changes to the database.  Unless the algorithm guarantees that the 
exact same row will be selected whenever the user-visible parts of the database 
have the exact same value, it is not actually deterministic from the user's 
point of view, which I think is what really matters here.  Selecting an 
"arbitrary row" can only be called deterministic otherwise if the user is able 
to query all of the conditions that would make it chosen, such as index 
details, 
as are applicable.  Any visible change is significant; if the user can get a 
different answer to any question about the database, including 'hidden' parts, 
then the database is different, whereas if all questions they can ask return 
the 
same answer, then the "arbitrary row" should be the same row. -- Darren Duncan



[sqlite] Warnings for non-deterministic queries?

2015-11-27 Thread Török Edwin
Hi,

I mistakenly used the wrong side of the ON caluse in a group by clause for a 
query and I was wondering why SQLite didn't return the same results always:
- SELECT SUM(x) FROM (SELECT files.size + LENGTH(CAST(files.name AS BLOB)) 
+ SUM(COALESCE(LENGTH(CAST(fmeta.key AS BLOB)) + LENGTH(fmeta.value),0)) AS x 
FROM files LEFT JOIN fmeta ON files.fid = fmeta.file_id WHERE files.volume_id = 
:volid AND age >= 0 GROUP BY fmeta.file_id
+   SELECT SUM(x) FROM (SELECT files.size + LENGTH(CAST(files.name AS 
BLOB)) + SUM(COALESCE(LENGTH(CAST(fmeta.key AS BLOB)) + LENGTH(fmeta.value),0)) 
AS x FROM files LEFT JOIN fmeta ON files.fid = fmeta.file_id WHERE 
files.volume_id = :volid AND age >= 0 GROUP BY files.fid

Then I realized that fmeta.file_id can be NULL, and then it no longer groups by 
file id and just selects a random row as documented:
"If the expression is an aggregate expression, it is evaluated across all rows 
in the group. Otherwise, it is evaluated against a single arbitrarily chosen 
row from within the group. "

Is there a way I could programatically determine that a query is 
non-deterministic at query prepare time?

-- 
Edwin T?r?k | Co-founder and Lead Developer

Skylable open-source object storage: reliable, fast, secure
http://www.skylable.com


[sqlite] Warnings for non-deterministic queries?

2015-11-27 Thread Adam Devita
I think the OP meant to write:
"If the expression is an aggregate expression, it is evaluated across
all rows in the group. Otherwise, it is evaluated against a single
arbitrarily chosen row from within the group. "

Is there a way I could programmatically determine that a query is
going to use an arbitrarily chosen row from within the group at query
prepare time?


Adam

On Fri, Nov 27, 2015 at 8:46 AM, Keith Medcalf  wrote:
>
>
>> Is there a way I could programatically determine that a query is non-
>> deterministic at query prepare time?
>
> What do you mean, non-deterministic?  The result is deterministic in all 
> cases.
>
> It may be complicated and/or difficult for you to compute, but it is always 
> deterministic.  The result is generated by running an unchanging algorithm on 
> unchanging data.  If there is no random inputs and the computer hardware is 
> not broken, then the results are entirely determined by the algorithm 
> executed and the state of the data upon which it is operating.
>
>
>
>
>
> ___
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



-- 
--
VerifEye Technologies Inc.
151 Whitehall Dr. Unit 2
Markham, ON
L3R 9T1


[sqlite] Warnings for non-deterministic queries?

2015-11-27 Thread Keith Medcalf


> Is there a way I could programatically determine that a query is non-
> deterministic at query prepare time?

What do you mean, non-deterministic?  The result is deterministic in all cases. 
 

It may be complicated and/or difficult for you to compute, but it is always 
deterministic.  The result is generated by running an unchanging algorithm on 
unchanging data.  If there is no random inputs and the computer hardware is not 
broken, then the results are entirely determined by the algorithm executed and 
the state of the data upon which it is operating.