Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-21 Thread Andy Bennett

Hi,

Thanks to everyone who helped with this!

I'll try some stuff out and see if I can get things efficient, fast *and* 
simple.


:-)


"There's a small sidenote (that I'm too lazy too find right 
now) in the select docs that mentions that, in case of using min 
or max as aggregate, the non-aggregate columns will come from 
the row that held the min/max value."



Look in
https://www.sqlite.org/quirks.html
under "6. Aggregate Queries Can Contain Non-Aggregate Result 
Columns That Are Not In The GROUP BY Clause"


and also in
https://www.sqlite.org/lang_select.html
In section 3 search for: "Side note: Bare columns in an aggregate queries."


--
Best wishes,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-20 Thread Richard Damon
On 11/20/19 1:26 PM, Simon Slavin wrote:
> On 20 Nov 2019, at 6:11pm, Andy Bennett  wrote:
>
>> In past attempts at improving query performance these have been added to 
>> encourage it to use an index that it can do a SCAN thru' rather than the 
>> table that it would need to do a SEARCH thru'.
> SQLite is not using the PRIMARY INDEX to immediately locate the appropriate 
> row, but is actually faster when you fake it into using a longer index ?  
> That's weird.

I think the issue is that the 'Primary Index' isn't really the primary index. 
Since the table isn't WITHOUT ROWID and the primary key isn't INTEGER PRIMARY 
KEY, the implied ROWID is the real primary key. Thus a lookup of a row by his 
declared primary key is a two-step lookup: find the key in the unique index for 
the key, then look up the specified record by ROWID. If there is an index with 
the needed data, the second lookup isn't needed.

-- 
Richard Damon



Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-20 Thread Keith Medcalf

Yes.  See under item #3 in the Side note on https://sqlite.org/lang_select.html

Special processing occurs when the aggregate function is either min() or max(). 
Example:

SELECT a, b, max(c) FROM tab1 GROUP BY a;

When the min() or max() aggregate functions are used in an aggregate query, all 
bare columns in the result set take values from the input row which also 
contains the minimum or maximum. So in the query above, the value of the "b" 
column in the output will be the value of the "b" column in the input row that 
has the largest "c" value. There is still an ambiguity if two or more of the 
input rows have the same minimum or maximum value or if the query contains more 
than one min() and/or max() aggregate function. Only the built-in min() and 
max() functions work this way.
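This is easy to check from Python's sqlite3 module (a throwaway sketch; the table and values are invented):

```python
import sqlite3

# In-memory database with a toy table: for each group "a" we want the
# row holding the maximum "c", reading the bare column "b" from it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab1 (a, b, c)")
con.executemany("INSERT INTO tab1 VALUES (?, ?, ?)",
                [(1, "x", 10), (1, "y", 30), (1, "z", 20),
                 (2, "p", 5),  (2, "q", 1)])

# The bare column "b" takes its value from the row with the max "c"
# in each group -- SQLite-specific behaviour, not standard SQL.
rows = con.execute(
    "SELECT a, b, max(c) FROM tab1 GROUP BY a ORDER BY a").fetchall()
print(rows)  # [(1, 'y', 30), (2, 'p', 5)]
```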

Many other RDBMSes used to do this (e.g. Sybase, which became MS SQL Server, 
amongst others).  This was "done away with" in most other SQL implementations 
and is indeed "non-standard" behaviour.  The standard generally requires that 
items in the select list be either aggregates or in the GROUP BY list.

This is a specific implementation detail/feature of SQLite3 and will (probably) 
not be portable elsewhere ...

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users  On
>Behalf Of Andy Bennett
>Sent: Wednesday, 20 November, 2019 12:37
>To: sqlite-users@mailinglists.sqlite.org
>Subject: Re: [sqlite] Optimising query with aggregate in subselect.
>
>Hi,
>
>> Did you try retrieving the data "directly" or do you need the
>> subselect in order to maintain compatibility with other SQL
>> dialects that are no longer able to retrieve data from the row
>> on which the max was found?
>
>Thanks Keith!
>
>I understood that selecting other columns during an aggregate leads to
>ill-specified or undefined values in those columns. Does SQLite make more
>guarantees than the SQL standard here? Do you have a pointer to the docs,
>as I tried and failed to find it in there?
>
>
>
>>
>> CREATE TABLE entrys
>> (
>> logid   INTEGER NOT NULL,
>> entrynumber INTEGER NOT NULL,
>> region  TEXT NOT NULL,
>> key TEXT NOT NULL,
>> timestamp   INTEGER NOT NULL,
>> PRIMARY KEY (logid, entrynumber)
>> );
>>
>> CREATE INDEX a on entrys (region, logid, key, entrynumber);
>>
>>   SELECT entrys.logid             AS logid,
>>          max(entrys.entrynumber)  AS entrynumber,
>>          entrys.region            AS region,
>>          entrys.key               AS key,
>>          entrys.timestamp         AS timestamp
>>     FROM entrys
>>    WHERE entrys.region = ?
>>      AND entrys.key > ?
>>      AND entrys.logid = ?
>>    GROUP BY key
>> ;
>>
>> NB:  I changed the ill-conceived column names to ones that do
>> not require quoting, and replaced the identifier-quoted items
>> that are not column names with parameter markers.
>
>
>
>
>
>Best wishes,
>@ndy
>
>--
>andy...@ashurst.eu.org
>http://www.ashurst.eu.org/
>0x7EBA75FF





Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-20 Thread David Raymond
"There's a small sidenote (that I'm too lazy too find right now) in the select 
docs that mentions that, in case of using min or max as aggregate, the 
non-aggregate columns will come from the row that held the min/max value."


Look in
https://www.sqlite.org/quirks.html
under "6. Aggregate Queries Can Contain Non-Aggregate Result Columns That Are 
Not In The GROUP BY Clause"

and also in
https://www.sqlite.org/lang_select.html
In section 3 search for: "Side note: Bare columns in an aggregate queries."


Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-20 Thread Merijn Verstraaten


> On 20 Nov 2019, at 20:37, Andy Bennett  wrote:
> 
> Hi,
> 
>> Did you try retrieving the data "directly" or do you need the subselect in 
>> order to maintain compatibility with other SQL dialects that are no longer 
>> able to retrieve data from the row on which the max was found?
> 
> Thanks Keith!
> 
> I understood that selecting other columns during an aggregate leads to 
> ill-specified or undefined values in those columns. Does SQLite make more 
> guarantees than the SQL standard here? Do you have a pointer to the docs, as 
> I tried and failed to find it in there?

There's a small sidenote (that I'm too lazy to find right now) in the select 
docs that mentions that, in case of using min or max as aggregate, the 
non-aggregate columns will come from the row that held the min/max value.

Kind regards,
Merijn




Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-20 Thread Andy Bennett

Hi,

In past attempts at improving query performance these have 
been added to encourage it to use an index that it can do a 
SCAN thru' rather than the table that it would need to do a 
SEARCH thru'.


SQLite is not using the PRIMARY INDEX to immediately locate the 
appropriate row, but is actually faster when you fake it into 
using a longer index ?  That's weird.


There is more than one row returned. Potentially several million.

...so it's quicker to do a linear scan thru' something that's in the same 
order (table or index) rather than a series of tree accesses for each and 
every row.




Best wishes,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF


Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-20 Thread Andy Bennett

Hi,

Did you try retrieving the data "directly" or do you need the 
subselect in order to maintain compatibility with other SQL 
dialects that are no longer able to retrieve data from the row 
on which the max was found?


Thanks Keith!

I understood that selecting other columns during an aggregate leads to 
ill-specified or undefined values in those columns. Does SQLite make more 
guarantees than the SQL standard here? Do you have a pointer to the docs, as 
I tried and failed to find it in there?






CREATE TABLE entrys
(
logid   INTEGER NOT NULL,
entrynumber INTEGER NOT NULL,
region  TEXT NOT NULL,
key TEXT NOT NULL,
timestamp   INTEGER NOT NULL,
PRIMARY KEY (logid, entrynumber)
);

CREATE INDEX a on entrys (region, logid, key, entrynumber);

  SELECT entrys.logid             AS logid,
         max(entrys.entrynumber)  AS entrynumber,
         entrys.region            AS region,
         entrys.key               AS key,
         entrys.timestamp         AS timestamp
    FROM entrys
   WHERE entrys.region = ?
     AND entrys.key > ?
     AND entrys.logid = ?
   GROUP BY key
;

NB:  I changed the ill-conceived column names to ones that do not 
require quoting, and replaced the identifier-quoted items that are not 
column names with parameter markers.






Best wishes,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF


Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-20 Thread Keith Medcalf

Did you try retrieving the data "directly" or do you need the subselect in 
order to maintain compatibility with other SQL dialects that are no longer able 
to retrieve data from the row on which the max was found?

CREATE TABLE entrys
(
logid   INTEGER NOT NULL,
entrynumber INTEGER NOT NULL,
region  TEXT NOT NULL,
key TEXT NOT NULL,
timestamp   INTEGER NOT NULL,
PRIMARY KEY (logid, entrynumber)
);

CREATE INDEX a on entrys (region, logid, key, entrynumber);

  SELECT entrys.logid             AS logid,
         max(entrys.entrynumber)  AS entrynumber,
         entrys.region            AS region,
         entrys.key               AS key,
         entrys.timestamp         AS timestamp
    FROM entrys
   WHERE entrys.region = ?
     AND entrys.key > ?
     AND entrys.logid = ?
   GROUP BY key
;

NB:  I changed the ill-conceived column names to ones that do not require 
quoting, and replaced the identifier-quoted items that are not column names 
with parameter markers.
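For what it's worth, the rewritten query can be exercised end-to-end with Python's sqlite3 module (the sample rows below are invented; the point is that the timestamp returned for each key comes from that key's max-entrynumber row):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entrys (
    logid       INTEGER NOT NULL,
    entrynumber INTEGER NOT NULL,
    region      TEXT NOT NULL,
    key         TEXT NOT NULL,
    timestamp   INTEGER NOT NULL,
    PRIMARY KEY (logid, entrynumber)
);
CREATE INDEX a ON entrys (region, logid, key, entrynumber);
""")
# Two versions of key 'H' and one of 'I', all in log 1 / region 'user'.
con.executemany("INSERT INTO entrys VALUES (?, ?, ?, ?, ?)",
                [(1, 1, "user", "H", 100),
                 (1, 2, "user", "H", 200),   # latest version of 'H'
                 (1, 3, "user", "I", 300)])

rows = con.execute("""
    SELECT logid, max(entrynumber) AS entrynumber, region, key, timestamp
      FROM entrys
     WHERE region = ? AND key > ? AND logid = ?
     GROUP BY key
""", ("user", "G", 1)).fetchall()
# 'H' reports entrynumber 2 with timestamp 200, i.e. the bare columns
# come from the row holding the maximum entrynumber for that key.
print(rows)
```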

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users  On
>Behalf Of Andy Bennett
>Sent: Wednesday, 20 November, 2019 09:49
>To: sqlite-users@mailinglists.sqlite.org
>Subject: [sqlite] Optimising query with aggregate in subselect.
>
>Hi,
>
>I'm trying to implement a "streaming" version of the classic "select the
>latest version of a record" query.
>
>
>By "streaming" I mean a query that executes by streaming what it needs
>out
>of tables and indexes as it needs it rather than using temporary b-trees
>or
>materializing anything up front.
>
>I'm looking for a query that I can run and then just consume as many
>results as I want without worrying about the size of the entire result
>set.
>
>
>Here's the schema I'm working with:
>
>-
>CREATE TABLE "entrys" ("log-id" INTEGER NOT NULL , "entry-number" INTEGER
>NOT NULL , "region" TEXT NOT NULL , "key" TEXT NOT NULL , "timestamp"
>INTEGER NOT NULL , PRIMARY KEY ("log-id", "entry-number"))
>
>CREATE UNIQUE INDEX "entrys-log-id-region-key-entry-number" ON "entrys" (
>"log-id" ASC, "region" ASC, "key" ASC, "entry-number" ASC)
>-
>
>There's only a couple of million rows in "entrys" and my query times are
>into 2 or 3 seconds of startup time before the first row is returned.
>
>
>
>Here's my query:
>
>-
>-- explain query plan
>SELECT
>"entrys"."log-id"   AS "log-id",
>"entrys"."entry-number" AS "entry-number",
>"entrys"."region"   AS "region",
>"entrys"."key"  AS "key",
>"entrys"."timestamp"AS "timestamp"
>
>FROM
>   (SELECT
>   MAX("entry-number") AS "entry-number",
>   "key"
>   FROM "entrys"
>   WHERE
>   "log-id" = 1 AND
>   "region" = "user" AND
>   "entry-number" <= 1700108
>   AND key > "G"
>   GROUP BY "key"
>   ORDER BY "key" DESC
>   limit 20 -- (1)
>   ) AS "specific-entrys"
>
>INNER JOIN "entrys"
>ON
>1 = "entrys"."log-id" AND
>"specific-entrys"."key" = "entrys"."key" AND
>"user" = "entrys"."region" AND
>"specific-entrys"."entry-number" = "entrys"."entry-number"
>AND "entrys"."key" > "G"
>
>WHERE
>"entrys"."log-id" = 1
>
>ORDER BY "key" ASC
>;
>-
>
>...which has this query plan in SQLite verson 3.31.0
>
>-
>QUERY PLAN
>|--MATERIALIZE 1
>|  `--SEARCH TABLE entrys USING COVERING INDEX
>|     entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key>?)
>|--SEARCH TABLE entrys USING INDEX entrys-log-id-region-key-entry-number
>|  (log-id=? AND region=? AND key=? AND entry-number=?)
>`--SEARCH SUBQUERY 1 AS specific-entrys USING AUTOMATIC COVERING INDEX
>(key=?)
>-
>
>
>My problem is with the MATERIALIZE.
>
>The query produces just shy of 2 million rows.
>
>It takes several seconds to start up but is then pretty quick when
>fetching
>each row.
>
>What I want to do is get rid of the startup costs so that I can paginate
>it
>efficiently.
>
>If I run the subselect on its own then there is no startup cost. The
>results just get streamed straight out of the index.
>Its query plan is
>
>-
>QUERY PLAN
>`--SEARCH TABLE entrys USING COVERING INDEX
>entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key>?)
>-
>
>Is there anything I can do to make the original version of the query
>stream
>the results in this way?
>
>The best I have come up with is to insert a LIMIT clause at the point
>denoted with "-- (1)". This keeps the subselect small and then
>materialising the subselect and generating the automatic covering index
>becomes cheap.
>
>For pagination I then feed in the key from the last row of the previous
>batch at the points denoted with "-- (2)" and "-- (3)".
>
>If I do this then it seems equally cheap to access batches at the start
>and
>end of the complete result set. The per-query cost is determined by the
>batch size as set but the LIMIT clause.
>
>
>However, I have been under the impression that LIMIT is supposed to be a
>"debugging" extension to the language and not 

Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-20 Thread Simon Slavin
On 20 Nov 2019, at 6:11pm, Andy Bennett  wrote:

> In past attempts at improving query performance these have been added to 
> encourage it to use an index that it can do a SCAN thru' rather than the 
> table that it would need to do a SEARCH thru'.

SQLite is not using the PRIMARY INDEX to immediately locate the appropriate 
row, but is actually faster when you fake it into using a longer index ?  
That's weird.


Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-20 Thread Andy Bennett

Hi,


INNER JOIN "entrys"
ON
1 = "entrys"."log-id" AND
"specific-entrys"."key" = "entrys"."key" AND
"user" = "entrys"."region" AND
"specific-entrys"."entry-number" = "entrys"."entry-number"
AND "entrys"."key" > "G"


I can't solve your problem, but the PRIMARY KEY for "entrys" is

("log-id", "entry-number")

.  You shouldn't need to match so many different fields when 
just two of them, which you already have values for, narrow your 
search down to a single row.  Though I may be missing something.


In past attempts at improving query performance these have been added to 
encourage it to use an index that it can do a SCAN thru' rather than the 
table that it would need to do a SEARCH thru'.


I'm pretty happy with the indexes it's currently choosing (apart from the 
MATERIALIZE). Adding a covering index on timestamp theoretically improves 
things but doesn't seem to make a (measurable) difference in practice with 
current data sizes.





Best wishes,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF


Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-20 Thread Simon Slavin
On 20 Nov 2019, at 4:49pm, Andy Bennett  wrote:

> INNER JOIN "entrys"
> ON
> 1 = "entrys"."log-id" AND
> "specific-entrys"."key" = "entrys"."key" AND
> "user" = "entrys"."region" AND
> "specific-entrys"."entry-number" = "entrys"."entry-number"
> AND "entrys"."key" > "G"

I can't solve your problem, but the PRIMARY KEY for "entrys" is

("log-id", "entry-number")

.  You shouldn't need to match so many different fields when just two of them, 
which you already have values for, narrow your search down to a single row.  
Though I may be missing something.


Re: [sqlite] Optimising query with aggregate in subselect.

2019-11-20 Thread Andy Bennett

Hi,


ORDER BY "key" DESC


This should be ASC, not DESC: I've been working on versions of the query 
that can go forwards and backwards and made an editor snafu when writing 
the eMail.




Best wishes,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF