Re: [sqlite] Optimising query with aggregate in subselect.
Hi,

Thanks to everyone who helped with this! I'll try some stuff out and see if I can get things efficient, fast *and* simple. :-)

> There's a small sidenote (that I'm too lazy to find right now) in the
> select docs that mentions that, in case of using min or max as aggregate,
> the non-aggregate columns will come from the row that held the min/max
> value.

Look in https://www.sqlite.org/quirks.html under "6. Aggregate Queries Can Contain Non-Aggregate Result Columns That Are Not In The GROUP BY Clause", and also in https://www.sqlite.org/lang_select.html: in section 3, search for "Side note: Bare columns in an aggregate queries."

Best wishes,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Optimising query with aggregate in subselect.
On 11/20/19 1:26 PM, Simon Slavin wrote:
> On 20 Nov 2019, at 6:11pm, Andy Bennett wrote:
>
>> In past attempts at improving query performance these have been added to
>> encourage it to use an index that it can do a SCAN thru' rather than the
>> table that it would need to do a SEARCH thru'.
>
> SQLite is not using the PRIMARY INDEX to immediately locate the appropriate
> row, but is actually faster when you fake it into using a longer index?
> That's weird.

I think the issue is that the declared PRIMARY KEY isn't really the primary index. Since the table isn't WITHOUT ROWID, and the primary key isn't an INTEGER PRIMARY KEY, the implied ROWID is the real primary key. A lookup of a row by the declared primary key is therefore a two-step lookup: first find the key in the unique index, then fetch the record by ROWID. If there is an index that already contains the needed data, the second lookup isn't needed.

--
Richard Damon
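A quick way to see the two-step lookup Richard describes is EXPLAIN QUERY PLAN. The sketch below uses Python's built-in sqlite3 module with the simplified table, column, and index names from Keith's rewrite elsewhere in the thread (it is an editor's illustration, not code from the original posts):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entrys (
  logid       INTEGER NOT NULL,
  entrynumber INTEGER NOT NULL,
  region      TEXT    NOT NULL,
  key         TEXT    NOT NULL,
  timestamp   INTEGER NOT NULL,
  PRIMARY KEY (logid, entrynumber)   -- not INTEGER PRIMARY KEY: stored as a separate unique index
);
CREATE INDEX a ON entrys (region, logid, key, entrynumber);
""")

# timestamp is not in the PK index, so this is the two-step lookup:
# probe the unique index, then fetch the row by ROWID.
plan1 = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT timestamp FROM entrys WHERE logid = 1 AND entrynumber = 2"
).fetchall()

# Every needed column is inside index 'a', so SQLite reports a
# COVERING INDEX and the second ROWID lookup is skipped.
plan2 = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT entrynumber FROM entrys "
    "WHERE region = 'user' AND logid = 1 AND key = 'k'"
).fetchall()

print(plan1)
print(plan2)
```

The exact plan wording varies between SQLite versions, but the first plan names a plain INDEX while the second names a COVERING INDEX.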
Re: [sqlite] Optimising query with aggregate in subselect.
Yes. See the side note under item #3 on https://sqlite.org/lang_select.html:

  Special processing occurs when the aggregate function is either min() or
  max(). Example:

    SELECT a, b, max(c) FROM tab1 GROUP BY a;

  When the min() or max() aggregate functions are used in an aggregate
  query, all bare columns in the result set take values from the input row
  which also contains the minimum or maximum. So in the query above, the
  value of the "b" column in the output will be the value of the "b" column
  in the input row that has the largest "c" value. There is still an
  ambiguity if two or more of the input rows have the same minimum or
  maximum value or if the query contains more than one min() and/or max()
  aggregate function. Only the built-in min() and max() functions work this
  way.

Many other RDBMSes used to do this (e.g. Sybase, which became MS SQL Server, amongst others). It was done away with in most other SQL implementations and is indeed non-standard behaviour: the standard generally requires that items in the select list be either aggregates or in the GROUP BY list. This is a specific implementation detail/feature of SQLite3 and will (probably) not be portable elsewhere.

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users On Behalf Of Andy Bennett
>Sent: Wednesday, 20 November, 2019 12:37
>To: sqlite-users@mailinglists.sqlite.org
>Subject: Re: [sqlite] Optimising query with aggregate in subselect.
>
>Hi,
>
>> Did you try retrieving the data "directly" or do you need the
>> subselect in order to maintain compatibility with other SQL
>> dialects that are no longer able to retrieve data from the row
>> on which the max was found?
>
>Thanks Keith!
>
>I understood that selecting other columns during an aggregate leads to
>ill-specified or undefined values in those columns. Does SQLite make more
>guarantees than the SQL standard here? Do you have a pointer to the docs?
>I tried and failed to find it in there.
>
>> CREATE TABLE entrys
>> (
>>     logid       INTEGER NOT NULL,
>>     entrynumber INTEGER NOT NULL,
>>     region      TEXT    NOT NULL,
>>     key         TEXT    NOT NULL,
>>     timestamp   INTEGER NOT NULL,
>>     PRIMARY KEY (logid, entrynumber)
>> );
>>
>> CREATE INDEX a ON entrys (region, logid, key, entrynumber);
>>
>> SELECT entrys.logid            AS logid,
>>        max(entrys.entrynumber) AS entrynumber,
>>        entrys.region           AS region,
>>        entrys.key              AS key,
>>        entrys.timestamp        AS timestamp
>>   FROM entrys
>>  WHERE entrys.region = ?
>>    AND entrys.key > ?
>>    AND entrys.logid = ?
>>  GROUP BY key
>> ;
>>
>> NB: I changed the ill-conceived column names to ones that do not require
>> quoting and replaced the identifier-quoted items that are not column
>> names with parameter markers.
>
>Best wishes,
>@ndy
>
>--
>andy...@ashurst.eu.org
>http://www.ashurst.eu.org/
>0x7EBA75FF
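The quoted documentation text is easy to verify from code. A minimal, self-contained demonstration using Python's built-in sqlite3 module (the sample table and rows are invented for illustration):

```python
import sqlite3

# SQLite's bare-column behaviour with max(): the bare column "b" takes its
# value from the same input row that held the maximum "c" in each group.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tab1 (a, b, c);
INSERT INTO tab1 VALUES (1, 'low',  10),
                        (1, 'high', 30),
                        (1, 'mid',  20),
                        (2, 'only', 5);
""")

rows = con.execute(
    "SELECT a, b, max(c) FROM tab1 GROUP BY a ORDER BY a"
).fetchall()
print(rows)  # → [(1, 'high', 30), (2, 'only', 5)]
```

In the a=1 group, "b" is 'high' because that row held the largest "c" (30), exactly as the doc describes.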
Re: [sqlite] Optimising query with aggregate in subselect.
"There's a small sidenote (that I'm too lazy too find right now) in the select docs that mentions that, in case of using min or max as aggregate, the non-aggregate columns will come from the row that held the min/max value." Look in https://www.sqlite.org/quirks.html under "6. Aggregate Queries Can Contain Non-Aggregate Result Columns That Are Not In The GROUP BY Clause" and also in https://www.sqlite.org/lang_select.html In section 3 search for: "Side note: Bare columns in an aggregate queries." ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Optimising query with aggregate in subselect.
> On 20 Nov 2019, at 20:37, Andy Bennett wrote:
>
> Hi,
>
>> Did you try retrieving the data "directly" or do you need the subselect
>> in order to maintain compatibility with other SQL dialects that are no
>> longer able to retrieve data from the row on which the max was found?
>
> Thanks Keith!
>
> I understood that selecting other columns during an aggregate leads to
> ill-specified or undefined values in those columns. Does SQLite make more
> guarantees than the SQL standard here? Do you have a pointer to the docs?
> I tried and failed to find it in there.

There's a small sidenote (that I'm too lazy to find right now) in the select docs that mentions that, in case of using min or max as aggregate, the non-aggregate columns will come from the row that held the min/max value.

Kind regards,
Merijn
Re: [sqlite] Optimising query with aggregate in subselect.
Hi,

>> In past attempts at improving query performance these have been added to
>> encourage it to use an index that it can do a SCAN thru' rather than the
>> table that it would need to do a SEARCH thru'.
>
> SQLite is not using the PRIMARY INDEX to immediately locate the
> appropriate row, but is actually faster when you fake it into using a
> longer index? That's weird.

There is more than one row returned. Potentially several million. So it's quicker to do a linear scan thru' something that's in the same order (table or index) rather than a series of tree accesses for each and every row.

Best wishes,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF
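The streaming-in-index-order point above can be seen in EXPLAIN QUERY PLAN: when the requested ORDER BY matches the index order, SQLite needs no temporary B-tree and can emit rows as it walks the index. A sketch using Python's built-in sqlite3 module and the simplified schema from Keith's rewrite (an editor's illustration, not the thread's own code):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entrys (
  logid INTEGER NOT NULL, entrynumber INTEGER NOT NULL,
  region TEXT NOT NULL, key TEXT NOT NULL, timestamp INTEGER NOT NULL,
  PRIMARY KEY (logid, entrynumber)
);
CREATE INDEX a ON entrys (region, logid, key, entrynumber);
""")

# Equality on the index prefix (region, logid) plus a range on key means
# rows already come out of index 'a' in key order: no sort step needed.
plan = con.execute("""
EXPLAIN QUERY PLAN
SELECT key, entrynumber
  FROM entrys
 WHERE region = 'user' AND logid = 1 AND key > 'G'
 ORDER BY key
""").fetchall()
print(plan)
```

The plan reports a single search of the COVERING INDEX and, notably, no "USE TEMP B-TREE FOR ORDER BY" step.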
Re: [sqlite] Optimising query with aggregate in subselect.
Hi,

> Did you try retrieving the data "directly" or do you need the subselect
> in order to maintain compatibility with other SQL dialects that are no
> longer able to retrieve data from the row on which the max was found?

Thanks Keith!

I understood that selecting other columns during an aggregate leads to ill-specified or undefined values in those columns. Does SQLite make more guarantees than the SQL standard here? Do you have a pointer to the docs? I tried and failed to find it in there.

> CREATE TABLE entrys
> (
>     logid       INTEGER NOT NULL,
>     entrynumber INTEGER NOT NULL,
>     region      TEXT    NOT NULL,
>     key         TEXT    NOT NULL,
>     timestamp   INTEGER NOT NULL,
>     PRIMARY KEY (logid, entrynumber)
> );
>
> CREATE INDEX a ON entrys (region, logid, key, entrynumber);
>
> SELECT entrys.logid            AS logid,
>        max(entrys.entrynumber) AS entrynumber,
>        entrys.region           AS region,
>        entrys.key              AS key,
>        entrys.timestamp        AS timestamp
>   FROM entrys
>  WHERE entrys.region = ?
>    AND entrys.key > ?
>    AND entrys.logid = ?
>  GROUP BY key
> ;
>
> NB: I changed the ill-conceived column names to ones that do not require
> quoting and replaced the identifier-quoted items that are not column
> names with parameter markers.

Best wishes,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF
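For what it's worth, the bare-column guarantee can be checked against this exact query. A minimal sketch using Python's built-in sqlite3 module; the sample rows are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entrys (
  logid INTEGER NOT NULL, entrynumber INTEGER NOT NULL,
  region TEXT NOT NULL, key TEXT NOT NULL, timestamp INTEGER NOT NULL,
  PRIMARY KEY (logid, entrynumber)
);
CREATE INDEX a ON entrys (region, logid, key, entrynumber);
INSERT INTO entrys VALUES (1, 1, 'user', 'H', 100),
                          (1, 2, 'user', 'H', 200),  -- latest version of 'H'
                          (1, 3, 'user', 'J', 300);
""")

# Because max() is used, the bare timestamp column is taken from the
# row holding the max entrynumber within each key group.
rows = con.execute("""
    SELECT entrys.logid            AS logid,
           max(entrys.entrynumber) AS entrynumber,
           entrys.key              AS key,
           entrys.timestamp        AS timestamp
      FROM entrys
     WHERE entrys.region = ? AND entrys.key > ? AND entrys.logid = ?
     GROUP BY key
     ORDER BY key
""", ('user', 'G', 1)).fetchall()
print(rows)  # → [(1, 2, 'H', 200), (1, 3, 'J', 300)]
```

Key 'H' reports timestamp 200, the value from its max-entrynumber row, not an arbitrary one.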
Re: [sqlite] Optimising query with aggregate in subselect.
Did you try retrieving the data "directly" or do you need the subselect in order to maintain compatibility with other SQL dialects that are no longer able to retrieve data from the row on which the max was found?

CREATE TABLE entrys
(
    logid       INTEGER NOT NULL,
    entrynumber INTEGER NOT NULL,
    region      TEXT    NOT NULL,
    key         TEXT    NOT NULL,
    timestamp   INTEGER NOT NULL,
    PRIMARY KEY (logid, entrynumber)
);

CREATE INDEX a ON entrys (region, logid, key, entrynumber);

SELECT entrys.logid            AS logid,
       max(entrys.entrynumber) AS entrynumber,
       entrys.region           AS region,
       entrys.key              AS key,
       entrys.timestamp        AS timestamp
  FROM entrys
 WHERE entrys.region = ?
   AND entrys.key > ?
   AND entrys.logid = ?
 GROUP BY key
;

NB: I changed the ill-conceived column names to ones that do not require quoting and replaced the identifier-quoted items that are not column names with parameter markers.

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users On Behalf Of Andy Bennett
>Sent: Wednesday, 20 November, 2019 09:49
>To: sqlite-users@mailinglists.sqlite.org
>Subject: [sqlite] Optimising query with aggregate in subselect.
>
>Hi,
>
>I'm trying to implement a "streaming" version of the classic "select the
>latest version of a record" query.
>
>By "streaming" I mean a query that executes by streaming what it needs out
>of tables and indexes as it needs it rather than using temporary b-trees
>or materialising anything up front.
>
>I'm looking for a query that I can run and then just consume as many
>results as I want without worrying about the size of the entire result
>set.
>
>Here's the schema I'm working with:
>
>-
>CREATE TABLE "entrys" ("log-id" INTEGER NOT NULL , "entry-number" INTEGER
>NOT NULL , "region" TEXT NOT NULL , "key" TEXT NOT NULL , "timestamp"
>INTEGER NOT NULL , PRIMARY KEY ("log-id", "entry-number"))
>
>CREATE UNIQUE INDEX "entrys-log-id-region-key-entry-number" ON "entrys" (
>"log-id" ASC, "region" ASC, "key" ASC, "entry-number" ASC)
>-
>
>There's only a couple of million rows in "entrys" and my query times are
>into 2 or 3 seconds of startup time before the first row is returned.
>
>Here's my query:
>
>-
>-- explain query plan
>SELECT
>"entrys"."log-id"       AS "log-id",
>"entrys"."entry-number" AS "entry-number",
>"entrys"."region"       AS "region",
>"entrys"."key"          AS "key",
>"entrys"."timestamp"    AS "timestamp"
>
>FROM
> (SELECT
>  MAX("entry-number") AS "entry-number",
>  "key"
>  FROM "entrys"
>  WHERE
>  "log-id" = 1 AND
>  "region" = "user" AND
>  "entry-number" <= 1700108
>  AND key > "G" -- (2)
>  GROUP BY "key"
>  ORDER BY "key" DESC
>  limit 20 -- (1)
> ) AS "specific-entrys"
>
>INNER JOIN "entrys"
>ON
>1 = "entrys"."log-id" AND
>"specific-entrys"."key" = "entrys"."key" AND
>"user" = "entrys"."region" AND
>"specific-entrys"."entry-number" = "entrys"."entry-number"
>AND "entrys"."key" > "G" -- (3)
>
>WHERE
>"entrys"."log-id" = 1
>
>ORDER BY "key" ASC
>;
>-
>
>...which has this query plan in SQLite version 3.31.0
>
>-
>QUERY PLAN
>|--MATERIALIZE 1
>|  `--SEARCH TABLE entrys USING COVERING INDEX
>entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key>?)
>|--SEARCH TABLE entrys USING INDEX entrys-log-id-region-key-entry-number
>(log-id=? AND region=? AND key>?)
>`--SEARCH SUBQUERY 1 AS specific-entrys USING AUTOMATIC COVERING INDEX
>(key=?)
>-
>
>My problem is with the MATERIALIZE.
>
>The query produces just shy of 2 million rows.
>
>It takes several seconds to start up but is then pretty quick when
>fetching each row.
>
>What I want to do is get rid of the startup costs so that I can paginate
>it efficiently.
>
>If I run the subselect on its own then there is no startup cost. The
>results just get streamed straight out of the index. Its query plan is
>
>-
>QUERY PLAN
>`--SEARCH TABLE entrys USING COVERING INDEX
>entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key>?)
>-
>
>Is there anything I can do to make the original version of the query
>stream the results in this way?
>
>The best I have come up with is to insert a LIMIT clause at the point
>denoted with "-- (1)". This keeps the subselect small and then
>materialising the subselect and generating the automatic covering index
>becomes cheap.
>
>For pagination I then feed in the key from the last row of the previous
>batch at the points denoted with "-- (2)" and "-- (3)".
>
>If I do this then it seems equally cheap to access batches at the start
>and end of the complete result set. The per-query cost is determined by
>the batch size as set by the LIMIT clause.
>
>However, I have been under the impression that LIMIT is supposed to be a
>"debugging" extension to the language and not
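The pagination scheme described above (a LIMIT at "-- (1)", with the last key of each batch fed back in as a lower bound) is keyset pagination. A hypothetical sketch in Python's built-in sqlite3 module, using the simplified schema from Keith's rewrite rather than Andy's original quoted identifiers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entrys (
  logid INTEGER NOT NULL, entrynumber INTEGER NOT NULL,
  region TEXT NOT NULL, key TEXT NOT NULL, timestamp INTEGER NOT NULL,
  PRIMARY KEY (logid, entrynumber)
);
CREATE INDEX a ON entrys (region, logid, key, entrynumber);
""")
con.executemany("INSERT INTO entrys VALUES (1, ?, 'user', ?, 0)",
                [(n, "k%03d" % n) for n in range(100)])

def batches(con, batch_size=20):
    """Yield the latest entrynumber per key, batch_size keys at a time."""
    last_key = ""  # '' sorts before every non-empty key
    while True:
        rows = con.execute("""
            SELECT key, max(entrynumber)
              FROM entrys
             WHERE logid = 1 AND region = 'user' AND key > ?
             GROUP BY key ORDER BY key LIMIT ?""",
            (last_key, batch_size)).fetchall()
        if not rows:
            return
        yield rows
        last_key = rows[-1][0]  # feed last key into the next batch's bound

total = sum(len(b) for b in batches(con))
print(total)  # → 100: every key seen exactly once across batches
```

Each query touches only one small slice of the index, so the cost per batch stays flat whether you are at the start or the end of the full result set, matching Andy's observation.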
Re: [sqlite] Optimising query with aggregate in subselect.
On 20 Nov 2019, at 6:11pm, Andy Bennett wrote:

> In past attempts at improving query performance these have been added to
> encourage it to use an index that it can do a SCAN thru' rather than the
> table that it would need to do a SEARCH thru'.

SQLite is not using the PRIMARY INDEX to immediately locate the appropriate row, but is actually faster when you fake it into using a longer index? That's weird.
Re: [sqlite] Optimising query with aggregate in subselect.
Hi,

>> INNER JOIN "entrys"
>> ON
>> 1 = "entrys"."log-id" AND
>> "specific-entrys"."key" = "entrys"."key" AND
>> "user" = "entrys"."region" AND
>> "specific-entrys"."entry-number" = "entrys"."entry-number"
>> AND "entrys"."key" > "G"
>
> I can't solve your problem, but the PRIMARY KEY for "entrys" is ("log-id",
> "entry-number"). You shouldn't need to match so many different fields when
> just two of them, which you already have values for, narrow your search
> down to a single row. Though I may be missing something.

In past attempts at improving query performance these have been added to encourage it to use an index that it can do a SCAN thru' rather than the table that it would need to do a SEARCH thru'.

I'm pretty happy with the indexes it's currently choosing (apart from the MATERIALIZE). Adding a covering index on timestamp theoretically improves things but doesn't seem to make a (measurable) difference in practice with current data sizes.

Best wishes,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF
Re: [sqlite] Optimising query with aggregate in subselect.
On 20 Nov 2019, at 4:49pm, Andy Bennett wrote:

> INNER JOIN "entrys"
> ON
> 1 = "entrys"."log-id" AND
> "specific-entrys"."key" = "entrys"."key" AND
> "user" = "entrys"."region" AND
> "specific-entrys"."entry-number" = "entrys"."entry-number"
> AND "entrys"."key" > "G"

I can't solve your problem, but the PRIMARY KEY for "entrys" is ("log-id", "entry-number"). You shouldn't need to match so many different fields when just two of them, which you already have values for, narrow your search down to a single row. Though I may be missing something.
Re: [sqlite] Optimising query with aggregate in subselect.
Hi,

>> ORDER BY "key" DESC
>
> This should be ASC, not DESC.

I've been working on versions of the query that can go forwards and backwards and made an editor snafu when writing the email.

Best wishes,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF