Hi,

I'm trying to implement a "streaming" version of the classic "select the latest version of a record" query.


By "streaming" I mean a query that executes by streaming what it needs out of tables and indexes as it needs it rather than using temporary b-trees or materializing anything up front.

I'm looking for a query that I can run and then just consume as many results as I want without worrying about the size of the entire result set.


Here's the schema I'm working with:

-----
CREATE TABLE "entrys" ("log-id" INTEGER NOT NULL , "entry-number" INTEGER NOT NULL , "region" TEXT NOT NULL , "key" TEXT NOT NULL , "timestamp" INTEGER NOT NULL , PRIMARY KEY ("log-id", "entry-number"))

CREATE UNIQUE INDEX "entrys-log-id-region-key-entry-number"
ON "entrys" ("log-id" ASC, "region" ASC, "key" ASC, "entry-number" ASC)
-----

There are only a couple of million rows in "entrys", yet my query takes 2 or 3 seconds of startup time before the first row is returned.



Here's my query:

-----
-- explain query plan
SELECT
"entrys"."log-id"       AS "log-id",
"entrys"."entry-number" AS "entry-number",
"entrys"."region"       AS "region",
"entrys"."key"          AS "key",
"entrys"."timestamp"    AS "timestamp"

FROM
        (SELECT
        MAX("entry-number") AS "entry-number",
        "key"
        FROM "entrys"
        WHERE
        "log-id" = 1 AND
        "region" = "user" AND
        "entry-number" <= 1700108
        AND key > "G"
        GROUP BY "key"
        ORDER BY "key" DESC
        limit 20 -- (1)
        ) AS "specific-entrys"

INNER JOIN "entrys"
ON
1 = "entrys"."log-id" AND
"specific-entrys"."key" = "entrys"."key" AND
"user" = "entrys"."region" AND
"specific-entrys"."entry-number" = "entrys"."entry-number"
AND "entrys"."key" > "G"

WHERE
"entrys"."log-id" = 1

ORDER BY "key" ASC
;
-----

...which has this query plan in SQLite version 3.31.0:

-----
QUERY PLAN
|--MATERIALIZE 1
| `--SEARCH TABLE entrys USING COVERING INDEX entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key<?)
|--SEARCH TABLE entrys USING INDEX entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key<?)
`--SEARCH SUBQUERY 1 AS specific-entrys USING AUTOMATIC COVERING INDEX (key=?)
-----


My problem is with the MATERIALIZE.

The query produces just shy of 2 million rows.

It takes several seconds to start up but is then pretty quick when fetching each row.

What I want to do is get rid of the startup costs so that I can paginate it efficiently.

If I run the subselect on its own then there is no startup cost. The results just get streamed straight out of the index.
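For reference, this is the subselect run standalone (same literals as in the full query above; I've left the LIMIT off here):

-----
SELECT
MAX("entry-number") AS "entry-number",
"key"
FROM "entrys"
WHERE
"log-id" = 1 AND
"region" = 'user' AND
"entry-number" <= 1700108
AND "key" > 'G'
GROUP BY "key"
ORDER BY "key" DESC
;
-----
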
Its query plan is:

-----
QUERY PLAN
`--SEARCH TABLE entrys USING COVERING INDEX entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key<?)
-----


Is there anything I can do to make the original version of the query stream the results in this way?

The best I have come up with is to insert a LIMIT clause at the point denoted with "-- (1)". This keeps the subselect small, so materialising it and generating the automatic covering index becomes cheap.

For pagination I then feed in the key from the last row of the previous batch at the points denoted with "-- (2)" and "-- (3)".

If I do this, it seems equally cheap to access batches at the start and at the end of the complete result set. The per-query cost is determined by the batch size set by the LIMIT clause.
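
Concretely, each batch then looks like the following; the values I change between batches are pulled out as bind parameters here (the parameter numbers are just for illustration):

-----
-- ?1 = "key" from the last row of the previous batch ('G' in the
--      example above); ?2 = batch size
SELECT
"entrys"."log-id"       AS "log-id",
"entrys"."entry-number" AS "entry-number",
"entrys"."region"       AS "region",
"entrys"."key"          AS "key",
"entrys"."timestamp"    AS "timestamp"

FROM
        (SELECT
        MAX("entry-number") AS "entry-number",
        "key"
        FROM "entrys"
        WHERE
        "log-id" = 1 AND
        "region" = 'user' AND
        "entry-number" <= 1700108
        AND "key" > ?1 -- (2)
        GROUP BY "key"
        ORDER BY "key" DESC
        LIMIT ?2 -- (1)
        ) AS "specific-entrys"

INNER JOIN "entrys"
ON
1 = "entrys"."log-id" AND
"specific-entrys"."key" = "entrys"."key" AND
'user' = "entrys"."region" AND
"specific-entrys"."entry-number" = "entrys"."entry-number"
AND "entrys"."key" > ?1 -- (3)

WHERE
"entrys"."log-id" = 1

ORDER BY "key" ASC
;
-----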


However, I have been under the impression that LIMIT is meant as a "debugging" extension to the language and is not recommended in queries that end up in one's program. LIMIT also only hides the latency: it amortises it over each batch, and I still end up with memory requirements in each "client" thread that are larger than I'd like. Ideally I would only hold the current row.



Thanks for any tips!




Best wishes,
@ndy

--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF