Just about as soon as I sent the above, I realized that it's unlikely to make sense in the real world in a row-store. If the goal is to keep the top-25 results and trim the rest, what happens when values are added/modified/deleted? You now have to go look at all of the data you *aren't* caching in the index. Unless you can *guarantee* that data is entered in perfect order, and/or never changes, I don't think what I'm looking for is likely to make sense.
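To make that maintenance problem concrete, here's roughly what the hand-rolled side-table version looks like (a sketch only; the `scan_top25` table, the trigger, and the function names are all invented for illustration). Inserts can trim the capped list cheaply, but a delete of a cached row has no shortcut:

```sql
-- Hypothetical capped side table: the 25 most recent scans per inv_id.
CREATE TABLE scan_top25 (
    inv_id       uuid      NOT NULL,
    scan_id      uuid      NOT NULL,
    scan_dts_utc timestamp NOT NULL,
    PRIMARY KEY (inv_id, scan_id)
);

-- On INSERT, trimming is cheap: add the new rows, then drop anything
-- that has fallen past rank 25 for the affected inv_ids.
CREATE OR REPLACE FUNCTION scan_top25_trim() RETURNS trigger AS $$
BEGIN
    INSERT INTO scan_top25 (inv_id, scan_id, scan_dts_utc)
    SELECT inv_id, id, scan_dts_utc FROM new_rows;

    DELETE FROM scan_top25 t
    USING (
        SELECT inv_id, scan_id,
               row_number() OVER (PARTITION BY inv_id
                                  ORDER BY scan_dts_utc DESC) AS rn
        FROM scan_top25
        WHERE inv_id IN (SELECT DISTINCT inv_id FROM new_rows)
    ) ranked
    WHERE t.inv_id  = ranked.inv_id
      AND t.scan_id = ranked.scan_id
      AND ranked.rn > 25;

    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER scan_top25_ins
    AFTER INSERT ON data.scan
    REFERENCING NEW TABLE AS new_rows
    FOR EACH STATEMENT
    EXECUTE FUNCTION scan_top25_trim();

-- On DELETE, there is no equivalent shortcut: if a cached row goes away,
-- the new 25th-most-recent value can only come from the base table, so a
-- delete trigger would have to re-scan data.scan for each affected inv_id.
```

Which is exactly the problem above: an index-internal LIMIT would face the same re-scan on delete, so it only pays off for append-only, in-order data.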
On Fri, Apr 5, 2024 at 11:27 AM Morris de Oryx <morrisdeo...@gmail.com> wrote:

> Looking for an index to support top-n searches, where n has a fixed maximum
>
> Recently, I've been looking at strategies to handle top-n queries in
> Postgres. In my current cases, we've got definition tables and very large
> related tables. Here's a stripped-down example; the real tables are much
> wider.
>
> CREATE TABLE IF NOT EXISTS data.inventory (
>     id     uuid NOT NULL PRIMARY KEY,
>     inv_id uuid NOT NULL
> );
>
> CREATE TABLE IF NOT EXISTS data.scan (
>     id           uuid      NOT NULL PRIMARY KEY,
>     inv_id       uuid      NOT NULL,
>     scan_dts_utc timestamp NOT NULL DEFAULT NOW() -- We run our servers on UTC
> );
>
> Every item in inventory is scanned when it passes through various steps in
> a clean-dispatch-arrive-use-clean sort of a work cycle. The ratio between
> inventory and scan is 1:0-n, where n can be virtually any number. In
> another table pair like this, the average is 1:1,000. In the inventory
> example, it's roughly 0:200,000. The distribution of related row counts is
> all over the map. The reasons behind these ratios sometimes map to valid
> processes, and sometimes are a consequence of some low-quality data leaking
> into the system. In the case of inventory, the results make sense. In our
> case:
>
> * The goal value for n is often 1, and not more than 25.
>
> * Depending on the table, 97% or more of the rows are discarded because
>   they're past the 25th most recent record.
>
> * Partial indexes do not work as well on huge tables as I hoped. The same
>   data copied via a STATEMENT trigger into a thin, subsetted table is much
>   faster. I think this has to do with the increase in random disk access
>   required for a table and/or index with more pages spread around on the disk.
>
> * We can't filter in advance by *any* reasonable date range. We could get 25
>   scans on one item in an hour, or a year, or five years, or never.
>
> We're finding that we need the top-n records more and more often, returned
> quickly. This gets harder to do as the table(s) grow.
>
> SELECT id, scan_dts_utc
>   FROM data.scan
>  WHERE inv_id = 'b7db5d06-8275-224d-a38a-ac263dc1c767'
>  ORDER BY scan_dts_utc DESC
>  LIMIT 25; -- Full search product might be 0, 200K, or anything
>            -- in-between. Not on a bell curve.
>
> A compound index works really well to optimize these kinds of searches:
>
> CREATE INDEX scan_inv_id_scan_time_utc_dts_idx
>     ON ascendco.analytic_scan (inv_id, scan_time_utc_dts DESC);
>
> What I'm wondering is if there is some index option, likely not with a
> B-tree, that can *automatically* enforce a maximum-length list of top
> values, based on a defined sort:
>
> CREATE INDEX scan_inv_id_scan_time_utc_dts_idx
>     ON ascendco.analytic_scan (inv_id, scan_time_utc_dts DESC) -- This defines the ordering
>  LIMIT 25;                                                     -- This sets the hard max for n
>
> The goal is to have an automatically maintained list of the top values
> *in* the index itself. In the right situations (like ours), this reduces
> the index size by 20x or more. Smaller index, faster results. And, since
> the index is on the source table, the row references are already there.
> (Something I lose when maintaining this by hand in a side/shadow/top table.)
>
> I've looked at a ton of plans, and Postgres *clearly* goes to a lot of
> effort to recognize and optimize top-n searches already. That's
> encouraging, as it suggests that the planner takes LIMIT into account.
> (I've picked up already that maintaining the purity of the planner and
> executor abstractions is a core value to the project.)
>
> And, for sure, I can build and maintain my own custom, ordered list in
> various ways. None of them seem like they can possibly rival the trimming
> behavior being handled by an index.
>
> I'm out over my skis here, but I'm intuiting that this might be a job for
> one of the multi-value/inverted index types/frameworks.
> I tried some experiments, but only got worse results.
>
> Hope that reads as understandable... grateful for any suggestions.
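For anyone landing on this thread later: the plain compound index from the quoted message already gives Postgres a bounded top-n plan, which is the behavior worth checking before reaching for anything exotic. Something like the following (a sketch against the stripped-down `data.scan` example; actual plans will vary with your data and settings):

```sql
CREATE INDEX scan_inv_id_scan_dts_utc_idx
    ON data.scan (inv_id, scan_dts_utc DESC);

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, scan_dts_utc
  FROM data.scan
 WHERE inv_id = 'b7db5d06-8275-224d-a38a-ac263dc1c767'
 ORDER BY scan_dts_utc DESC
 LIMIT 25;

-- The plan to look for is an Index Scan on scan_inv_id_scan_dts_utc_idx
-- with no separate Sort node: because the index order matches the ORDER BY,
-- the executor stops after reading at most 25 index entries, regardless of
-- whether the item has 0 or 200K scans. The index is full-size, but the
-- per-query work is bounded.
```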