On 11/2/21 13:04, Hayk Manukyan wrote:
Tomas Vondra
 > Are you suggesting those are not the actual best/worst cases and we
 > should use some other indexes? If yes, which ones?

I would say yes.
In my case I am not querying only sequence column.
I have the following cases which I want to optimize.
1. Select * from Some_table where job = <somthing> and nlp = <something> and year = <something> and *scan_id = <something> * 2. Select * from Some_table where job = <somthing> and nlp = <something> and year = <something> and *Issue_flag = <something> * 3. Select * from Some_table where job = <somthing> and nlp = <something> and year = <something> and *sequence = <something> * Those are queries that my app send to db that is why I said that from *read perspective* our *best case* is 3 separate indexes for *(job, nlp, year, sequence)* AND *(job, nlp, year, Scan_ID)* and *(job, nlp, year,  issue_flag)*  and any other solution like  (job, nlp, year, sequence, Scan_ID, issue_flag) OR  (job, nlp, year ) INCLUDE(sequence, Scan_ID, issue_flag)  OR just (job, nlp, year) can be considered as*worst case *

I already explained why using INCLUDE in this case is the wrong thing to do, it'll harm performance compared to just defining a regular index.

I will remind that in real world scenario I have ~50m rows and about *~5k rows for each (job, nlp, year )*

Well, maybe this is the problem. We have 50M rows, but the three columns have too many distinct values - (job, nlp, year) defines ~50M groups, so there's only a single row per group. That'd explain why the two indexes perform almost equally.

So I guess you need to modify the data generator so that the data set is more like the case you're trying to improve.

From *write perspective* as far as we want to have only one index our*best case* can be considered any of *(job, nlp, year, sequence, Scan_ID, issue_flag)* OR *(job, nlp, year ) INCLUDE(sequence, Scan_ID, issue_flag) * and the*worst case* will be having 3 separate queries like in read perspective (job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) and (job, nlp, year,  issue_flag)


Maybe. It's true a write with three indexes will require modification to three leaf pages (on average). With a single index we have to modify just one leaf page, depending on where the row gets routed.

But with the proposed "merged" index, the row will have to be inserted into three smaller trees. If the trees are large enough, they won't fit into a single leaf page (e.g. the 5000 index tuples is guaranteed to need many pages, even if you use some smart encoding). So the write will likely need to modify at least 3 leaf pages, getting much closer to three separate indexes. At which point you could just use three indexes.

So I think the comparison that we did is not right because we are comparing different/wrong things.
 > For right results we need to compare this two cases when we are doing
random queries with 1,2,3  and random writes.


I'm not going to spend any more time on tweaking the benchmark, but if you tweak it to demonstrate the difference / benefits I'll run it again on my machine etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply via email to