Has this been completed? TODO item? ---------------------------------------------------------------------------
Lawrence, Ramon wrote: > > From: Tom Lane [mailto:t...@sss.pgh.pa.us] > > I was intending to do it the other way, actually. An extra field in > > HashPath hardly costs anything. The other reason for it is that there > > are other possible uses for knowing whether a hash will be > multi-batch. > > (For example, if we were prepared to tell the executor that it *must* > > keep the hash to one batch, we could assume that the sort order of the > > left input is preserved. I haven't looked into the risks/benefits of > > that too much, but it's been in the back of the mind for a long time.) > > Having the number of batches in HashPath could be potentially useful for > a variety of reasons. For our research, we have added an nbatch > variable in both HashPath and HashJoin. Having it in HashJoin is useful > as we modified EXPLAIN to output the number of batches. There are costs > in putting an nbatch variable in HashPath as the system may set this > variable potentially hundreds/thousands of times during costing and does > not (currently) use it until you convert the chosen HashPath to a plan. > > > I'd be more inclined to deal with the issue by trying to establish a > > "safety margin" in the estimate of whether the hash will go > multi-batch. > > IOW we should disuse_physical_tlist if the hash is estimated to be > close > > to but still within one batch. > > Our experiments with large TPC-H 1GB joins show that it is almost always > better to not use physical_tlists if the number of batches is > 1. > There is a noticeable (approximately 5-15%) improvement when using > physical_tlists for in-memory joins. For batches of size 2, it > sometimes can go either way depending how many attributes are projected > out of the outer relation. Using physical_tlists may be better even for > batches of size 2 if most of the attributes of the outer relation are > kept. For a larger number of batches, the extra I/O cost significantly > dominates over the physical_tlist optimization. Performance of > multi-batch joins may improve 50% or more by disabling the optimization. > > It is possible to create a "safety margin" by having > ExecChooseHashTableSize() return the value > inner_rel_bytes/hash_table_bytes which represents the fraction of the > memory available that the inner relation is expected to consume. You > can then make decisions based on that. However, this is only as good > as the inner relation size estimate and especially for large queries, > the estimate may be quite inaccurate. A more robust solution could > examine the "width" of the path and the "width" of the relation combined > with the number of batches to see if projecting early would be worth it. > It may be best to keep it simple and just use number of batches > 1 as a > criteria and instead focus on examining issues with inaccurate join size > estimates. > > -- > Dr. Ramon Lawrence > Assistant Professor, Department of Computer Science, University of > British Columbia Okanagan > E-mail: ramon.lawre...@ubc.ca > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <br...@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers