Merge Joins require us to potentially Mark and Restore positions in the tuples arriving from executor sub-nodes.
This currently means that if the tuples arrive from a Sort node, as they often do in an MJ, the sort node will be instructed to prepare a random access version of the sort result. That requires a full final merge of the output, so as to allow rewinding the input when a Restore operation is called. An MJ doesn't actually need random access, it just needs to be able to rewind. The question is: how far does it need to rewind? In many cases, the Restore operation moves back a small number of tuples, with a unique inner scan requiring a rewind of just one tuple. It would certainly be cheaper, in most cases, for the Sort node to maintain a variable size rewind buffer, where the history of prior tuples is truncated each time we perform a Mark operation. This could be implemented as a modified Tuplestore that could then be trimmed down each time a Mark operation took place. If the tuplestore has not yet spilled to disk this could be a trivial operation. Doing that would almost completely remove the overhead of the final merge step in the sort. The final merge often doubles elapsed time in cases where the sort is larger than work_mem, which it often is. Implementing the variable mark/restore buffer as a dumb Tuplestore would mean that the space usage of the Sort could in worst case go as high as x2 total space. The worst case is where the inner scan is all a single value. The best case is where the inner scan is sufficiently unique over all its values that it never writes back to disk at all. So a further refinement of this idea would be to simply defer the final merge operation for the sort until the history required for the Mark operation exceeded, say, 10% of the sort size. That would then be sufficient to improve performance for most common cases, without risking massive space overflow for large and highly non-unique data. There's no problem with running the final merge slightly later than before; everything's still there to allow it. Reusing space in the tuplestore is also straightforward since that's exactly what the final merge already does, so some rework of that code should be sufficient. This is a separate, but related idea of being able to avoid mark/restores completely when the outer scan is provably unique. I'm not intending to implement this idea just yet, but it seemed worth recording since it occurred to me - and discussing it as a TODO item. Comments? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings