I saw no replies to this.


Simon Riggs wrote:
> Merge Joins require us to potentially Mark and Restore positions in the
> tuples arriving from executor sub-nodes.
> This currently means that if the tuples arrive from a Sort node, as they
> often do in an MJ, the sort node will be instructed to prepare a random
> access version of the sort result. That requires a full final merge of
> the output, so as to allow rewinding the input when a Restore operation
> is called.
> An MJ doesn't actually need random access, it just needs to be able to
> rewind. The question is: how far does it need to rewind? In many cases,
> the Restore operation moves back a small number of tuples, with a unique
> inner scan requiring a rewind of just one tuple. 
> It would certainly be cheaper, in most cases, for the Sort node to
> maintain a variable size rewind buffer, where the history of prior
> tuples is truncated each time we perform a Mark operation. This could be
> implemented as a modified Tuplestore that could then be trimmed down
> each time a Mark operation took place. If the tuplestore has not yet
> spilled to disk this could be a trivial operation.
> Doing that would almost completely remove the overhead of the final
> merge step in the sort. The final merge often doubles elapsed time in
> cases where the sort is larger than work_mem, which it often is.
> Implementing the variable mark/restore buffer as a dumb Tuplestore would
> mean that the space usage of the Sort could in worst case go as high as
> x2 total space. The worst case is where the inner scan is all a single
> value. The best case is where the inner scan is sufficiently unique over
> all its values that it never writes back to disk at all. 
> So a further refinement of this idea would be to simply defer the final
> merge operation for the sort until the history required for the Mark
> operation exceeded, say, 10% of the sort size. That would then be
> sufficient to improve performance for most common cases, without risking
> massive space overflow for large and highly non-unique data. There's no
> problem with running the final merge slightly later than before;
> everything's still there to allow it. Reusing space in the tuplestore is
> also straightforward since that's exactly what the final merge already
> does, so some rework of that code should be sufficient.
> This is a separate, but related idea of being able to avoid
> mark/restores completely when the outer scan is provably unique. 
> I'm not intending to implement this idea just yet, but it seemed worth
> recording since it occurred to me - and discussing it as a TODO item.
> Comments?
> -- 
>   Simon Riggs             
>   EnterpriseDB   http://www.enterprisedb.com
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [EMAIL PROTECTED] so that your
       message can get through to the mailing list cleanly

Reply via email to