On Tue, Jun 11, 2019 at 2:35 PM Melanie Plageman <melanieplage...@gmail.com> wrote: > Let me try to articulate what I think the bitmap implementation would look > like: > > Before doing chunked hashloop join for any batch, we would need to > know how many tuples are in the outer batch to make the bitmap the > correct size.
I was thinking that we wouldn't need to know this, because if the bitmap is in a file, we can always extend it. To imagine a needlessly dumb implementation, consider: set-bit(i): let b = i / 8 while (b <= length of file in bytes) append '\0' to file read byte b from the file modify the byte you read by setting bit i % 8 write the modified byte back to the file In reality, we'd have some kind of buffer. I imagine locality of reference would be pretty good, because the outer tuples are coming to us in increasing-tuple-number order. If you want to prototype with an in-memory implementation, I'd suggest just pallocing 8kB initially and repallocing when the tuple number gets too big. It'll be sorta inefficient, but who cares? It's certainly way cheaper than an extra pass over the data, and for a POC it should be fine. > Now, I am back to the original problem--how do you know which bit to > set without somehow numbering the tuples with a unique identifier? Is > there anything that uniquely identifies a spill file tuple except its > offset? I don't think so. Approach A hinges on being able to get the tuple number reliably and without contortions, and I have not tried to make that work. So maybe it's really hard or not possible or something. My intuition is that it ought to work, but that and a dollar will get you cup of coffee, so... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company