On Tue, Jun 4, 2019 at 2:47 PM Melanie Plageman <melanieplage...@gmail.com> wrote: > Oops! You are totally right. > I will amend the idea: > For each chunk on the inner side, loop through both the original batch > file and the unmatched outer tuples file created for the last chunk. > Emit any matches and write out any unmatched tuples to a new unmatched > outer tuples file. > > I think, in the worst case, if no tuples from the outer have a match, > you end up writing out all of the outer tuples for each chunk on the > inner side. However, using the match bit in the tuple header solution > would require this much writing. > Probably the bigger problem is that in this worst case you would also > need to read double the number of outer tuples for each inner chunk. > > However, in the best case it seems like it would be better than the > match bit/write everything from the outer side out solution.
I guess so, but the downside of needing to read twice as many outer tuples for each inner chunk seems pretty large. It would be a lot nicer if we could find a way to store the matched-bits someplace other than where we are storing the tuples, what Thomas called a bits-in-order scheme, because then the amount of additional read and write I/O would be tiny -- one bit per tuple doesn't add up very fast. In the scheme you propose here, I think that after you read the original outer tuples for each chunk and the unmatched outer tuples for each chunk, you'll have to match up the unmatched tuples to the original tuples, probably by using memcmp() or something. Otherwise, when a new match occurs, you won't know which tuple should now not be emitted into the new unmatched outer tuples file that you're going to produce. So I think what's going to happen is that you'll read the original batch file, then read the unmatched tuples file and use that to set or not set a bit on each tuple in memory, then do the real work setting more bits, then write out a new unmatched-tuples file with the tuples that still don't have the bit set. So your unmatched tuple file is basically a list of tuple identifiers in the least compact form imaginable: the tuple is identified by the entire tuple contents. That doesn't seem very appealing, although I expect that it would still win for some queries. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company