Re: [HACKERS] Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

Joshua Tolley Thu, 06 Nov 2008 16:45:19 -0800

On Thu, Nov 6, 2008 at 5:31 PM, Lawrence, Ramon <[EMAIL PROTECTED]> wrote:
>> -----Original Message-----
>> > Minor question on this patch. AFAICS there is another patch that seems
>> > to be aiming at exactly the same use case. Jonah's Bloom filter patch.
>> >
>> > Shouldn't we have a dust off to see which one is best? Or at least a
>> > discussion to test whether they overlap? Perhaps you already did that
>> > and I missed it because I'm not very tuned in on this thread.
>> >
>> > --
>> >  Simon Riggs           www.2ndQuadrant.com
>> >  PostgreSQL Training, Services and Support
>>
>> We haven't had that discussion AFAIK, and definitely should. First
>> glance suggests they could coexist peacefully, with proper coaxing. If
>> I understand things properly, Jonah's patch filters tuples early in
>> the join process, and this patch tries to ensure that hash join
>> batches are kept in RAM when they're most likely to be used. So
>> they're orthogonal in purpose, and the patches actually apply *almost*
>> cleanly together. Jonah, any comments? If I continue to have some time
>> to devote, and get through all I think I can do to review this patch,
>> I'll gladly look at Jonah's too, FWIW.
>>
>> - Josh
>
> The skew patch and bloom filter patch are orthogonal and can both be
> applied.  The bloom filter patch is a great idea, and it is used in many
> other database systems.  You can use the TPC-H data set to demonstrate
> that the bloom filter patch will significantly improve performance of
> multi-batch joins (with or without data skew).
>
> Any query that filters a build table before joining on the probe table
> will show improvements with a bloom filter.  For example,
>
> select * from customer, orders where customer.c_nationkey = 10 and
> customer.c_custkey = orders.o_custkey
>
> The bloom filter on customer would allow us to avoid probing with orders
> tuples that cannot possibly find a match due to the selection criteria.
> This is especially beneficial for multi-batch joins where an orders
> tuple must be written to disk if its corresponding customer batch is not
> the in-memory batch.
>
> I have no experience reviewing patches, but I would be happy to help
> contribute/review the bloom filter patch as best I can.
>
> --
> Dr. Ramon Lawrence
> Assistant Professor, Department of Computer Science, University of
> British Columbia Okanagan
> E-mail: [EMAIL PROTECTED]
>
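
For anyone following along, the idea quoted above can be sketched roughly
as follows. This is a minimal illustration in Python, not code from either
patch: the BloomFilter class, the filtered_join function, the filter size,
and the choice of hash are all assumptions made up for the example. It
builds a filter over the customer keys that survive the c_nationkey
predicate, then uses it to discard orders tuples that cannot match before
they are probed (or, in a multi-batch join, written to disk).

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are arbitrary illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from one SHA-256 digest
        # (an assumption for the sketch; any good hash family would do).
        digest = hashlib.sha256(repr(key).encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def filtered_join(customers, orders, nationkey):
    """Hash join of customers and orders with a Bloom filter pre-check."""
    bloom = BloomFilter()
    build = {}
    # Build side: only customers passing the selection are hashed.
    for c in customers:
        if c["c_nationkey"] == nationkey:
            build[c["c_custkey"]] = c
            bloom.add(c["c_custkey"])

    results, skipped = [], 0
    # Probe side: tuples rejected by the filter are dropped immediately.
    for o in orders:
        if not bloom.might_contain(o["o_custkey"]):
            skipped += 1
            continue
        match = build.get(o["o_custkey"])
        if match is not None:  # a false positive simply finds no match
            results.append((match, o))
    return results, skipped
```

The filter never rejects a tuple that has a match, so the join result is
unchanged; the saving comes only from the tuples it rules out early.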


I have no patch review experience either -- this is my first review. See
http://wiki.postgresql.org/wiki/Reviewing_a_Patch for details on what
a reviewer ought to do in general; various patch review discussions on
the -hackers list have also proven helpful. As regards this patch
specifically, it seems we could merge the two patches into one and
consider them together. However, the bloom filter patch is listed as a
"Work in Progress" on
http://wiki.postgresql.org/wiki/CommitFest_2008-11. Perhaps it needs
more work before being considered seriously? Jonah, what do you think
would be most helpful?

- Josh / eggyknap
