Hello Alexander,
Here follows a review of your patch.
Hackers,
Seg contrib module contains the same bug in picksplit function as
cube contrib module.
Good catch! :-)
Also, Guttman's split algorithm is not needed in unidimensional case,
because sorting based algorithm is good in this case.
I had some doubts whether this is true in the general case, instead of
the given example. I increased the interval width in your example to
0.25*b instead of 0.00005*b, with the purpose to increase overlaps
between intervals. Though the performance gain was less, it was still
faster than Guttmans algorithm. To make things worse I also tested with
an interval with of 1*b, resulting in a lot of overlaps and compared
several overlap queries. The sorting algorithm was 25% to 40% faster on
searches. Index creation time with the sorting algorithm is also a
fraction of the original creation time.
Since this testing could be part of a review, I looked at the code as
well and listed myself as reviewer on the commitfest.
Comparing with gbt_num_picksplit reveals some differences with sort
array intialization and size, the former's sort array starts at index 1
(FirstOffsetNumber), your implementation starts at 0 for sorting and
hence the size of the sorting array can be one element less. I prefer
your way of sort array initialization; gbt_num_pickplits's use of
FirstOffsetNumber of the qsort array seems to mix a define from the
gist/btree namespace for no reason and might even lead to confusion.
The remaining part of the new picksplit function puts the segs into left
or right, I think the code is easier to understand if there was only one
for loop from i=1 to 1 < maxoff, for the current code I had to verify
that all sort array entries were really used with the two seperate loops
that also skipped the first value. I edited the code a bit, and also
used seg_union to initialize/palloc the datum values. Finally, waste and
firsttime variables were initialized but not used anymore, so removed.
Attached is a revised patch.
regards,
Yeb Havinga
PS: when comparing with gbt_num_picksplit, I noticed that that one does
not update v->spl_ldatum and spl_rdatum to the union datums, but
initializes these to 0 at the beginning and never seems to update them.
Not sure if this is a problem since the num_picksplit stuff seems to
work well.
diff --git a/contrib/seg/seg.c b/contrib/seg/seg.c
index 930a35b..93895ef 100644
--- a/contrib/seg/seg.c
+++ b/contrib/seg/seg.c
@@ -292,38 +292,42 @@ gseg_penalty(GISTENTRY *origentry, GISTENTRY *newentry, float *result)
return (result);
}
+/*
+ * Auxiliary structure for picksplit method.
+ */
+typedef struct
+{
+ int index;
+ SEG *data;
+} PickSplitSortItem;
+/*
+ * Compare function for PickSplitSortItem based on seg_cmp.
+ */
+static int
+sort_item_cmp(const void *a, const void *b)
+{
+ PickSplitSortItem *i1 = (PickSplitSortItem *)a;
+ PickSplitSortItem *i2 = (PickSplitSortItem *)b;
+ return seg_cmp(i1->data, i2->data);
+}
/*
** The GiST PickSplit method for segments
-** We use Guttman's poly time split algorithm
+** Algorithm based on sorting. Incoming array of segs is sorting using seg_cmp
+** function. After that first half of segs goes to the left datum, and the
+** second half of segs goes to the right datum.
*/
GIST_SPLITVEC *
gseg_picksplit(GistEntryVector *entryvec,
GIST_SPLITVEC *v)
{
- OffsetNumber i,
- j;
- SEG *datum_alpha,
- *datum_beta;
+ OffsetNumber i;
SEG *datum_l,
*datum_r;
- SEG *union_d,
- *union_dl,
- *union_dr;
- SEG *inter_d;
- bool firsttime;
- float size_alpha,
- size_beta,
- size_union,
- size_inter;
- float size_waste,
- waste;
- float size_l,
- size_r;
+ PickSplitSortItem *sortItems;
int nbytes;
- OffsetNumber seed_1 = 1,
- seed_2 = 2;
+ OffsetNumber seed_2;
OffsetNumber *left,
*right;
OffsetNumber maxoff;
@@ -332,111 +336,52 @@ gseg_picksplit(GistEntryVector *entryvec,
fprintf(stderr, "picksplit\n");
#endif
- maxoff = entryvec->n - 2;
- nbytes = (maxoff + 2) * sizeof(OffsetNumber);
+ maxoff = entryvec->n - 1;
+ nbytes = (maxoff + 1) * sizeof(OffsetNumber);
+ sortItems = (PickSplitSortItem *)palloc(maxoff * sizeof(PickSplitSortItem));
v->spl_left = (OffsetNumber *) palloc(nbytes);
v->spl_right = (OffsetNumber *) palloc(nbytes);
- firsttime = true;
- waste = 0.0;
-
- for (i = FirstOffsetNumber; i < maxoff; i = OffsetNumberNext(i))
+ /*
+ * Preparing auxiliary array and sorting.
+ */
+ for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
{
- datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[i].key);
- for (j = OffsetNumberNext(i); j <= maxoff; j = OffsetNumberNext(j))
- {
- datum_beta = (SEG *) DatumGetPointer(entryvec->vector[j].key);
-
- /* compute the wasted space by unioning these guys */
- /* size_waste = size_union - size_inter; */
- union_d = seg_union(datum_alpha, datum_beta);
- rt_seg_size(union_d, &size_union);
- inter_d = seg_inter(datum_alpha, datum_beta);
- rt_seg_size(inter_d, &size_inter);
- size_waste = size_union - size_inter;
-
- /*
- * are these a more promising split that what we've already seen?
- */
- if (size_waste > waste || firsttime)
- {
- waste = size_waste;
- seed_1 = i;
- seed_2 = j;
- firsttime = false;
- }
- }
+ sortItems[i - 1].index = i;
+ sortItems[i - 1].data = (SEG *) DatumGetPointer(entryvec->vector[i].key);
}
+ qsort(sortItems, maxoff, sizeof(PickSplitSortItem), sort_item_cmp);
+ seed_2 = maxoff / 2;
left = v->spl_left;
v->spl_nleft = 0;
right = v->spl_right;
v->spl_nright = 0;
- datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[seed_1].key);
- datum_l = seg_union(datum_alpha, datum_alpha);
- rt_seg_size(datum_l, &size_l);
- datum_beta = (SEG *) DatumGetPointer(entryvec->vector[seed_2].key);
- datum_r = seg_union(datum_beta, datum_beta);
- rt_seg_size(datum_r, &size_r);
-
- /*
- * Now split up the regions between the two seeds. An important property
- * of this split algorithm is that the split vector v has the indices of
- * items to be split in order in its left and right vectors. We exploit
- * this property by doing a merge in the code that actually splits the
- * page.
- *
- * For efficiency, we also place the new index tuple in this loop. This is
- * handled at the very end, when we have placed all the existing tuples
- * and i == maxoff + 1.
- */
-
- maxoff = OffsetNumberNext(maxoff);
- for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+ for (i = 0; i < maxoff; i++)
{
- /*
- * If we've already decided where to place this item, just put it on
- * the right list. Otherwise, we need to figure out which page needs
- * the least enlargement in order to store the item.
- */
-
- if (i == seed_1)
+ /* First half of segs goes to the left datum. */
+ if (i < seed_2)
{
- *left++ = i;
- v->spl_nleft++;
- continue;
- }
- else if (i == seed_2)
- {
- *right++ = i;
- v->spl_nright++;
- continue;
- }
-
- /* okay, which page needs least enlargement? */
- datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[i].key);
- union_dl = seg_union(datum_l, datum_alpha);
- union_dr = seg_union(datum_r, datum_alpha);
- rt_seg_size(union_dl, &size_alpha);
- rt_seg_size(union_dr, &size_beta);
-
- /* pick which page to add it to */
- if (size_alpha - size_l < size_beta - size_r)
- {
- datum_l = union_dl;
- size_l = size_alpha;
- *left++ = i;
+ datum_l = seg_union(sortItems[i].data,
+ (i == 0
+ ? sortItems[i].data /* union with self at start */
+ : datum_l) /* union with existing value */ );
+ *left++ = sortItems[i].index;
v->spl_nleft++;
}
+ /* The other half of segs goes to the right datum. */
else
{
- datum_r = union_dr;
- size_r = size_alpha;
- *right++ = i;
+ datum_r = seg_union(sortItems[i].data,
+ (i == seed_2
+ ? sortItems[i].data /* union with self at start */
+ : datum_r) /* union with existing value */ );
+ *right++ = sortItems[i].index;
v->spl_nright++;
}
}
+
*left = *right = FirstOffsetNumber; /* sentinel value, see dosplit() */
v->spl_ldatum = PointerGetDatum(datum_l);
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers