Re: [HACKERS] Fix for seg picksplit function

Yeb Havinga Fri, 05 Nov 2010 08:47:58 -0700

Hello Alexander,

Here follows a review of your patch.

Hackers,
Seg contrib module contains the same bug in picksplit function ascube contrib module.

Good catch! :-)

Also, Guttman's split algorithm is not needed in unidimensional case,because sorting based algorithm is good in this case.

I had some doubts whether this is true in the general case, instead ofthe given example. I increased the interval width in your example to0.25*b instead of 0.00005*b, with the purpose to increase overlapsbetween intervals. Though the performance gain was less, it was stillfaster than Guttmans algorithm. To make things worse I also tested withan interval with of 1*b, resulting in a lot of overlaps and comparedseveral overlap queries. The sorting algorithm was 25% to 40% faster onsearches. Index creation time with the sorting algorithm is also afraction of the original creation time.

Since this testing could be part of a review, I looked at the code aswell and listed myself as reviewer on the commitfest.

Comparing with gbt_num_picksplit reveals some differences with sortarray intialization and size, the former's sort array starts at index 1(FirstOffsetNumber), your implementation starts at 0 for sorting andhence the size of the sorting array can be one element less. I preferyour way of sort array initialization; gbt_num_pickplits's use ofFirstOffsetNumber of the qsort array seems to mix a define from thegist/btree namespace for no reason and might even lead to confusion.

The remaining part of the new picksplit function puts the segs into leftor right, I think the code is easier to understand if there was only onefor loop from i=1 to 1 < maxoff, for the current code I had to verifythat all sort array entries were really used with the two seperate loopsthat also skipped the first value. I edited the code a bit, and alsoused seg_union to initialize/palloc the datum values. Finally, waste andfirsttime variables were initialized but not used anymore, so removed.


Attached is a revised patch.

regards,
Yeb Havinga

PS: when comparing with gbt_num_picksplit, I noticed that that one doesnot update v->spl_ldatum and spl_rdatum to the union datums, butinitializes these to 0 at the beginning and never seems to update them.Not sure if this is a problem since the num_picksplit stuff seems towork well.

diff --git a/contrib/seg/seg.c b/contrib/seg/seg.c
index 930a35b..93895ef 100644
--- a/contrib/seg/seg.c
+++ b/contrib/seg/seg.c
@@ -292,38 +292,42 @@ gseg_penalty(GISTENTRY *origentry, GISTENTRY *newentry, float *result)
 	return (result);
 }
 
+/*
+ * Auxiliary structure for picksplit method.
+ */
+typedef struct
+{
+	int index;
+	SEG *data;
+} PickSplitSortItem;
 
+/*
+ * Compare function for PickSplitSortItem based on seg_cmp.
+ */
+static int
+sort_item_cmp(const void *a, const void *b)
+{
+	PickSplitSortItem *i1 = (PickSplitSortItem *)a;
+	PickSplitSortItem *i2 = (PickSplitSortItem *)b;
+	return seg_cmp(i1->data, i2->data);
+}
 
 /*
 ** The GiST PickSplit method for segments
-** We use Guttman's poly time split algorithm
+** Algorithm based on sorting. Incoming array of segs is sorting using seg_cmp
+** function. After that first half of segs goes to the left datum, and the
+** second half of segs goes to the right datum.
 */
 GIST_SPLITVEC *
 gseg_picksplit(GistEntryVector *entryvec,
 			   GIST_SPLITVEC *v)
 {
-	OffsetNumber i,
-				j;
-	SEG		   *datum_alpha,
-			   *datum_beta;
+	OffsetNumber i;
 	SEG		   *datum_l,
 			   *datum_r;
-	SEG		   *union_d,
-			   *union_dl,
-			   *union_dr;
-	SEG		   *inter_d;
-	bool		firsttime;
-	float		size_alpha,
-				size_beta,
-				size_union,
-				size_inter;
-	float		size_waste,
-				waste;
-	float		size_l,
-				size_r;
+	PickSplitSortItem	*sortItems;
 	int			nbytes;
-	OffsetNumber seed_1 = 1,
-				seed_2 = 2;
+	OffsetNumber seed_2;
 	OffsetNumber *left,
 			   *right;
 	OffsetNumber maxoff;
@@ -332,111 +336,52 @@ gseg_picksplit(GistEntryVector *entryvec,
 	fprintf(stderr, "picksplit\n");
 #endif
 
-	maxoff = entryvec->n - 2;
-	nbytes = (maxoff + 2) * sizeof(OffsetNumber);
+	maxoff = entryvec->n - 1;
+	nbytes = (maxoff + 1) * sizeof(OffsetNumber);
+	sortItems = (PickSplitSortItem *)palloc(maxoff * sizeof(PickSplitSortItem));
 	v->spl_left = (OffsetNumber *) palloc(nbytes);
 	v->spl_right = (OffsetNumber *) palloc(nbytes);
 
-	firsttime = true;
-	waste = 0.0;
-
-	for (i = FirstOffsetNumber; i < maxoff; i = OffsetNumberNext(i))
+	/*
+	 * Preparing auxiliary array and sorting.
+	 */
+	for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
 	{
-		datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[i].key);
-		for (j = OffsetNumberNext(i); j <= maxoff; j = OffsetNumberNext(j))
-		{
-			datum_beta = (SEG *) DatumGetPointer(entryvec->vector[j].key);
-
-			/* compute the wasted space by unioning these guys */
-			/* size_waste = size_union - size_inter; */
-			union_d = seg_union(datum_alpha, datum_beta);
-			rt_seg_size(union_d, &size_union);
-			inter_d = seg_inter(datum_alpha, datum_beta);
-			rt_seg_size(inter_d, &size_inter);
-			size_waste = size_union - size_inter;
-
-			/*
-			 * are these a more promising split that what we've already seen?
-			 */
-			if (size_waste > waste || firsttime)
-			{
-				waste = size_waste;
-				seed_1 = i;
-				seed_2 = j;
-				firsttime = false;
-			}
-		}
+		sortItems[i - 1].index = i;
+		sortItems[i - 1].data = (SEG *) DatumGetPointer(entryvec->vector[i].key);
 	}
+	qsort(sortItems, maxoff, sizeof(PickSplitSortItem), sort_item_cmp);
+	seed_2 = maxoff / 2;
 
 	left = v->spl_left;
 	v->spl_nleft = 0;
 	right = v->spl_right;
 	v->spl_nright = 0;
 
-	datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[seed_1].key);
-	datum_l = seg_union(datum_alpha, datum_alpha);
-	rt_seg_size(datum_l, &size_l);
-	datum_beta = (SEG *) DatumGetPointer(entryvec->vector[seed_2].key);
-	datum_r = seg_union(datum_beta, datum_beta);
-	rt_seg_size(datum_r, &size_r);
-
-	/*
-	 * Now split up the regions between the two seeds.	An important property
-	 * of this split algorithm is that the split vector v has the indices of
-	 * items to be split in order in its left and right vectors.  We exploit
-	 * this property by doing a merge in the code that actually splits the
-	 * page.
-	 *
-	 * For efficiency, we also place the new index tuple in this loop. This is
-	 * handled at the very end, when we have placed all the existing tuples
-	 * and i == maxoff + 1.
-	 */
-
-	maxoff = OffsetNumberNext(maxoff);
-	for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+	for (i = 0; i < maxoff; i++)
 	{
-		/*
-		 * If we've already decided where to place this item, just put it on
-		 * the right list.	Otherwise, we need to figure out which page needs
-		 * the least enlargement in order to store the item.
-		 */
-
-		if (i == seed_1)
+		/* First half of segs goes to the left datum. */
+		if (i < seed_2)
 		{
-			*left++ = i;
-			v->spl_nleft++;
-			continue;
-		}
-		else if (i == seed_2)
-		{
-			*right++ = i;
-			v->spl_nright++;
-			continue;
-		}
-
-		/* okay, which page needs least enlargement? */
-		datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[i].key);
-		union_dl = seg_union(datum_l, datum_alpha);
-		union_dr = seg_union(datum_r, datum_alpha);
-		rt_seg_size(union_dl, &size_alpha);
-		rt_seg_size(union_dr, &size_beta);
-
-		/* pick which page to add it to */
-		if (size_alpha - size_l < size_beta - size_r)
-		{
-			datum_l = union_dl;
-			size_l = size_alpha;
-			*left++ = i;
+			datum_l = seg_union(sortItems[i].data,
+								(i == 0
+								 ? sortItems[i].data /* union with self at start */
+								 : datum_l) /* union with existing value */ );
+			*left++ = sortItems[i].index;
 			v->spl_nleft++;
 		}
+		/* The other half of segs goes to the right datum. */
 		else
 		{
-			datum_r = union_dr;
-			size_r = size_alpha;
-			*right++ = i;
+			datum_r = seg_union(sortItems[i].data,
+								(i == seed_2
+								 ? sortItems[i].data /* union with self at start */
+								 : datum_r) /* union with existing value */ );
+			*right++ = sortItems[i].index;
 			v->spl_nright++;
 		}
 	}
+
 	*left = *right = FirstOffsetNumber; /* sentinel value, see dosplit() */
 
 	v->spl_ldatum = PointerGetDatum(datum_l);

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Fix for seg picksplit function

Reply via email to