Hello again,

It turns out that there actually exists an(other) implementation of
the Daitch-Mokotoff Soundex System which gets it right; the JOS
Soundex Calculator at https://www.jewishgen.org/jos/jossound.htm
Other implementations I have been able to find, like the one in Apache
Commons Coded used in e.g. Elasticsearch, are far from correct.

The source code for the JOS Soundex Calculator is not available, as
far as I can tell, however I have run the complete list of 98412 last
names at
https://raw.githubusercontent.com/philipperemy/name-dataset/master/names_dataset/v1/last_names.all.txt
through the calculator, in order to have a good basis for comparison.

This revealed a few shortcomings in my implementation. In particular I
had to go back to the drawing board in order to handle the dual nature
of "J" correctly. "J" can be either a vowel or a consonant in
Daitch-Mokotoff soundex, and this complicates encoding of the
*previous* character.

I have also done a more thorough review and refactoring of the code,
which should hopefully make things quite a bit more understandable to
others.

The changes are summarized as follows:

* Returns NULL for input without any encodable characters.
* Uses the same "unoffical" rules for "UE" as other implementations.
* Correctly considers "J" as either a vowel or a consonant.
* Corrected encoding for e.g. "HANNMANN".
* Code refactoring and comments for readability.
* Better examples in the documentation.

The implementation is now in correspondence with the JOS Soundex
Calculator for the 98412 last names mentioned above, with only the
following exceptions:

JOS: cedeño 430000 530000
PG:  cedeño 436000 536000
JOS: sadab(khura)  437000
PG:  sadab(khura)  437590

I hope this addition to the fuzzystrmatch extension module will prove
to be useful to others as well!

This is my very first code contribution to PostgreSQL, and I would be
grateful for any advice on how to proceed in order to get the patch
accepted.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..826e529e3e 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,11 +3,12 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
 REGRESS = fuzzystrmatch
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..1b7263c349
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,593 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <d...@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "daitch_mokotoff.h"
+
+#include "postgres.h"
+#include "utils/builtins.h"
+#include "mb/pg_wchar.h"
+
+#include <ctype.h>
+#include <string.h>
+
+/* Internal C implementation */
+static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+	text	   *arg = PG_GETARG_TEXT_PP(0);
+	char	   *string,
+			   *tmp_soundex;
+	text	   *soundex;
+
+	/*
+	 * The maximum theoretical soundex size is several KB, however in practice
+	 * anything but contrived synthetic inputs will yield a soundex size of
+	 * less than 100 bytes. We thus allocate and free a temporary work buffer,
+	 * and return only the actual soundex result.
+	 */
+	string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+	tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
+
+	if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))
+	{
+		/* No encodable characters in input. */
+		pfree(tmp_soundex);
+		PG_RETURN_NULL();
+	}
+
+	soundex = cstring_to_text(pg_any_to_server(tmp_soundex, strlen(tmp_soundex), PG_UTF8));
+	pfree(tmp_soundex);
+
+	PG_RETURN_TEXT_P(soundex);
+}
+
+
+typedef dm_node dm_nodes[DM_MAX_NODES];
+typedef dm_node * dm_leaves[DM_MAX_LEAVES];
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+	.soundex_length = 0,
+	.soundex = "000000 ",		/* Six digits + joining space */
+	.is_leaf = 0,
+	.last_update = 0,
+	.code_digit = '\0',
+	.prev_code_digits = {'\0', '\0'},
+	.next_code_digits = {'\0', '\0'},
+	.prev_code_index = 0,
+	.next_code_index = 0,
+	.next_nodes = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static dm_codes end_codes[2] =
+{
+	{
+		"X", "X", "X"
+	}
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+	if (node->last_update < last_update)
+	{
+		node->prev_code_digits[0] = node->next_code_digits[0];
+		node->prev_code_digits[1] = node->next_code_digits[1];
+		node->next_code_digits[0] = '\0';
+		node->next_code_digits[1] = '\0';
+		node->prev_code_index = node->next_code_index;
+		node->next_code_index = 0;
+		node->is_leaf = 0;
+		node->last_update = last_update;
+	}
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+	/* OR in index 1 or 2. */
+	node->next_code_index |= code_index;
+
+	if (!node->next_code_digits[0])
+	{
+		node->next_code_digits[0] = code_digit;
+	}
+	else if (node->next_code_digits[0] != code_digit)
+	{
+		node->next_code_digits[1] = code_digit;
+	}
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
+{
+	if (!node->is_leaf)
+	{
+		node->is_leaf = 1;
+		leaves_next[(*num_leaves_next)++] = node;
+	}
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
+									 dm_node * node, char code_digit)
+{
+	dm_node   **next_nodes;
+	dm_node    *next_node;
+
+	for (next_nodes = node->next_nodes; (next_node = *next_nodes); next_nodes++)
+	{
+		if (next_node->code_digit == code_digit)
+		{
+			return next_node;
+		}
+	}
+
+	next_node = &nodes[(*num_nodes)++];
+	*next_nodes = next_node;
+
+	*next_node = start_node;
+	memcpy(next_node->soundex, node->soundex, sizeof(next_node->soundex));
+	next_node->soundex_length = node->soundex_length;
+	next_node->soundex[next_node->soundex_length++] = code_digit;
+	next_node->code_digit = code_digit;
+	next_node->next_code_index = node->prev_code_index;
+
+	return next_node;
+}
+
+
+/* Update node for next code digit(s). */
+static int
+update_node(dm_nodes nodes, dm_node * node, int *num_nodes,
+			dm_leaves leaves_next, int *num_leaves_next,
+			int letter_no, int prev_code_index, int next_code_index,
+			char *next_code_digits, int digit_no)
+{
+	int			i;
+	char		next_code_digit = next_code_digits[digit_no];
+	int			num_dirty_nodes = 0;
+	dm_node    *dirty_nodes[2];
+
+	initialize_node(node, letter_no);
+
+	if (node->soundex_length == DM_MAX_CODE_DIGITS)
+	{
+		/* Keep completed soundex code. */
+		set_leaf(leaves_next, num_leaves_next, node);
+		return 0;
+	}
+
+	if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+	{
+		/*
+		 * If the sound (vowel / consonant) of this letter encoding doesn't
+		 * correspond to the coding index of the previous letter, we skip this
+		 * letter encoding. Note that currently, only "J" can be either a
+		 * vowel or a consonant.
+		 */
+		return 1;
+	}
+
+	if (next_code_digit == 'X' ||
+		(digit_no == 0 &&
+		 (node->prev_code_digits[0] == next_code_digit ||
+		  node->prev_code_digits[1] == next_code_digit)))
+	{
+		/* The code digit is the same as one of the previous (i.e. not added). */
+		dirty_nodes[num_dirty_nodes++] = node;
+	}
+
+	if (next_code_digit != 'X' &&
+		(digit_no > 0 ||
+		 node->prev_code_digits[0] != next_code_digit ||
+		 node->prev_code_digits[1]))
+	{
+		/* The code digit is different from one of the previous (i.e. added). */
+		node = find_or_create_node(nodes, num_nodes, node, next_code_digit);
+		initialize_node(node, letter_no);
+		dirty_nodes[num_dirty_nodes++] = node;
+	}
+
+	for (i = 0; i < num_dirty_nodes; i++)
+	{
+		/* Add code digit leading to the current node. */
+		add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+		if (next_code_digits[++digit_no])
+		{
+			update_node(nodes, dirty_nodes[i], num_nodes,
+						leaves_next, num_leaves_next,
+						letter_no, prev_code_index, next_code_index,
+						next_code_digits, digit_no);
+		}
+		else
+		{
+			set_leaf(leaves_next, num_leaves_next, dirty_nodes[i]);
+		}
+	}
+
+	return 1;
+}
+
+
+/* Update soundex tree leaf nodes. Return 1 when all nodes are completed. */
+static int
+update_leaves(dm_nodes nodes, int *num_nodes,
+			  dm_leaves leaves[2], int *ix_leaves, int *num_leaves,
+			  int letter_no, dm_codes * codes, dm_codes * next_codes)
+{
+	int			i,
+				j,
+				k,
+				code_index;
+	dm_code    *code,
+			   *next_code;
+	int			num_leaves_next = 0;
+	int			ix_leaves_next = (*ix_leaves + 1) & 1;	/* Alternate ix: 0, 1 */
+	int			finished = 1;
+
+	for (i = 0; i < *num_leaves; i++)
+	{
+		dm_node    *node = leaves[*ix_leaves][i];
+
+		/* One or two alternate code sequences. */
+		for (j = 0; j < 2 && (code = codes[j]) && code[0][0]; j++)
+		{
+			/* Coding for previous letter - before vowel: 1, all other: 2 */
+			int			prev_code_index = (code[0][0] > '1') + 1;
+
+			/* One or two alternate next code sequences. */
+			for (k = 0; k < 2 && (next_code = next_codes[k]) && next_code[0][0]; k++)
+			{
+				/* Determine which code to use. */
+				if (letter_no == 0)
+				{
+					/* This is the first letter. */
+					code_index = 0;
+				}
+				else if (next_code[0][0] <= '1')
+				{
+					/* The next letter is a vowel. */
+					code_index = 1;
+				}
+				else
+				{
+					/* All other cases. */
+					code_index = 2;
+				}
+
+				/* One or two sequential code digits. */
+				if (update_node(nodes, node, num_nodes,
+								leaves[ix_leaves_next], &num_leaves_next,
+								letter_no, prev_code_index, code_index,
+								code[code_index], 0))
+				{
+					finished = 0;
+				}
+			}
+		}
+	}
+
+	*ix_leaves = ix_leaves_next;
+	*num_leaves = num_leaves_next;
+
+	return finished;
+}
+
+
+/* Mapping from ISO8859-1 to ASCII */
+static const char tr_accents_iso8859_1[] =
+/*
+"ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDsaaaaaaeceeeeiiiidnooooo/ouuuuydy";
+
+static char
+unaccent_iso8859_1(unsigned char c)
+{
+	return c >= 192 ? tr_accents_iso8859_1[c - 192] : c;
+}
+
+
+/* Convert an UTF-8 character to ISO-8859-1.
+ * Unconvertable characters are returned as '?'.
+ * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
+ */
+static char
+utf8_to_iso8859_1(char *str, int *ix)
+{
+	const char	unknown = '?';
+	unsigned char c,
+				c2;
+	unsigned int code_point;
+
+	/* First byte. */
+	c = (unsigned char) str[(*ix)++];
+	if (c < 0x80)
+	{
+		/* ASCII code point. */
+		if (c >= '[' && c <= ']')
+		{
+			/* Codes reserved for Ą, Ę, and Ţ/Ț. */
+			return unknown;
+		}
+
+		return c;
+	}
+
+	/* Second byte. */
+	c2 = (unsigned char) str[(*ix)++];
+	if (!c2)
+	{
+		/* The UTF-8 character is cut short (invalid code point). */
+		(*ix)--;
+		return unknown;
+	}
+
+	if (c < 0xE0)
+	{
+		/* Two-byte character. */
+		code_point = ((c & 0x1F) << 6) | (c2 & 0x3F);
+		if (code_point < 0x100)
+		{
+			/* ISO-8859-1 code point. */
+			return code_point;
+		}
+		else if (code_point == 0x0104 || code_point == 0x0105)
+		{
+			/* Ą/ą */
+			return '[';
+		}
+		else if (code_point == 0x0118 || code_point == 0x0119)
+		{
+			/* Ę/ę */
+			return '\\';
+		}
+		else if (code_point == 0x0162 || code_point == 0x0163 ||
+				 code_point == 0x021A || code_point == 0x021B)
+		{
+			/* Ţ/ţ or Ț/ț */
+			return ']';
+		}
+		else
+		{
+			return unknown;
+		}
+	}
+
+	/* Third byte. */
+	if (!str[(*ix)++])
+	{
+		/* The UTF-8 character is cut short (invalid code point). */
+		(*ix)--;
+		return unknown;
+	}
+
+	if (c < 0xF0)
+	{
+		/* Three-byte character. */
+		return unknown;
+	}
+
+	/* Fourth byte. */
+	if (!str[(*ix)++])
+	{
+		/* The UTF-8 character is cut short (invalid code point). */
+		(*ix)--;
+	}
+
+	return unknown;
+}
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(char *str, int *ix)
+{
+	return toupper(unaccent_iso8859_1(utf8_to_iso8859_1(str, ix)));
+}
+
+
+/* Convert input to ASCII, skipping any characters not in [A-\]]. */
+static void
+normalize_input(char *src, char *dst)
+{
+	int			c;
+	int			i = 0,
+				j = 0;
+
+	while ((c = read_char(src, &i)))
+	{
+		if (c >= 'A' && c <= ']')
+		{
+			dst[j++] = c;
+		}
+	}
+
+	dst[j] = '\0';
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static dm_codes *
+read_letter(char *str, int *ix)
+{
+	int			c,
+				cmp;
+	int			i = *ix,
+				j;
+	dm_letter  *letters;
+	dm_codes   *codes;
+
+	/* First letter in sequence. */
+	if (!(c = str[i++]))
+	{
+		return NULL;
+	}
+	letters = &letter_[c - 'A'];
+	codes = letters->codes;
+	*ix = i;
+
+	/* Any subsequent letters in sequence. */
+	while ((letters = letters->letters) && (c = str[i++]))
+	{
+		for (j = 0; (cmp = letters[j].letter); j++)
+		{
+			if (cmp == c)
+			{
+				/* Letter found. */
+				letters = &letters[j];
+				if (letters->codes)
+				{
+					/* Coding for letter sequence found. */
+					codes = letters->codes;
+					*ix = i;
+				}
+				break;
+			}
+		}
+		if (!cmp)
+		{
+			/* The sequence of letters has no coding. */
+			break;
+		}
+	}
+
+	return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
+static char *
+_daitch_mokotoff(char *word, char *soundex, size_t n)
+{
+	int			i = 0,
+				j;
+	int			letter_no = 0;
+	int			ix_leaves = 0;
+	int			num_nodes = 0,
+				num_leaves = 0;
+	dm_codes   *codes,
+			   *next_codes;
+	dm_node    *nodes;
+	dm_leaves  *leaves;
+
+	/* Convert input to encodable ASCII characters, stored in soundex buffer. */
+	normalize_input(word, soundex);
+	if (!soundex[0])
+	{
+		/* No encodable character in input. */
+		return NULL;
+	}
+
+	/* Allocate memory for node tree. */
+	nodes = palloc(sizeof(dm_nodes));
+	leaves = palloc(2 * sizeof(dm_leaves));
+
+	/* Starting point. */
+	nodes[num_nodes++] = start_node;
+	leaves[ix_leaves][num_leaves++] = &nodes[0];
+
+	codes = read_letter(soundex, &i);
+
+	while (codes)
+	{
+		next_codes = read_letter(soundex, &i);
+
+		/* Update leaf nodes. */
+		if (update_leaves(nodes, &num_nodes,
+						  leaves, &ix_leaves, &num_leaves,
+						  letter_no, codes, next_codes ? next_codes : end_codes))
+		{
+			/* All soundex codes are completed to six digits. */
+			break;
+		}
+
+		codes = next_codes;
+		letter_no++;
+	}
+
+	/* Concatenate all generated soundex codes. */
+	for (i = 0, j = 0;
+		 i < num_leaves && j + DM_MAX_CODE_DIGITS + 1 <= n;
+		 i++, j += DM_MAX_CODE_DIGITS + 1)
+	{
+		memcpy(&soundex[j], leaves[ix_leaves][i]->soundex, DM_MAX_CODE_DIGITS + 1);
+	}
+
+	/* Terminate string. */
+	soundex[j - (j != 0)] = '\0';
+
+	pfree(leaves);
+	pfree(nodes);
+
+	return soundex;
+}
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.h b/contrib/fuzzystrmatch/daitch_mokotoff.h
new file mode 100644
index 0000000000..8426069825
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.h
@@ -0,0 +1,999 @@
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <d...@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS 6
+#define DM_MAX_ALTERNATE_CODES 5
+#define DM_MAX_NODES 1564
+#define DM_MAX_LEAVES 1250
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+
+typedef char dm_code[2 + 1];	/* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];	/* Start of name, before a vowel, any other */
+
+/* Letter in input sequence */
+struct dm_letter
+{
+	char		letter;			/* Present letter in sequence */
+	struct dm_letter *letters;	/* List of possible successive letters */
+	dm_codes   *codes;			/* Code sequence(s) for complete sequence */
+};
+
+/* Node in soundex code tree */
+struct dm_node
+{
+	int			soundex_length; /* Length of generated soundex code */
+	char		soundex[DM_MAX_CODE_DIGITS + 1];	/* Soundex code */
+	int			is_leaf;		/* Candidate for complete soundex code */
+	int			last_update;	/* Letter number for last update of node */
+	char		code_digit;		/* Last code digit, 0 - 9 */
+
+	/*
+	 * One or two alternate code digits leading to this node. If there are two
+	 * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+	 * back to the same node.
+	 */
+	char		prev_code_digits[2];
+	/* One or two alternate code digits moving forward. */
+	char		next_code_digits[2];
+	/* ORed together code index(es) used to reach current node. */
+	int			prev_code_index;
+	int			next_code_index;
+	/* Nodes branching out from this node. */
+	struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+static dm_codes codes_0_1_X[2] =
+{
+	{
+		"0", "1", "X"
+	}
+};
+static dm_codes codes_0_7_X[2] =
+{
+	{
+		"0", "7", "X"
+	}
+};
+static dm_codes codes_0_X_X[2] =
+{
+	{
+		"0", "X", "X"
+	}
+};
+static dm_codes codes_1_1_X[2] =
+{
+	{
+		"1", "1", "X"
+	}
+};
+static dm_codes codes_1_X_X[2] =
+{
+	{
+		"1", "X", "X"
+	}
+};
+static dm_codes codes_1_X_X_or_4_4_4[2] =
+{
+	{
+		"1", "X", "X"
+	},
+	{
+		"4", "4", "4"
+	}
+};
+static dm_codes codes_2_43_43[2] =
+{
+	{
+		"2", "43", "43"
+	}
+};
+static dm_codes codes_2_4_4[2] =
+{
+	{
+		"2", "4", "4"
+	}
+};
+static dm_codes codes_3_3_3[2] =
+{
+	{
+		"3", "3", "3"
+	}
+};
+static dm_codes codes_3_3_3_or_4_4_4[2] =
+{
+	{
+		"3", "3", "3"
+	},
+	{
+		"4", "4", "4"
+	}
+};
+static dm_codes codes_4_4_4[2] =
+{
+	{
+		"4", "4", "4"
+	}
+};
+static dm_codes codes_5_54_54[2] =
+{
+	{
+		"5", "54", "54"
+	}
+};
+static dm_codes codes_5_5_5[2] =
+{
+	{
+		"5", "5", "5"
+	}
+};
+static dm_codes codes_5_5_5_or_45_45_45[2] =
+{
+	{
+		"5", "5", "5"
+	},
+	{
+		"45", "45", "45"
+	}
+};
+static dm_codes codes_5_5_5_or_4_4_4[2] =
+{
+	{
+		"5", "5", "5"
+	},
+	{
+		"4", "4", "4"
+	}
+};
+static dm_codes codes_5_5_X[2] =
+{
+	{
+		"5", "5", "X"
+	}
+};
+static dm_codes codes_66_66_66[2] =
+{
+	{
+		"66", "66", "66"
+	}
+};
+static dm_codes codes_6_6_6[2] =
+{
+	{
+		"6", "6", "6"
+	}
+};
+static dm_codes codes_7_7_7[2] =
+{
+	{
+		"7", "7", "7"
+	}
+};
+static dm_codes codes_8_8_8[2] =
+{
+	{
+		"8", "8", "8"
+	}
+};
+static dm_codes codes_94_94_94_or_4_4_4[2] =
+{
+	{
+		"94", "94", "94"
+	},
+	{
+		"4", "4", "4"
+	}
+};
+static dm_codes codes_9_9_9[2] =
+{
+	{
+		"9", "9", "9"
+	}
+};
+static dm_codes codes_X_X_6_or_X_X_X[2] =
+{
+	{
+		"X", "X", "6"
+	},
+	{
+		"X", "X", "X"
+	}
+};
+
+/* Coding for alternative following letters in sequence. */
+static dm_letter letter_A[] =
+{
+	{
+		'I', NULL, codes_0_1_X
+	},
+	{
+		'J', NULL, codes_0_1_X
+	},
+	{
+		'U', NULL, codes_0_7_X
+	},
+	{
+		'Y', NULL, codes_0_1_X
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_CH[] =
+{
+	{
+		'S', NULL, codes_5_54_54
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_CS[] =
+{
+	{
+		'Z', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_CZ[] =
+{
+	{
+		'S', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_C[] =
+{
+	{
+		'H', letter_CH, codes_5_5_5_or_4_4_4
+	},
+	{
+		'K', NULL, codes_5_5_5_or_45_45_45
+	},
+	{
+		'S', letter_CS, codes_4_4_4
+	},
+	{
+		'Z', letter_CZ, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_DR[] =
+{
+	{
+		'S', NULL, codes_4_4_4
+	},
+	{
+		'Z', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_DS[] =
+{
+	{
+		'H', NULL, codes_4_4_4
+	},
+	{
+		'Z', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_DZ[] =
+{
+	{
+		'H', NULL, codes_4_4_4
+	},
+	{
+		'S', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_D[] =
+{
+	{
+		'R', letter_DR, NULL
+	},
+	{
+		'S', letter_DS, codes_4_4_4
+	},
+	{
+		'T', NULL, codes_3_3_3
+	},
+	{
+		'Z', letter_DZ, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_E[] =
+{
+	{
+		'I', NULL, codes_0_1_X
+	},
+	{
+		'J', NULL, codes_0_1_X
+	},
+	{
+		'U', NULL, codes_1_1_X
+	},
+	{
+		'Y', NULL, codes_0_1_X
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_F[] =
+{
+	{
+		'B', NULL, codes_7_7_7
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_I[] =
+{
+	{
+		'A', NULL, codes_1_X_X
+	},
+	{
+		'E', NULL, codes_1_X_X
+	},
+	{
+		'O', NULL, codes_1_X_X
+	},
+	{
+		'U', NULL, codes_1_X_X
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_K[] =
+{
+	{
+		'H', NULL, codes_5_5_5
+	},
+	{
+		'S', NULL, codes_5_54_54
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_M[] =
+{
+	{
+		'N', NULL, codes_66_66_66
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_N[] =
+{
+	{
+		'M', NULL, codes_66_66_66
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_O[] =
+{
+	{
+		'I', NULL, codes_0_1_X
+	},
+	{
+		'J', NULL, codes_0_1_X
+	},
+	{
+		'Y', NULL, codes_0_1_X
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_P[] =
+{
+	{
+		'F', NULL, codes_7_7_7
+	},
+	{
+		'H', NULL, codes_7_7_7
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_R[] =
+{
+	{
+		'S', NULL, codes_94_94_94_or_4_4_4
+	},
+	{
+		'Z', NULL, codes_94_94_94_or_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SCHTC[] =
+{
+	{
+		'H', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SCHTSC[] =
+{
+	{
+		'H', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SCHTS[] =
+{
+	{
+		'C', letter_SCHTSC, NULL
+	},
+	{
+		'H', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SCHT[] =
+{
+	{
+		'C', letter_SCHTC, NULL
+	},
+	{
+		'S', letter_SCHTS, NULL
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SCH[] =
+{
+	{
+		'D', NULL, codes_2_43_43
+	},
+	{
+		'T', letter_SCHT, codes_2_43_43
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SC[] =
+{
+	{
+		'H', letter_SCH, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SHC[] =
+{
+	{
+		'H', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SHTC[] =
+{
+	{
+		'H', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SHTS[] =
+{
+	{
+		'H', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SHT[] =
+{
+	{
+		'C', letter_SHTC, NULL
+	},
+	{
+		'S', letter_SHTS, NULL
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SH[] =
+{
+	{
+		'C', letter_SHC, NULL
+	},
+	{
+		'D', NULL, codes_2_43_43
+	},
+	{
+		'T', letter_SHT, codes_2_43_43
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_STC[] =
+{
+	{
+		'H', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_STR[] =
+{
+	{
+		'S', NULL, codes_2_4_4
+	},
+	{
+		'Z', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_STSC[] =
+{
+	{
+		'H', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_STS[] =
+{
+	{
+		'C', letter_STSC, NULL
+	},
+	{
+		'H', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_ST[] =
+{
+	{
+		'C', letter_STC, NULL
+	},
+	{
+		'R', letter_STR, NULL
+	},
+	{
+		'S', letter_STS, NULL
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SZC[] =
+{
+	{
+		'S', NULL, codes_2_4_4
+	},
+	{
+		'Z', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_SZ[] =
+{
+	{
+		'C', letter_SZC, NULL
+	},
+	{
+		'D', NULL, codes_2_43_43
+	},
+	{
+		'T', NULL, codes_2_43_43
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_S[] =
+{
+	{
+		'C', letter_SC, codes_2_4_4
+	},
+	{
+		'D', NULL, codes_2_43_43
+	},
+	{
+		'H', letter_SH, codes_4_4_4
+	},
+	{
+		'T', letter_ST, codes_2_43_43
+	},
+	{
+		'Z', letter_SZ, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_TC[] =
+{
+	{
+		'H', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_TR[] =
+{
+	{
+		'S', NULL, codes_4_4_4
+	},
+	{
+		'Z', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_TSC[] =
+{
+	{
+		'H', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_TS[] =
+{
+	{
+		'C', letter_TSC, NULL
+	},
+	{
+		'H', NULL, codes_4_4_4
+	},
+	{
+		'Z', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_TTC[] =
+{
+	{
+		'H', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_TTSC[] =
+{
+	{
+		'H', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_TTS[] =
+{
+	{
+		'C', letter_TTSC, NULL
+	},
+	{
+		'Z', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_TT[] =
+{
+	{
+		'C', letter_TTC, NULL
+	},
+	{
+		'S', letter_TTS, codes_4_4_4
+	},
+	{
+		'Z', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_TZ[] =
+{
+	{
+		'S', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_T[] =
+{
+	{
+		'C', letter_TC, codes_4_4_4
+	},
+	{
+		'H', NULL, codes_3_3_3
+	},
+	{
+		'R', letter_TR, NULL
+	},
+	{
+		'S', letter_TS, codes_4_4_4
+	},
+	{
+		'T', letter_TT, NULL
+	},
+	{
+		'Z', letter_TZ, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_U[] =
+{
+	{
+		'E', NULL, codes_0_1_X
+	},
+	{
+		'I', NULL, codes_0_1_X
+	},
+	{
+		'J', NULL, codes_0_1_X
+	},
+	{
+		'Y', NULL, codes_0_1_X
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_ZDZ[] =
+{
+	{
+		'H', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_ZD[] =
+{
+	{
+		'Z', letter_ZDZ, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_ZHDZ[] =
+{
+	{
+		'H', NULL, codes_2_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_ZHD[] =
+{
+	{
+		'Z', letter_ZHDZ, NULL
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_ZH[] =
+{
+	{
+		'D', letter_ZHD, codes_2_43_43
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_ZSC[] =
+{
+	{
+		'H', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_ZS[] =
+{
+	{
+		'C', letter_ZSC, NULL
+	},
+	{
+		'H', NULL, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_Z[] =
+{
+	{
+		'D', letter_ZD, codes_2_43_43
+	},
+	{
+		'H', letter_ZH, codes_4_4_4
+	},
+	{
+		'S', letter_ZS, codes_4_4_4
+	},
+	{
+		'\0'
+	}
+};
+static dm_letter letter_[] =
+{
+	{
+		'A', letter_A, codes_0_X_X
+	},
+	{
+		'B', NULL, codes_7_7_7
+	},
+	{
+		'C', letter_C, codes_5_5_5_or_4_4_4
+	},
+	{
+		'D', letter_D, codes_3_3_3
+	},
+	{
+		'E', letter_E, codes_0_X_X
+	},
+	{
+		'F', letter_F, codes_7_7_7
+	},
+	{
+		'G', NULL, codes_5_5_5
+	},
+	{
+		'H', NULL, codes_5_5_X
+	},
+	{
+		'I', letter_I, codes_0_X_X
+	},
+	{
+		'J', NULL, codes_1_X_X_or_4_4_4
+	},
+	{
+		'K', letter_K, codes_5_5_5
+	},
+	{
+		'L', NULL, codes_8_8_8
+	},
+	{
+		'M', letter_M, codes_6_6_6
+	},
+	{
+		'N', letter_N, codes_6_6_6
+	},
+	{
+		'O', letter_O, codes_0_X_X
+	},
+	{
+		'P', letter_P, codes_7_7_7
+	},
+	{
+		'Q', NULL, codes_5_5_5
+	},
+	{
+		'R', letter_R, codes_9_9_9
+	},
+	{
+		'S', letter_S, codes_4_4_4
+	},
+	{
+		'T', letter_T, codes_3_3_3
+	},
+	{
+		'U', letter_U, codes_0_X_X
+	},
+	{
+		'V', NULL, codes_7_7_7
+	},
+	{
+		'W', NULL, codes_7_7_7
+	},
+	{
+		'X', NULL, codes_5_54_54
+	},
+	{
+		'Y', NULL, codes_1_X_X
+	},
+	{
+		'Z', letter_Z, codes_4_4_4
+	},
+	{
+		'a', NULL, codes_X_X_6_or_X_X_X
+	},
+	{
+		'e', NULL, codes_X_X_6_or_X_X_X
+	},
+	{
+		't', NULL, codes_3_3_3_or_4_4_4
+	},
+	{
+		'\0'
+	}
+};
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..3e97e000ee
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,288 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <d...@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+	chomp;
+	my ($letters, $codes) = split(/\s+/);
+	my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+	# Find alternate code transitions for calculation of storage.
+	# The first character can never yield more than two alternate codes, and is not considered here.
+	if (@codes > 1) {
+		for my $j (1..2) {
+			for my $i (0..1) {
+				my ($a, $b) = (substr($codes[$i][$j], -1, 1), substr($codes[($i + 1)%2][$j], 0, 1));
+				$alternates{$a}{$b} = 1 if $a ne $b;
+			}
+		}
+	}
+	my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+	my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+	$codes{$key} = $val;
+
+	for my $letter (split(/,/, $letters)) {
+		my $ref = $table->[0];
+		# Link each character to the next in the letter combination.
+		my @c = split(//, $letter);
+		my $last_c = pop(@c);
+		for my $c (@c) {
+			$ref->{$c} //= [ {}, undef ];
+			$ref->{$c}[0] //= {};
+			$ref = $ref->{$c}[0];
+		}
+		# The sound code for the letter combination is stored at the last character.
+		$ref->{$last_c}[1] = $key;
+	}
+}
+close(DATA);
+
+# Find the maximum number of alternate codes in one position.
+my $alt_x = $alternates{"X"};
+while (my ($k, $v) = each %alternates) {
+	if (defined delete $v->{"X"}) {
+		for my $x (keys %$alt_x) {
+			$v->{$x} = 1;
+		}
+	}
+}
+my $max_alt = (reverse sort (2, map { scalar keys %$_ } values %alternates))[0];
+
+# The maximum number of nodes and leaves in the soundex tree.
+# These are safe estimates, but in practice somewhat higher than the actual maximums.
+# Note that the first character can never yield more than two alternate codes,
+# hence the calculations are performed as sums of two subtrees.
+my $digits = 6;
+# Number of nodes (sum of geometric progression).
+my $max_nodes = 2 + 2*(1 - $max_alt**($digits - 1))/(1 - $max_alt);
+# Number of leaves (exponential of base number).
+my $max_leaves = 2*$max_alt**($digits - 2);
+
+print <<EOF;
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS $digits
+#define DM_MAX_ALTERNATE_CODES $max_alt
+#define DM_MAX_NODES $max_nodes
+#define DM_MAX_LEAVES $max_leaves
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+
+typedef char dm_code[2 + 1];	/* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];	/* Start of name, before a vowel, any other */
+
+/* Letter in input sequence */
+struct dm_letter
+{
+	char		letter;			/* Present letter in sequence */
+	struct dm_letter *letters;	/* List of possible successive letters */
+	dm_codes   *codes;			/* Code sequence(s) for complete sequence */
+};
+
+/* Node in soundex code tree */
+struct dm_node
+{
+	int			soundex_length; /* Length of generated soundex code */
+	char		soundex[DM_MAX_CODE_DIGITS + 1];	/* Soundex code */
+	int			is_leaf;		/* Candidate for complete soundex code */
+	int			last_update;	/* Letter number for last update of node */
+	char		code_digit;		/* Last code digit, 0 - 9 */
+
+	/*
+	 * One or two alternate code digits leading to this node. If there are two
+	 * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+	 * back to the same node.
+	 */
+	char		prev_code_digits[2];
+	/* One or two alternate code digits moving forward. */
+	char		next_code_digits[2];
+	/* ORed together code index(es) used to reach current node. */
+	int			prev_code_index;
+	int			next_code_index;
+	/* Nodes branching out from this node. */
+	struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+	print "static dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+	my ($ref, $letter) = @_;
+
+	my @letters = ();
+
+	my $h = $ref->[0];
+	for my $key (sort keys %$h) {
+		$ref = $h->{$key};
+		my $children = "NULL";
+		if (defined $ref->[0]) {
+			$children = "letter_$letter$key";
+			hash2code($ref, "$letter$key");
+		}
+		my $codes = $ref->[1] // "NULL";
+		push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+	}
+
+	print "static dm_letter letter_$letter\[\] =\n{\n";
+	for (@letters) {
+		print "$_,\n";
+	}
+	print "\t{\n\t\t'\\0'\n\t}\n";
+	print "};\n";
+}
+
+hash2code($table, '');
+
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# X = NC (not coded)
+#
+# Note that the following letters are coded with substitute letters
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" below does not correspond to the table referred to above,
+# however it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+
+__DATA__
+AI,AJ,AY				0,1,X
+AU						0,7,X
+a						X,X,6|X,X,X
+A						0,X,X
+B						7,7,7
+CHS						5,54,54
+CH						5,5,5|4,4,4
+CK						5,5,5|45,45,45
+CZ,CS,CSZ,CZS			4,4,4
+C						5,5,5|4,4,4
+DRZ,DRS					4,4,4
+DS,DSH,DSZ				4,4,4
+DZ,DZH,DZS				4,4,4
+D,DT					3,3,3
+EI,EJ,EY				0,1,X
+EU						1,1,X
+e						X,X,6|X,X,X
+E						0,X,X
+FB						7,7,7
+F						7,7,7
+G						5,5,5
+H						5,5,X
+IA,IE,IO,IU				1,X,X
+I						0,X,X
+J						1,X,X|4,4,4
+KS						5,54,54
+KH						5,5,5
+K						5,5,5
+L						8,8,8
+MN						66,66,66
+M						6,6,6
+NM						66,66,66
+N						6,6,6
+OI,OJ,OY				0,1,X
+O						0,X,X
+P,PF,PH					7,7,7
+Q						5,5,5
+RZ,RS					94,94,94|4,4,4
+R						9,9,9
+SCHTSCH,SCHTSH,SCHTCH	2,4,4
+SCH						4,4,4
+SHTCH,SHCH,SHTSH		2,4,4
+SHT,SCHT,SCHD			2,43,43
+SH						4,4,4
+STCH,STSCH,SC			2,4,4
+STRZ,STRS,STSH			2,4,4
+ST						2,43,43
+SZCZ,SZCS				2,4,4
+SZT,SHD,SZD,SD			2,43,43
+SZ						4,4,4
+S						4,4,4
+TCH,TTCH,TTSCH			4,4,4
+TH						3,3,3
+TRZ,TRS					4,4,4
+TSCH,TSH				4,4,4
+TS,TTS,TTSZ,TC			4,4,4
+TZ,TTZ,TZS,TSZ			4,4,4
+t						3,3,3|4,4,4
+T						3,3,3
+UI,UJ,UY,UE				0,1,X
+U						0,X,X
+V						7,7,7
+W						7,7,7
+X						5,54,54
+Y						1,X,X
+ZDZ,ZDZH,ZHDZH			2,4,4
+ZD,ZHD					2,43,43
+ZH,ZS,ZSCH,ZSH			4,4,4
+Z						4,4,4
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..d778facd49 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,224 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)
 
+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff 
+-----------------
+ 054795
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff 
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff 
+-----------------
+ 793000
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff 
+-----------------
+ 587943 587433
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff 
+-----------------
+ 665600
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff 
+-----------------
+ 596740 496740
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff 
+-----------------
+ 595400 495400
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff 
+-----------------
+ 586660
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff 
+-----------------
+ 673950
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff 
+-----------------
+ 798600
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff 
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff 
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff 
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff 
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff 
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff 
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff                     
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff 
+-----------------
+ 945744
+(1 row)
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff 
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff 
+-----------------
+ 479000
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff 
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff 
+-----------------
+ 095600
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff 
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff 
+-----------------
+ 079600
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff 
+-----------------
+ 564000 540000
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+       daitch_mokotoff       
+-----------------------------
+ 794640 794400 746400 744000
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff 
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff 
+-----------------
+ 364000 464000
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff              
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff 
+-----------------
+ 743000
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff 
+-----------------
+ 710000
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff 
+-----------------
+ 566600
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                     daitch_mokotoff                     
+---------------------------------------------------------
+ 651900 654900 654190 654490 645190 645490 641900 644900
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                     daitch_mokotoff                     
+---------------------------------------------------------
+ 059400 054000 054940 054400 045940 045400 049400 044000
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                     daitch_mokotoff                     
+---------------------------------------------------------
+ 794575 794574 794750 794740 745750 745740 747500 747400
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..b9d7b229a3
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
similarity index 92%
rename from contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
rename to contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
index 41de9d949b..2a8a100699 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
@@ -42,3 +42,7 @@ LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
 CREATE FUNCTION dmetaphone_alt (text) RETURNS text
 AS 'MODULE_PATHNAME', 'dmetaphone_alt'
 LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..c83572fe76 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,60 @@ SELECT metaphone('GUMBO', 4);
 
 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 382e54be91..08781778f8 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -241,4 +241,101 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>
 
+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(text source) returns text
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should be preferably a single word or name.
+   Any alternative soundex codes are separated by space, which makes the returned
+   text suited for use in Full Text Search, see <xref linkend="textsearch"/> and
+   <xref linkend="functions-textsearch"/>.
+  </para>
+
+  <para>
+   Example:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_name(v_name text) RETURNS text AS $$
+  SELECT string_agg(daitch_mokotoff(n), ' ')
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', soundex_name(v_name))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT to_tsquery('simple', quote_literal(soundex_name(v_name)))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+-- Note that searches could be more efficient with the tsvector in a separate column
+-- (no recalculation on table row recheck).
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s VALUES ('John Doe');
+INSERT INTO s VALUES ('Jane Roe');
+INSERT INTO s VALUES ('Public John Q.');
+INSERT INTO s VALUES ('George Best');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+</programlisting>
+ </sect2>
+
 </sect1>

Reply via email to