Darkhan,

Great work! As a former archaeologist, I found that your comment about Kazakh being agglutinative reminded me of ancient Sumerian, which has a similar structure.
You might find some interest among philologists and ancient Near Eastern historians for your work.

Philip

On Wed, Apr 8, 2026 at 9:56 AM Darkhan <[email protected]> wrote:

> Thanks for the suggestion!
>
> I did look into Snowball early on. There is actually a Turkish stemmer in
> Snowball already and Turkish is structurally very similar to Kazakh (both
> agglutinative Turkic languages). But honestly the Turkish one is pretty
> lobotomized: it only handles nominal suffixes and doesn’t account for verb
> morphology at all. The author even mentions this in the comments. So it
> kind of works for basic noun cases but falls apart on real text.
>
> The reason I went with a standalone extension is that Kazakh has suffix
> chains where vowel harmony interacts with each layer and you need
> context-aware decisions, not just stripping patterns from the end of the
> word. My stemmer uses a penalty-scored BFS over possible suffix
> decompositions instead of the linear step-by-step stripping that Snowball
> does. With 5-6 suffixes stacked on one word you really need to evaluate
> multiple decomposition paths to find the best one.
>
> That said, contributing a simplified Kazakh stemmer to Snowball is
> something I’d like to explore longer term. Even a basic version would be
> better than nothing, which is what exists today. Would need to figure out
> how much of the BFS logic can fit into the Snowball language or if a
> simpler approach gets close enough.
>
> Appreciate the pointer!
>
> Darkhan
>
> On Wed, 8 Apr 2026 at 19:42 Adrien Nayrat <[email protected]>
> wrote:
>
>> On 4/5/26 3:32 PM, Darkhan wrote:
>> > Hi all,
>> >
>> > I built pg_kazsearch, a PostgreSQL extension that adds full-text search
>> > support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or
>> > stop word list available in PostgreSQL, so anyone searching Kazakh text
>> > is stuck with trigram matching or application-level workarounds.
>> >
>> > Kazakh is agglutinative: a single word can carry 5-6 suffixes, which
>> > makes standard search approaches miss most relevant results.
>> > pg_kazsearch provides a custom Kazakh stemmer (core written in Rust),
>> > a stop word list, and a text search dictionary that plugs into the
>> > standard PostgreSQL FTS infrastructure: GIN indexes, ts_rank, and
>> > phrase search all work out of the box.
>> >
>> > I tested it on a dataset of 3,000 real Kazakh news articles. On the
>> > same query, pg_kazsearch returns 61 relevant articles vs 1 with trigram
>> > search, with a 23% improvement in recall overall.
>> >
>> > You can install it with a single command via deb package or Docker
>> > image, no compilation needed.
>> >
>> > Repo: https://github.com/darkhanakh/pg-kazsearch
>> >
>> > I'd appreciate any feedback, especially from anyone working on text
>> > search internals or with experience supporting non-Latin or
>> > agglutinative languages in PostgreSQL.
>> >
>> > Thanks, Darkhan
>> >
>>
>> Hello,
>>
>> Thanks for your work.
>> I don't know anything about Kazakh.
>>
>> But have you tried to add it to Snowball stemmer [1]?
>> As Postgres uses it, you have more chances to have Kazakh
>> supported in future versions.
>>
>>
>> 1: https://github.com/snowballstem/snowball
>>
>> --
>> Adrien NAYRAT
>> https://pro.anayrat.info
>>
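For anyone following the thread, the idea of a penalty-scored BFS over suffix decompositions can be illustrated with a toy sketch. This is not the pg_kazsearch implementation: the tiny suffix table, the slot ordering, the penalty weight, and the names `decompose` and `stem_word` are all assumptions made up for illustration, and the vowel-harmony check is heavily simplified.

```python
from collections import deque

# Simplified vowel classes (assumption: ambiguous letters are ignored).
BACK = set("аоұы")
FRONT = set("әеөүі")

# Toy suffix inventory: (suffix, slot). Slots count outward from the root:
# 1 = plural, 2 = possessive, 3 = case. Not a complete Kazakh suffix list.
SUFFIXES = [
    ("лар", 1), ("лер", 1), ("дар", 1), ("дер", 1), ("тар", 1), ("тер", 1),
    ("ым", 2), ("ім", 2),
    ("да", 3), ("де", 3), ("та", 3), ("те", 3),
]

MIN_ROOT = 3          # never strip below this many letters
HARMONY_PENALTY = 5   # hypothetical weight for a vowel-harmony mismatch


def harmony(text):
    """Classify text as 'back' or 'front' by its last known vowel."""
    for ch in reversed(text):
        if ch in BACK:
            return "back"
        if ch in FRONT:
            return "front"
    return None


def decompose(word):
    """Breadth-first walk over every way of peeling suffixes off the end.

    Each candidate is scored as accumulated penalty + remaining root length,
    so unstripped material costs something and implausible strips cost more.
    """
    queue = deque([(word, (), 0, 4)])  # root, stripped suffixes, penalty, slot limit
    candidates = []
    while queue:
        root, stripped, penalty, limit = queue.popleft()
        candidates.append((penalty + len(root), root, stripped))
        for suffix, slot in SUFFIXES:
            # Suffixes must come off in decreasing slot order (case first).
            if slot >= limit or not root.endswith(suffix):
                continue
            rest = root[: -len(suffix)]
            if len(rest) < MIN_ROOT:
                continue
            cost = 0 if harmony(rest) == harmony(suffix) else HARMONY_PENALTY
            queue.append((rest, stripped + (suffix,), penalty + cost, slot))
    return candidates


def stem_word(word):
    """Return the root of the lowest-scoring decomposition."""
    _, root, _ = min(decompose(word.lower()))
    return root


if __name__ == "__main__":
    for w in ["кітаптарымда", "мұғалім", "мұғалімдер"]:
        print(w, "->", stem_word(w))
```

In this sketch, a word that merely ends in something suffix-shaped (the final "ім" of "мұғалім") is left alone because stripping it would break vowel harmony and incur the penalty, whereas a purely linear stripper would have to special-case it.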
