Problem: we need to implement full-text search over text that may include various spellings of Hawaiian names. Properly spelled, many Hawaiian place names include the "okina", or glottal stop. Strictly, that character is Unicode U+02BB, but it is often represented by a left single curly quote (U+2018) or just an ASCII apostrophe. For example, the island of Oahu may be spelled in any of these ways:

Oahu
O'ahu [apostrophe]
O‘ahu [curly quote, U+2018]
Oʻahu [okina, U+02BB]

Now suppose all of those spellings are found in our data, and we want to implement a search that will match all of them when a user searches on "oahu".

I can't think of any reasonable way to do this in MarkLogic.

cts:word-query("oahu",
  ('case-insensitive', 'diacritic-insensitive', 'punctuation-insensitive'))

matches only "Oahu". In each of the other spellings the special character causes the name to be split into separate tokens, so none of them is matched.

Is there any obvious way to do this, short of duplicating the text with spellings normalized?
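
For reference, the normalization I would like to avoid amounts to something like the following. This is only a minimal sketch in plain XQuery: local:normalize-okina is a hypothetical helper, not a MarkLogic function, and it assumes the three characters mentioned above are the only variants present in the data.

```xquery
xquery version "1.0";

(: Hypothetical helper: strip the okina and its common stand-ins
   so that every variant collapses to a single spelling.
   &#x27;   = ASCII apostrophe
   &#x2018; = left single curly quote
   &#x02BB; = okina :)
declare function local:normalize-okina($s as xs:string) as xs:string
{
  (: fn:translate deletes any character of the second argument
     that has no counterpart in the (empty) third argument :)
  fn:translate($s, "&#x27;&#x2018;&#x02BB;", "")
};

for $spelling in ("Oahu", "O&#x27;ahu", "O&#x2018;ahu", "O&#x02BB;ahu")
return local:normalize-okina($spelling)
```

Each of the four spellings comes back as "Oahu", but using this means storing and indexing a second, normalized copy of the text, which is what I was hoping not to have to do.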

--
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: [email protected]   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/