I figured that a thesaurus approach might be needed. Alternatively, we
could preprocess the data so that every word with Hawaiian spelling is
stored alongside a normalized equivalent.

This does raise the more general question of how to implement searches
that match word variants with or without punctuation. You don't have to
be working with Hawaiian spelling for this to be an issue: matching a
hyphenated word when the query term is unhyphenated is precisely the
same problem.
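
A minimal sketch of the preprocessing idea, written in Python rather than
XQuery so it can be run outside MarkLogic. The function name, the set of
characters treated as okina variants, and the decision to also drop
hyphens and diacritics are all assumptions for illustration, not anything
MarkLogic provides:

```python
import re
import unicodedata

# Characters assumed to stand in for the okina: U+02BB, U+2018, U+2019,
# and the ASCII apostrophe. Adjust this set to match the actual data.
OKINA_VARIANTS = "\u02bb\u2018\u2019'"

# Match an okina variant or a hyphen only when it sits between two word
# characters, so quotation marks around a word are left alone.
_INTRA_WORD = re.compile(
    r"(?<=\w)[" + re.escape(OKINA_VARIANTS + "-") + r"](?=\w)"
)


def normalize_word(word: str) -> str:
    """Return a lowercase form with diacritics and intra-word
    okina/hyphen characters removed (hypothetical helper)."""
    # Decompose so that macrons and other marks become separate
    # combining characters, then drop those combining characters.
    decomposed = unicodedata.normalize("NFD", word)
    no_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return _INTRA_WORD.sub("", no_marks).lower()


if __name__ == "__main__":
    for spelling in ["Oahu", "O'ahu", "O\u2018ahu", "O\u02bbahu", "O-ahu"]:
        print(spelling, "->", normalize_word(spelling))
```

The normalized form could be stored next to the original spelling at load
time, and the same function applied to the user's query term, so that
both sides of the comparison are reduced the same way.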

David


On Thu, 26 Aug 2010, Jason Hunter wrote:

> Hi David,
>
> Unfortunately there's not yet an "okina-insensitive" query option.  :)
>
> If you run a punctuation-insensitive search for the phrase "O ahu" it will 
> match your three spellings (as well as O-ahu or O.ahu).  It'll be good to 
> have the fast phrases index enabled.
>
> To eliminate the (probably rare) spurious punctuation hit you could do a 
> cts:or-query of the various spellings and set the query terms to 
> punctuation-sensitive.  Then MarkLogic will use indexes to match the phrase 
> "O ahu" and internal filtering to verify the punctuation is correct.  Since 
> it'll be pretty rare to have any other punctuation in there, your filter hit 
> ratio will be quite high so performance will remain good.
>
> You can use a custom Hawaiian word thesaurus to take care of the cts:or-query 
> expansion.
>
> -jh-
>
> On Aug 26, 2010, at 7:23 PM, David Sewell wrote:
>
> > Problem: we need to create a full-text search on a text that may include
> > various spellings of Hawaiian names. Properly spelled, many Hawaiian
> > place names include the "okina" or glottal stop. Technically it is
> > Unicode U+02BB but is often represented by a single curly quote, U+2018,
> > or just ASCII apostrophe. For example, the island of Oahu may be spelled
> >
> > Oahu
> > O'ahu [apostrophe]
> > O‘ahu [curly quote, U+2018]
> > Oʻahu [okina, U+02BB]
> >
> > Now suppose all of those spellings are found in our data, and we want to
> > implement a search that will match all of them when a user searches on
> > "oahu".
> >
> > I can't think of any reasonable way to do this in MarkLogic.
> >
> > cts:word-query("oahu", 
> > ('case-insensitive','diacritic-insensitive','punctuation-insensitive'))
> >
> > matches only "Oahu". All the other spellings are tokenized on the
> > special characters and are therefore not matched.
> >
> > Is there any obvious way to do this, short of duplicating the text with
> > spellings normalized?
> >
> > --
> > David Sewell, Editorial and Technical Manager
> > ROTUNDA, The University of Virginia Press
> > PO Box 400314, Charlottesville, VA 22904-4314 USA
> > Email: [email protected]   Tel: +1 434 924 9973
> > Web: http://rotunda.upress.virginia.edu/
> > _______________________________________________
> > General mailing list
> > [email protected]
> > http://developer.marklogic.com/mailman/listinfo/general
>

-- 
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: [email protected]   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/
