The use case behind my questions about spell:suggest-detailed() is this:
We have a large set of XML-encoded versions of historic editions like
the Papers of George Washington, based on printed volumes. We have
converted all of their back-of-the-book indexes to XML. For any given
index reference, we want to know not just the page that the reference
points to, but the precise document that is referenced. (Lots of pages
contain parts of two or more documents, so a page number doesn't map to
documents with a one-to-one relation.) A human reader would do this by
scanning the page to figure out which document was the target of the
page reference. To automate this we need to throw as many search tools
at the problem as MarkLogic provides.
For example, an index reference like
    Surveyors: stopped by Indians on Kanawha River
might point to a document where the actual text reads "The Survayors
that went Down the Konaway as report gos was Stopt by the Shawnee
Endiens" (from a letter of 1774 to Washington). A straight word query
won't match the index keywords to that text, but a fuzzy search ought
to.
Strictly speaking, we don't need a dictionary to do this. We can just
compare the text on the referenced page to the substantive word tokens
in the index entry, and note the close matches like surveyors/Survayors
in scoring the document. Hence my question about algorithms based on
using spell:levenshtein-distance() and spell:double-metaphone() alone,
without a dictionary.
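
To make the dictionary-free idea concrete, here is a rough sketch in Python rather than XQuery, so it can be run outside MarkLogic. It only illustrates the comparison described above: a plain edit distance stands in for spell:levenshtein-distance(), and nothing here reproduces MarkLogic's weighted scores. The stopword list, the max_distance threshold of 3, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical stopword list: index-entry words too common to be useful keywords.
STOPWORDS = {"a", "an", "by", "in", "of", "on", "the", "to"}


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insertions, deletions and substitutions all cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def tokens(text: str) -> list:
    """Lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def close_matches(index_entry: str, page_text: str, max_distance: int = 3) -> dict:
    """Map each substantive index keyword to its nearest word in the page text.

    A keyword is kept only if its nearest page word is within max_distance
    edits (an arbitrary threshold chosen for this example).
    """
    page_words = set(tokens(page_text))
    hits = {}
    for keyword in tokens(index_entry):
        if keyword in STOPWORDS or not page_words:
            continue
        # Ties on distance are broken alphabetically so the result is stable.
        best = min(page_words, key=lambda w: (levenshtein(keyword, w), w))
        if levenshtein(keyword, best) <= max_distance:
            hits[keyword] = best
    return hits


entry = "Surveyors: stopped by Indians on Kanawha River"
text = ("The Survayors that went Down the Konaway as report gos "
        "was Stopt by the Shawnee Endiens")
matches = close_matches(entry, text)
```

The number of keywords that find a close match could then feed into the score for each candidate document on the page. A fixed threshold is crude; scaling it by word length, or combining it with a phonetic key comparison, would probably cut down on false matches for short words.
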
However, given the efficiency of MarkLogic's spelling lookup, we're
thinking that we might get nearly as good real-time performance by
tokenizing a whole index, converting it into a dictionary, and then
using spell:suggest() or spell:suggest-detailed() to find words in the
document that are close to, but not identical to, words in the index
entry.
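
The direction of that lookup, sketched in Python for illustration only: the index tokens become the "dictionary", and each word on the page is checked against it. Python's difflib.get_close_matches() is used here purely as a stand-in for spell:suggest(); it ranks by a similarity ratio, not by MarkLogic's distance scores, so the numbers are not comparable. The min_length filter, the cutoff of 0.6, and the function names are assumptions for the example.

```python
import difflib
import re


def tokens(text: str) -> list:
    """Lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(index_entries, min_length: int = 4) -> list:
    """Collect the distinct index tokens, dropping very short words.

    min_length is an assumed filter: short words such as "on" or "by"
    produce too many accidental near matches to be useful.
    """
    words = set()
    for entry in index_entries:
        words.update(t for t in tokens(entry) if len(t) >= min_length)
    return sorted(words)


def near_misses(page_text: str, dictionary, cutoff: float = 0.6) -> dict:
    """For each page word NOT in the dictionary, list close dictionary words."""
    known = set(dictionary)
    result = {}
    for word in sorted(set(tokens(page_text))):
        if word in known:
            continue  # exact matches are already found by a plain word query
        suggestions = difflib.get_close_matches(word, dictionary, n=3, cutoff=cutoff)
        if suggestions:
            result[word] = suggestions
    return result


dictionary = build_dictionary(["Surveyors: stopped by Indians on Kanawha River"])
text = ("The Survayors that went Down the Konaway as report gos "
        "was Stopt by the Shawnee Endiens")
suggested = near_misses(text, dictionary)
```

Each page word that comes back with a suggestion points to the index entries containing the suggested word, which is the association we are after.
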
Thanks for the illumination on the meanings of the score values,
David
On Fri, 14 Jan 2011, Walter Underwood wrote:
> word-distance is a weighted edit distance.
>
> distance is a score that combines the different distance measures.
>
> I expect it would be possible to implement a spell suggest algorithm in
> XQuery, but I also expect it would be slower than the built-in spelling
> suggestions.
>
> Are you trying to build spell suggestions or something else?
>
> wunder
> ==
> Walter Underwood
> Lead Engineer
>
> On Jan 13, 2011, at 7:40 PM, David Sewell wrote:
>
> > The output of spell:suggest-detailed() includes four value attributes
> > indicating scores for different variance tests, based (according to the
> > docs) on the raw values of spell:double-metaphone() and
> > spell:levenshtein-distance() as applied to two strings. For example,
> >
> > <spell:suggestion original="konnstitooshion"
> > dictionary="/test/temp-dictionary.xml"
> > xmlns:spell="http://marklogic.com/xdmp/spell">
> > <spell:word distance="138" key-distance="0" word-distance="285"
> > levenshtein-distance="6">constitution</spell:word>
> > </spell:suggestion>
> >
> > @levenshtein-distance is self-explanatory.
> >
> > @key-distance seems to be based directly on the double metaphones (in
> > this case, the double metaphones for "constitution" and
> > "konnstitooshion" are the same)
> >
> > But @distance and @word-distance mean what, exactly?
> >
> > Are algorithms available that would allow calculation of these values
> > using only spell:double-metaphone() and spell:levenshtein-distance(),
> > without needing to use a dictionary?
> >
> > DS
> > --
> > David Sewell, Editorial and Technical Manager
> > ROTUNDA, The University of Virginia Press
> > PO Box 400314, Charlottesville, VA 22904-4314 USA
> > Email: [email protected] Tel: +1 434 924 9973
> > Web: http://rotunda.upress.virginia.edu/
> > _______________________________________________
> > General mailing list
> > [email protected]
> > http://developer.marklogic.com/mailman/listinfo/general