Good morning,

Mike, thanks for the free consulting :)

I agree, in the case of English and French, I don't think I need to be concerned with the tokenizing.

I may end up submitting an RFE on this, since future projects and other customers may benefit; something similar to the thesaurus expansion API sounds like a good approach to me. In the meantime, I will be fine with the help you and Danny provided. As always, thank you to you and everyone at Mark Logic for the support in getting the most out of your product.

Best,
Shannon

On Oct 9, 2008, at 4:46 PM, Michael Blakeley wrote:

Shannon,

Hmm... I think we may be talking at cross-purposes. As I mentioned yesterday, I'm a little concerned about maintaining a distinction between cts:query term-level language, vs the language passed to cts:tokenize() in lp:get-cts-query-element().

When I mentioned the idea of adding another parameter to lp:get-cts-query(), I was thinking of the cts:tokenize() option. But I think I jumped to a conclusion there. French and English aren't that different, and I can't think of a place where the cts:tokenize() language would matter (as opposed to, for example, Chinese).

Based on this latest email, and the exchange with Danny, you'd like to pass multiple languages to lp:get-cts-query(), and get back an internally-expanded or-query for every language for each input term. This would work somewhat like thesaurus expansion. Is that correct?

If so, this does seem like a useful RFE (for lib-parser, or for MarkLogic Server). But you can also do this in your own code fairly easily:

xquery version "0.9-ml"

define function expand-languages($query as cts:query, $lang as xs:string*)
as cts:query
{
 if (empty($lang)) then $query else
 typeswitch($query)
 case cts:and-query return cts:and-query(
   for $q in cts:and-query-queries($query)
   return expand-languages($q, $lang),
   cts:and-query-options($query)
 )
 case cts:word-query return cts:or-query((
   let $opts :=
     cts:word-query-options($query)[not(starts-with(., 'lang='))]
   let $word := cts:word-query-text($query)
   for $i in $lang
   return cts:word-query($word, ($opts, concat('lang=', $i)))
 ))
 default return error(
   'UNIMPLEMENTED', text { 'no support for', xdmp:describe($query) } )
}

expand-languages(
 cts:and-query((
   cts:word-query('foo'),
   cts:word-query('bar')
 )), ('en', 'fr') )

=>
cts:and-query((
  cts:or-query((
    cts:word-query("foo", ("lang=en"), 1),
    cts:word-query("foo", ("lang=fr"), 1))),
  cts:or-query((
    cts:word-query("bar", ("lang=en"), 1),
    cts:word-query("bar", ("lang=fr"), 1)))),
  ())

Keep expanding the typeswitch to cover all the possibilities.
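
As an illustration of that next step, here is a minimal, untested sketch that adds cts:or-query and cts:not-query cases to the function above. It assumes the cts:or-query-queries(), cts:not-query-query(), and cts:word-query-weight() accessors behave like their and-query and word-query counterparts. Carrying the original word-query weight through is an addition that the version above does not make.

```xquery
xquery version "0.9-ml"

define function expand-languages($query as cts:query, $lang as xs:string*)
as cts:query
{
  (: With no languages requested, return the query unchanged. :)
  if (empty($lang)) then $query else
  typeswitch($query)
  (: Composite queries: recurse into each sub-query. :)
  case cts:and-query return cts:and-query(
    for $q in cts:and-query-queries($query)
    return expand-languages($q, $lang),
    cts:and-query-options($query)
  )
  case cts:or-query return cts:or-query(
    for $q in cts:or-query-queries($query)
    return expand-languages($q, $lang)
  )
  (: Assumption: the not-query weight is dropped here for brevity. :)
  case cts:not-query return cts:not-query(
    expand-languages(cts:not-query-query($query), $lang)
  )
  (: Leaf query: one word-query per language, joined by an or-query. :)
  case cts:word-query return cts:or-query((
    let $opts :=
      cts:word-query-options($query)[not(starts-with(., 'lang='))]
    let $word := cts:word-query-text($query)
    let $weight := cts:word-query-weight($query)
    for $i in $lang
    return cts:word-query($word, ($opts, concat('lang=', $i)), $weight)
  ))
  default return error(
    'UNIMPLEMENTED', text { 'no support for', xdmp:describe($query) } )
}

expand-languages(
  cts:or-query((
    cts:word-query('foo'),
    cts:not-query(cts:word-query('bar'))
  )), ('en', 'fr') )
```

Note that a negated term expands to "not (en or fr)", so a document matching the word in either language is excluded.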

-- Mike

Shannon wrote:
Thank you, Mike, that's very agreeable. Yes, per-query control over language awareness would be most useful! Given a form that accepts a query string and a language selector that includes an "all" option, the desired behavior is language-specific tokenization, in this case for English and French. Danny demystified the search recall logic, but lib-parser doesn't yet provide the full support needed to get the most out of the French language module. Currently I'm using the overloaded lp:get-cts-query() that takes $options as the 3rd argument. Maybe another overload with a 4th argument, or perhaps take the hint from the lang option if supplied?

Thanks,
Shannon

On Oct 8, 2008, at 5:29 PM, Michael Blakeley wrote:

Today, lib-parser calls cts:tokenize() without the language argument, so it always uses the database default language. So the tokenization is language-aware, but there's no per-query control over which language it uses.
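
To make the difference concrete, here is a small sketch. It assumes cts:tokenize() accepts a language code as an optional second argument, as the paragraph above implies, and that French language support is available on the server. The sample string is made up for illustration.

```xquery
xquery version "0.9-ml"

(: Tokenized with the database default language, as lib-parser does today. :)
cts:tokenize("l'école est fermée"),
(: Tokenized explicitly as French. :)
cts:tokenize("l'école est fermée", "fr")
```

Comparing the two token sequences shows whether the language argument changes anything for a given input.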

If per-query control over language awareness would be useful, how would you like to express it? As another (optional) argument to lp:get-cts-query()?

I'm a little concerned about maintaining a distinction between cts:query term-level language, vs the language passed to cts:tokenize() in lp:get-cts-query-element(). But if it's useful functionality, let's figure out how to add it.

-- Mike

Shannon wrote:
Hi,
Does anyone know whether lib-parser has support for language-aware tokenization, for lp:get-cts-query specifically?
Thanks,
__________________________________________________
Shannon Scott Shiflett, programmer/analyst with ROTUNDA,
The University of Virginia Press, Charlottesville, VA  USA
http://rotunda.upress.virginia.edu
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general