Re: [MarkLogic Dev General] en/em dashes punctuation?

Danny Sokolsky Fri, 27 Jan 2012 13:22:25 -0800

For the empty word-query, the search API allows you to configure the behavior 
to be no-results (the default) or all-results.  With all-results, an empty 
term will give an empty and-query, which is defined to match everything.  For 
example:


import module namespace search =
  "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

search:parse("",
  <search:options xmlns="http://marklogic.com/appservices/search">
    <term>
      <empty apply="no-results" />
      <term-option>diacritic-insensitive</term-option>
      <term-option>unwildcarded</term-option>
    </term>
  </search:options>)
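
The options above spell out the default, no-results. What follows is a minimal sketch of the all-results variant, assuming the same search API module and the same term options; only the apply value changes. The variable name $options is just for illustration, and the serialized result is not shown because it may differ between releases.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Same term options as the example above, with the empty-term
   behavior switched from no-results to all-results. :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <term>
      <empty apply="all-results"/>
      <term-option>diacritic-insensitive</term-option>
      <term-option>unwildcarded</term-option>
    </term>
  </options>
(: An empty query string is the case the <empty> element controls. :)
return search:parse("", $options)
```

With this setting, the empty query string should parse to the empty and-query described above rather than to a query that matches nothing.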

Now if you have punctuation in there, that is not an empty term, so I am not 
sure why you are seeing an empty word-query for that.  For example, the 
following:

search:parse("+")

returns

<cts:word-query qtextref="cts:text" xmlns:cts="http://marklogic.com/cts">
  <cts:text>+</cts:text>
</cts:word-query>

So maybe I am not understanding you?

-Danny

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Will Thompson
Sent: Friday, January 27, 2012 12:45 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] en/em dashes punctuation?

Danny - Yes, a good alternative to using replace().

I'm not sure if this sounds like a reasonable feature request, but I have some 
beef with the way search:parse behaves. First, I didn't realize that an empty 
word query - cts:word-query("") - would return zero results; I assumed it would 
return everything. But given that, if you search:parse with the 
case-insensitive option, then any string input with floating punctuation will 
return a cts:word-query("&punctuation;",("case-insensitive")), which is 
equivalent to the empty word query. And-ed with the rest of the parsed query, 
it will always return nothing. It's an edge case, but seems undesirable in any 
scenario.
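
One way to sidestep the edge case is to remove floating dashes from the query string before it reaches search:parse. The sketch below is only an illustration: local:strip-floating-dashes is a hypothetical helper, not part of the search API, and it assumes that only the en dash (U+2013), em dash (U+2014), and horizontal bar (U+2015) need handling. The ASCII hyphen is left alone because the default grammar uses it for negation.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Hypothetical helper: replaces an en dash, em dash, or horizontal bar
   that stands alone between whitespace with a single space, so it is
   never parsed as a term of its own. Dashes that join two words
   without surrounding whitespace are left untouched. :)
declare function local:strip-floating-dashes($q as xs:string) as xs:string
{
  fn:normalize-space(
    fn:replace($q, "\s+[&#x2013;&#x2014;&#x2015;]+\s+", " "))
};

search:parse(
  local:strip-floating-dashes("Venue &#x2015; Motion to Transfer"))
```

Note that this also strips a floating dash inside a quoted phrase, which may or may not be what you want if the phrase is meant to match the title exactly.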

-Will


-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: Friday, January 27, 2012 10:25 AM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] en/em dashes punctuation?

Just a thought here:  Since those values are coming out of a lexicon (I am 
assuming), maybe your JavaScript code that displays them in the browser can 
remove the unwanted characters (and maybe lower-case them?) before it gives the 
suggestion to the UI?  That way people can still search for those characters by 
typing them in.

-Danny

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Will Thompson
Sent: Thursday, January 26, 2012 7:20 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] en/em dashes punctuation?

Okay, very cool. Thanks.

-Will

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Michael Blakeley
Sent: Thursday, January 26, 2012 7:17 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] en/em dashes punctuation?

The initial release did not include query-eval for direct cts:query output. I 
added that a bit later, to make it easier to get started.

But the default query-eval module is still meant to be modified and customized 
for your application. It takes some shortcuts: for example, it assumes that 
foo=bar maps to cts:element-value-query(xs:QName('foo'), 'bar'). A 
sophisticated application might use a lookup table to map codes to QNames, and 
possibly mix in other options like collation, language, case-sensitivity, etc.
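
A rough sketch of the lookup-table idea, for illustration only. The field codes, element names, and function name below are made up, and this is not code from xqysp; how it would plug into query-eval is left out. It assumes MarkLogic's built-in map:map functions and cts:element-value-query.

```xquery
xquery version "1.0-ml";

(: Hypothetical lookup table from short field codes to element QNames.
   The codes and element names are invented for this example. :)
declare variable $FIELDS as map:map :=
  let $m := map:map()
  let $_ := map:put($m, "title", fn:QName("", "doc-title"))
  let $_ := map:put($m, "court", fn:QName("", "court-name"))
  return $m;

(: Builds a value query for code=value. Known codes use the mapped
   QName plus extra options; unknown codes fall back to treating the
   code itself as the element name. :)
declare function local:field-query(
  $code as xs:string,
  $value as xs:string
) as cts:query
{
  let $qname := map:get($FIELDS, $code)
  return
    if (fn:exists($qname))
    then cts:element-value-query(
           $qname, $value, ("case-insensitive", "lang=en"))
    else cts:element-value-query(fn:QName("", $code), $value)
};

local:field-query("title", "Motion to Transfer")
```

The fallback branch mirrors the shortcut described above, where foo=bar maps straight to an element named foo.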

If you customize query-eval in interesting or useful ways, I am open to adding 
more sample evaluators to github.

-- Mike
 
On 26 Jan 2012, at 19:04 , Will Thompson wrote:

> Thanks Mike - Your parser was on my radar, but I did not realize it returns 
> ML query syntax (I thought you had to DIY to get it from your AST to ML). 
> 
> As a quick fix I may have to just 
> replace($querystring,"&endash;|&emdash;",""), but I will definitely give xqysp 
> a second look.
> 
> -Will
> 
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Michael Blakeley
> Sent: Thursday, January 26, 2012 6:46 PM
> To: General MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] en/em dashes punctuation?
> 
> Will, you could also try https://github.com/mblakele/xqysp
> 
> Here are some tests comparing search:parse output with the query-eval module 
> that is included as an example in xqysp. Since you have access to the 
> abstract syntax tree from the parser, you can also customize the cts:query 
> output to suit your needs. Integration with the search API is easy: instead 
> of calling search:search, pass the cts:query to search:resolve.
> 
> import module namespace qe="com.blakeley.xqysp.query-eval" at 
> "query-eval.xqy";
> import module namespace search = "http://marklogic.com/appservices/search"
>     at "/MarkLogic/appservices/search/search.xqy";
> 
> for $q in ('foo-bar', 'foo - bar', 'foo–bar', 'foo – bar')
> return element test {
>  attribute query { $q },
>  search:parse($q),
>  qe:parse($q) }
> =>
> <test query="foo-bar">
>  <cts:word-query qtextref="cts:text" xmlns:cts="http://marklogic.com/cts">
>    <cts:text>foo-bar</cts:text>
>  </cts:word-query>
>  <cts:word-query xmlns:cts="http://marklogic.com/cts">
>    <cts:text xml:lang="en">foo-bar</cts:text>
>  </cts:word-query>
> </test>
> <test query="foo - bar">
>  <cts:and-query strength="20" qtextjoin="" qtextgroup="( )" 
> xmlns:cts="http://marklogic.com/cts">
>    <cts:word-query qtextref="cts:text">
>      <cts:text>foo</cts:text>
>    </cts:word-query>
>    <cts:not-query qtextstart="-" strength="40">
>      <cts:word-query qtextref="cts:text">
>       <cts:text>bar</cts:text>
>      </cts:word-query>
>    </cts:not-query>
>  </cts:and-query>
>  <cts:and-query xmlns:cts="http://marklogic.com/cts">
>    <cts:word-query>
>      <cts:text xml:lang="en">foo</cts:text>
>    </cts:word-query>
>    <cts:not-query>
>      <cts:word-query>
>       <cts:text xml:lang="en">bar</cts:text>
>      </cts:word-query>
>    </cts:not-query>
>  </cts:and-query>
> </test>
> <test query="foo–bar">
>  <cts:word-query qtextref="cts:text" xmlns:cts="http://marklogic.com/cts">
>    <cts:text>foo–bar</cts:text>
>  </cts:word-query>
>  <cts:and-query xmlns:cts="http://marklogic.com/cts">
>    <cts:word-query>
>      <cts:text xml:lang="en">foo</cts:text>
>    </cts:word-query>
>    <cts:word-query>
>      <cts:text xml:lang="en">bar</cts:text>
>    </cts:word-query>
>  </cts:and-query>
> </test>
> <test query="foo – bar">
>  <cts:and-query strength="20" qtextjoin="" qtextgroup="( )" 
> xmlns:cts="http://marklogic.com/cts">
>    <cts:word-query qtextref="cts:text">
>      <cts:text>foo</cts:text>
>    </cts:word-query>
>    <cts:and-query strength="20" qtextjoin="" qtextgroup="( )">
>      <cts:word-query qtextref="cts:text">
>       <cts:text>–</cts:text>
>      </cts:word-query>
>      <cts:word-query qtextref="cts:text">
>       <cts:text>bar</cts:text>
>      </cts:word-query>
>    </cts:and-query>
>  </cts:and-query>
>  <cts:and-query xmlns:cts="http://marklogic.com/cts">
>    <cts:word-query>
>      <cts:text xml:lang="en">foo</cts:text>
>    </cts:word-query>
>    <cts:word-query>
>      <cts:text xml:lang="en">bar</cts:text>
>    </cts:word-query>
>  </cts:and-query>
> </test>
> 
> -- Mike
> 
> On 26 Jan 2012, at 18:13 , Danny Sokolsky wrote:
> 
>> Sorry Will, I misunderstood (thought you meant the - was being treated as 
>> negation).
>> 
>> Since you are pulling those from data that you know is in your database, how 
>> about making the whole thing a phrase by surrounding it with quotes?  
>> Here is an example of what I mean:
>> 
>> xquery version "1.0-ml";
>> 
>> import module namespace search = 
>> "http://marklogic.com/appservices/search";
>> at "/MarkLogic/appservices/search/search.xqy";
>> 
>> search:parse('"Venue &#x2015; Motion to Transfer"')
>> 
>> -Danny
>> 
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Will Thompson
>> Sent: Thursday, January 26, 2012 6:01 PM
>> To: General MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] en/em dashes punctuation?
>> 
>> Thanks Danny, but I'm not sure I follow. Maybe that was not the best 
>> explanation. Rather than use dashes like hyphens, I just want a search for 
>> something like "Venue &#x2015; Motion to Transfer" to ignore the dash when 
>> parsed. Instead, it appears to be treated like a word and is not ignored:
>> 
>> cts:and-query(
>> (cts:word-query("Venue", ("case-insensitive", "punctuation-insensitive", 
>> "lang=en"), 1),
>>  cts:word-query("&#x2015;", ("case-insensitive", "punctuation-insensitive", 
>> "lang=en"), 1),
>>  cts:word-query("Motion", ("case-insensitive", "punctuation-insensitive", 
>> "lang=en"), 1),
>>  cts:word-query("to", ("case-insensitive", "punctuation-insensitive", 
>> "lang=en"), 1),
>>  cts:word-query("Transfer", ("case-insensitive", "punctuation-insensitive", 
>> "lang=en"), 1)),
>> ())
>> 
>> -Will
>> 
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Danny Sokolsky
>> Sent: Thursday, January 26, 2012 5:35 PM
>> To: General MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] en/em dashes punctuation?
>> 
>> Hi Will,
>> 
>> One thing you can do is change your search grammar to use a joiner other 
>> than the negative sign.
>> 
>> Here is the default grammar:
>> 
>> http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/xml/search-dev-guide/search-api.xml%2344520
>> 
>> -Danny
>> 
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Will Thompson
>> Sent: Thursday, January 26, 2012 4:34 PM
>> To: General MarkLogic Developer Discussion
>> Subject: [MarkLogic Dev General] en/em dashes punctuation?
>> 
>> Our search autocomplete pulls from doc titles, some of which contain en or 
>> em dashes. However, if the dash is "floating"- i.e.: "Venue - Motion to 
>> Transfer" - search:parse parses it into the query, even though 
>> <term-option>punctuation-insensitive</term-option> is included in the <term> 
>> section of the search options node. I thought it may just be getting ignored 
>> when it's evaluated but it's definitely limiting the query.
>> 
>> I can confirm they are punctuation: cts:tokenize("hyphen-en-em-bar―")[. 
>> instance of cts:punctuation] => "- - - ―"
>> 
>> But is there an exception here (the same way hyphens are always parsed to 
>> negate)? Do I just need to remove these from the query string before calling 
>> search:parse? If there is a cleaner way, that would be great.
>> 
>> 
>> Best,
>> 
>> Will
>> _______________________________________________
>> General mailing list
>> [email protected]
>> http://developer.marklogic.com/mailman/listinfo/general
> 

