Thanks all for the pointed input here.  Of course -- string-to-codepoints,
codepoints-to-string, nice.

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Danny
Sokolsky
Sent: Wednesday, March 24, 2010 12:54 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] fn:tokenize change from v3 to v4?

As several have pointed out, the regular expressions in 4.1 are much
more conformant to the spec than they were in 4.0.  As a result, some
minor code changes are sometimes necessary when moving an application
from 4.0 (or 3.2) to 4.1.  These are Good Things, as the new regexes
are more efficient and fix several bugs in the previous implementation.
Here is a blurb from the 4.1 release notes pointing out some of these
incompatibilities:

Regular Expression Changes

The regular expression evaluation in 4.1 has been improved and is more
efficient than in 4.0. It is also more conformant to the XQuery
specification than 4.0. Some of these conformance changes will cause
some regular expressions to behave differently in 4.1 than they did in
4.0. Regular expressions are used in the fn:matches, fn:tokenize, and
fn:replace functions. Some of the changes are as follows:

    * You can no longer match the empty string in a regular expression
with fn:replace or fn:tokenize (you can with fn:matches, however).
Previously, the empty string in fn:replace and fn:tokenize matched
everything, but in 4.1 it throws an XDMP-MATCHZERO exception.

    * If you place an invalid escape sequence in a regular expression,
an exception is raised. In 4.0, some invalid escape sequences (for
example, \/) were allowed. In 4.1, any invalid escape sequence throws an
exception.

    * Certain invalid character classes are no longer allowed. For
example, the regular expression [z-a] is acceptable in 4.0 (although it
does nothing), but throws an exception in 4.1. All invalid character
classes now throw an exception.

For example, each of the following calls uses a regular expression that
throws an exception in 4.1 (the first matches the zero-length string;
the others are invalid) but completes in 4.0:

    xquery version "1.0-ml";

    fn:replace("","\s*","x"),
    fn:replace("http://marklogic.com", "\/", "X"),
    fn:replace("abc", "[z-a]", "z")
    (: 
       Throws exception in 4.1, returns the following in 4.0:
       x
       http:XXmarklogic.com
       abc
    :)

If you have any code that uses these regular expressions, you should
review the regular expressions and rewrite them as needed.
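As a sketch (these rewrites are mine, not from the release notes), the
three failing examples above might be adjusted for 4.1 along these lines:

```xquery
xquery version "1.0-ml";

(: sketch of possible 4.1-safe rewrites; adjust to your actual intent :)

(: "\s*" matches the zero-length string; match one-or-more instead :)
fn:replace(" a b ", "\s+", "x"),

(: "/" is not a regex metacharacter, so it needs no escape :)
fn:replace("http://marklogic.com", "/", "X"),

(: "[z-a]" is an empty (invalid) range; presumably "[a-z]" was meant :)
fn:replace("abc", "[a-z]", "z")
```

Each rewrite changes the pattern's meaning slightly, so check that the
new behavior matches what your application actually wanted.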

-Danny

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of G. Ken
Holman
Sent: Wednesday, March 24, 2010 7:15 AM
To: [email protected]
Subject: Re: [MarkLogic Dev General] fn:tokenize change from v3 to v4?

At 2010-03-24 08:53 -0500, Strawn, M. Shane wrote:
>
>In v3, fn:tokenize would do this:
>
>fn:tokenize("word", "") ==> ("w", "o", "r", "d")
>
>...but in v4 that returns an error:
>
>[1.0-ml] XDMP-MATCHZERO: (err:FORX0003) fn:tokenize("word", "") -- 
>Pattern matches zero-length string
>
>With the 2nd param interpreted as a reg-ex pattern, I'm not sure why it
>ever worked, but it was handy.
>
>Any comments on why this is the case?

It is specified to do so ... from 7.6.4 fn:tokenize:

   If the supplied $pattern matches a zero-length string,
   that is, if fn:matches("", $pattern, $flags) returns
   true, then an error is raised: [err:FORX0003].

I'm guessing it is because there would be an infinite number of
occurrences of the empty string between any two characters before
"moving" to the next character in the search for non-matching substrings.
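One defensive pattern (my sketch, not something from the spec) is to run
exactly the test that 7.6.4 describes before tokenizing; local:safe-tokenize
is a hypothetical helper name, not a built-in:

```xquery
xquery version "1.0-ml";

(: guard against XDMP-MATCHZERO / err:FORX0003 before calling fn:tokenize :)
declare function local:safe-tokenize(
  $input as xs:string,
  $pattern as xs:string
) as xs:string*
{
  if (fn:matches("", $pattern))
  then fn:error((), "Pattern matches zero-length string")
  else fn:tokenize($input, $pattern)
};

local:safe-tokenize("a,b,c", ",")  (: ("a", "b", "c") :)
```

This fails fast with your own message instead of the engine's exception,
which can make the migration errors easier to track down.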

>How's this on speed/efficiency for replicating it?  Not that I know how
>fast the tokenize version was.

Nor do I ... but anything accomplished by the processor would be faster
than anything coded by hand (modulo any optimization and rewriting done
by the processor).  So the objective is to find a hand-written algorithm
that compares well against other hand-written algorithms.

>for $pos in 1 to fn:string-length("word") return fn:substring("word", 
>$pos, 1) ==> ("w", "o", "r", "d")

I think the following might be faster because you won't be indexing into
the entire input string once for each character in the string, yet it is
using the same number of function invocations:

T:\ftemp>type shane.xq
for $each in string-to-codepoints( "word" )
return codepoints-to-string( $each )

T:\ftemp>xquery shane.xq
<?xml version="1.0" encoding="UTF-8"?>w o r d

T:\ftemp>
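So, as a drop-in replacement for the old fn:tokenize($s, "") idiom, the
same expression could be wrapped in a helper (local:chars is my name for
it, just a sketch):

```xquery
xquery version "1.0-ml";

(: split a string into single-character strings, one per codepoint :)
declare function local:chars($s as xs:string) as xs:string*
{
  for $cp in fn:string-to-codepoints($s)
  return fn:codepoints-to-string($cp)
};

local:chars("word")  (: ("w", "o", "r", "d") :)
```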

I hope this helps.

. . . . . . . . . . . . Ken

--
XSLT/XQuery training:         San Carlos, California 2010-04-26/30
Principles of XSLT for XQuery Writers: San Francisco,CA 2010-05-03
XSLT/XQuery training:                 Ottawa, Canada 2010-05-10/14
XSLT/XQuery/UBL/Code List training: Trondheim,Norway 2010-06-02/11
Vote for your XML training:   http://www.CraneSoftwrights.com/q/i/
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/q/
G. Ken Holman                 mailto:[email protected]
Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/q/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
