RE: [MarkLogic Dev General] fn:tokenize change from v3 to v4?

Danny Sokolsky Wed, 24 Mar 2010 09:53:55 -0700

As several have pointed out, the regular expressions in 4.1 are much more 
conformant to the spec than they were in 4.0.  As a result, there are some 
minor code changes that are sometimes necessary when moving an application from 
4.0 (or 3.2) to 4.1.  These are Good Things, as the new regexes are more 
efficient and fix several bugs in the previous implementation.  Here is a blurb 
from the 4.1 release notes pointing out some of these incompatibilities:


Regular Expression Changes

The regular expression evaluation in 4.1 has been improved and is more 
efficient than in 4.0. It is also more conformant to the XQuery specification 
than 4.0. Some of these conformance changes will cause some regular expressions 
to behave differently in 4.1 than they did in 4.0. Regular expressions are used 
in the fn:matches, fn:tokenize, and fn:replace functions. Some of the changes 
are as follows:

    * You can no longer match the empty string in a regular expression with 
fn:replace or fn:tokenize (you can with fn:matches, however). Previously, the 
empty string in fn:replace and fn:tokenize matched everything, but in 4.1 it 
throws an XDMP-MATCHZERO exception.

    * If you place an invalid escape sequence in a regular expression, an 
exception is raised. In 4.0, some invalid escape sequences (for example, \/) 
were allowed. In 4.1, any invalid escape sequence throws an exception.

    * Certain invalid character classes are no longer allowed. For example, the 
regular expression [z-a] is acceptable in 4.0 (although it does nothing), but 
throws an exception in 4.1. All invalid character classes now throw an 
exception.

For example, each of the following expressions contains an invalid regular 
expression, and throws an exception in 4.1 but completes in 4.0:

    xquery version "1.0-ml";

    fn:replace("","\s*","x"),
    fn:replace("http://marklogic.com", "\/", "X"),
    fn:replace("abc", "[z-a]", "z")
    (: 
       Throws exception in 4.1, returns the following in 4.0:
       x
       http:XXmarklogic.com
       abc
    :)

If you have any code that uses these regular expressions, you should review the 
regular expressions and rewrite them as needed.
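
As a rough sketch of what such a rewrite might look like for the three 
expressions above (the replacement patterns and the "a  b" input are 
assumptions about what the original code intended, not something taken from 
the release notes):

    xquery version "1.0-ml";

    (: "\s+" cannot match the empty string, so no XDMP-MATCHZERO :)
    fn:replace("a  b", "\s+", "x"),
    (: "/" needs no escape in an XQuery regular expression :)
    fn:replace("http://marklogic.com", "/", "X"),
    (: assumes the range a-z is what the author meant by [z-a] :)
    fn:replace("abc", "[a-z]", "z")
    (: 
       Expected results:
       axb
       http:XXmarklogic.com
       zzz
    :)

Note that the first rewrite is not a drop-in replacement: with "\s+", an empty 
input string comes back unchanged instead of becoming "x".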

-Danny

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of G. Ken Holman
Sent: Wednesday, March 24, 2010 7:15 AM
To: [email protected]
Subject: Re: [MarkLogic Dev General] fn:tokenize change from v3 to v4?

At 2010-03-24 08:53 -0500, Strawn, M. Shane wrote:
>In v3, fn:tokenize would do this:
>
>fn:tokenize("word", "") ==> ("w", "o", "r", "d")
>
>...but in v4 that returns an error:
>
>[1.0-ml] XDMP-MATCHZERO: (err:FORX0003) 
>fn:tokenize("word", "") -- Pattern matches zero-length string
>
>With the 2nd param interpreted as a reg-ex 
>pattern, I'm not sure why it ever worked, but it was handy.
>
>Any comments on why this is the case?

It is specified to do so ... from 7.6.4 fn:tokenize:

   If the supplied $pattern matches a zero-length string,
   that is, if fn:matches("", $pattern, $flags) returns
   true, then an error is raised: [err:FORX0003].

I'm guessing because there would be an infinite 
number of occurrences of an empty string 
in-between two characters before "moving" to the 
next character in the search for non-matching substrings.

>How's this on speed/efficiency for replicating 
>it?  Not that I know how fast the tokenize version was.

Nor do I ... but anything accomplished by the 
processor would be faster than anything coded by 
hand (modulo any optimization and rewriting done 
by the processor).  So the objective would be to 
find a written algorithm that would compare well to other written algorithms.

>for $pos in 1 to fn:string-length("word") return 
>fn:substring("word", $pos, 1) ==> ("w", "o", "r", "d")

I think the following might be faster because you 
won't be indexing into the entire input string 
once for each character in the string, yet it is 
using the same number of function invocations:

T:\ftemp>type shane.xq
for $each in string-to-codepoints( "word" )
return codepoints-to-string( $each )
T:\ftemp>xquery shane.xq
<?xml version="1.0" encoding="UTF-8"?>w o r d
T:\ftemp>
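
Pulling the pieces of this thread together, here is a minimal sketch of a 
wrapper that applies the FORX0003 rule quoted earlier as a guard before calling 
fn:tokenize. The function name local:split and the error name local:ZEROLEN 
are hypothetical, and falling back to a per-character split for the empty 
pattern is only an assumption that mimics the old v3 behaviour:

    xquery version "1.0-ml";

    declare function local:split(
      $input as xs:string,
      $pattern as xs:string
    ) as xs:string*
    {
      if ($pattern eq "") then
        (: old v3 behaviour for "": one item per character :)
        for $cp in fn:string-to-codepoints($input)
        return fn:codepoints-to-string($cp)
      else if (fn:matches("", $pattern)) then
        (: same condition the spec uses for err:FORX0003 :)
        fn:error(xs:QName("local:ZEROLEN"),
                 "pattern matches zero-length string")
      else
        fn:tokenize($input, $pattern)
    };

    local:split("word", ""),      (: w o r d :)
    local:split("a,b,c", ",")     (: a b c :)

Patterns other than "" that can match a zero-length string (for example 
"\s*") still raise an error here; whether to reject or rewrite them is a 
choice for the calling code.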

I hope this helps.

. . . . . . . . . . . . Ken

--
XSLT/XQuery training:         San Carlos, California 2010-04-26/30
Principles of XSLT for XQuery Writers: San Francisco,CA 2010-05-03
XSLT/XQuery training:                 Ottawa, Canada 2010-05-10/14
XSLT/XQuery/UBL/Code List training: Trondheim,Norway 2010-06-02/11
Vote for your XML training:   http://www.CraneSoftwrights.com/q/i/
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/q/
G. Ken Holman                 mailto:[email protected]
Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/q/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general