As several have pointed out, the regular expressions in 4.1 are much more
conformant to the spec than they were in 4.0. As a result, there are some
minor code changes that are sometimes necessary when moving an application from
4.0 (or 3.2) to 4.1. These are Good Things, as the new regexes are more
efficient and fix several bugs in the previous implementation. Here is a blurb
from the 4.1 release notes pointing out some of these incompatibilities:
Regular Expression Changes
The regular expression evaluation in 4.1 has been improved and is more
efficient than in 4.0. It is also more conformant to the XQuery specification
than 4.0. Some of these conformance changes will cause some regular expressions
to behave differently in 4.1 than they did in 4.0. Regular expressions are used
in the fn:matches, fn:tokenize, and fn:replace functions. Some of the changes
are as follows:
* You can no longer match the empty string in a regular expression with
fn:replace or fn:tokenize (you can with fn:matches, however). Previously, the
empty string in fn:replace and fn:tokenize matched everything, but in 4.1 it
throws an XDMP-MATCHZERO exception.
* If you place an invalid escape sequence in a regular expression, an
exception is raised. In 4.0, some invalid escape sequences (for example, \/)
were allowed. In 4.1, any invalid escape sequence throws an exception.
* Certain invalid character classes are no longer allowed. For example, the
regular expression [z-a] is acceptable in 4.0 (although it does nothing), but
throws an exception in 4.1. All invalid character classes now throw an
exception.
For example, each of the following expressions contains an invalid regular
expression, and throws an exception in 4.1 but completes in 4.0:
xquery version "1.0-ml";
fn:replace("","\s*","x"),
fn:replace("http://marklogic.com", "\/", "X"),
fn:replace("abc", "[z-a]", "z")
(:
Throws exception in 4.1, returns the following in 4.0:
x
http:XXmarklogic.com
abc
:)
If you have any code that uses these regular expressions, you should review the
regular expressions and rewrite them as needed.
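As a rough sketch of what such rewrites might look like, the three failing expressions above could be adjusted as follows. The replacement patterns are assumptions about what the original code intended, not part of the release notes, and the first input string is changed so the pattern has something to match.

```xquery
xquery version "1.0-ml";

(: \s+ needs at least one whitespace character, so it cannot match
   the empty string and does not raise XDMP-MATCHZERO :)
fn:replace("a  b", "\s+", "x"),

(: a forward slash is not a metacharacter, so it needs no escape :)
fn:replace("http://marklogic.com", "/", "X"),

(: assumes the range was meant to be written low-to-high :)
fn:replace("abc", "[a-z]", "z")

(:
Returns:
axb
http:XXmarklogic.com
zzz
:)
```

Note that these are not drop-in equivalents: with \s+ an empty input string comes back unchanged rather than as "x", and [a-z] actually matches where [z-a] did nothing in 4.0, so each rewrite depends on what the original code was trying to do.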
-Danny
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of G. Ken Holman
Sent: Wednesday, March 24, 2010 7:15 AM
To: [email protected]
Subject: Re: [MarkLogic Dev General] fn:tokenize change from v3 to v4?
At 2010-03-24 08:53 -0500, Strawn, M. Shane wrote:
>In v3, fn:tokenize would do this:
>
>fn:tokenize("word", "") ==> ("w", "o", "r", "d")
>
>...but in v4 that returns an error:
>
>[1.0-ml] XDMP-MATCHZERO: (err:FORX0003)
>fn:tokenize("word", "") -- Pattern matches zero-length string
>
>With the 2nd param interpreted as a reg-ex
>pattern, I'm not sure why it ever worked, but it was handy.
>
>Any comments on why this is the case?
It is specified to do so ... from 7.6.4 fn:tokenize:
  If the supplied $pattern matches a zero-length string,
  that is, if fn:matches("", $pattern, $flags) returns
  true, then an error is raised: [err:FORX0003].
I'm guessing this is because there would be an infinite
number of occurrences of an empty string
in-between two characters before "moving" to the
next character in the search for non-matching substrings.
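The rule quoted above suggests a simple pre-check: ask fn:matches whether a pattern matches the empty string before handing it to fn:tokenize. The following is a minimal sketch of that idea; local:matches-empty is a hypothetical helper name, and the sample patterns are only illustrations.

```xquery
xquery version "1.0-ml";

(: hypothetical helper: true when the pattern can match a zero-length
   string, which is the condition the spec gives for err:FORX0003 :)
declare function local:matches-empty($pattern as xs:string) as xs:boolean
{
  fn:matches("", $pattern)
};

for $pattern in ("", "\s*", "\s+", ",")
return
  if (local:matches-empty($pattern))
  then fn:concat("'", $pattern, "' matches the empty string")
  else fn:concat("'", $pattern, "' is safe to pass to fn:tokenize")
```

Under the spec wording, the first two patterns should be reported as matching the empty string and the last two as safe.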
>How's this on speed/efficiency for replicating
>it? Not that I know how fast the tokenize version was.
Nor do I ... but anything accomplished by the
processor would be faster than anything coded by
hand (modulo any optimization and rewriting done
by the processor). So the objective would be to
find a written algorithm that would compare well to other written algorithms.
>for $pos in 1 to fn:string-length("word") return
>fn:substring("word", $pos, 1) ==> ("w", "o", "r", "d")
I think the following might be faster because you
won't be indexing into the entire input string
once for each character in the string, yet it is
using the same number of function invocations:
T:\ftemp>type shane.xq
for $each in string-to-codepoints( "word" )
return codepoints-to-string( $each )
T:\ftemp>xquery shane.xq
<?xml version="1.0" encoding="UTF-8"?>w o r d
T:\ftemp>
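For reuse in a MarkLogic module, the codepoint approach above could be wrapped in a small function. This is only a sketch: local:chars is a hypothetical name, and no claim is made here about how its speed compares with the old fn:tokenize behavior.

```xquery
xquery version "1.0-ml";

(: hypothetical helper: splits a string into one-character strings
   by round-tripping through its codepoints :)
declare function local:chars($input as xs:string?) as xs:string*
{
  for $cp in fn:string-to-codepoints($input)
  return fn:codepoints-to-string($cp)
};

local:chars("word")
(: returns ("w", "o", "r", "d") :)
```

Passing the empty sequence or an empty string yields an empty sequence, since there are no codepoints to iterate over.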
I hope this helps.
. . . . . . . . . . . . Ken
--
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/q/
G. Ken Holman mailto:[email protected]
Male Cancer Awareness Nov'07 http://www.CraneSoftwrights.com/q/bc
Legal business disclaimers: http://www.CraneSoftwrights.com/legal
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general