Thanks all for the pointed input here. Of course---string-to-codepoints, codepoints-to-string, nice.
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: Wednesday, March 24, 2010 12:54 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] fn:tokenize change from v3 to v4?

As several have pointed out, the regular expressions in 4.1 are much more conformant to the spec than they were in 4.0. As a result, some minor code changes are sometimes necessary when moving an application from 4.0 (or 3.2) to 4.1. These are Good Things, as the new regexes are more efficient and fix several bugs in the previous implementation.

Here is a blurb from the 4.1 release notes pointing out some of these incompatibilities:

Regular Expression Changes

The regular expression evaluation in 4.1 has been improved and is more efficient than in 4.0. It is also more conformant to the XQuery specification than 4.0 was. Some of these conformance changes will cause some regular expressions to behave differently in 4.1 than they did in 4.0. Regular expressions are used in the fn:matches, fn:tokenize, and fn:replace functions. Some of the changes are as follows:

* You can no longer match the empty string in a regular expression with fn:replace or fn:tokenize (you can with fn:matches, however). Previously, a pattern matching the empty string in fn:replace and fn:tokenize matched everywhere, but in 4.1 it throws an XDMP-MATCHZERO exception.

* If you place an invalid escape sequence in a regular expression, an exception is raised. In 4.0, some invalid escape sequences (for example, \/) were allowed. In 4.1, any invalid escape sequence throws an exception.

* Certain invalid character classes are no longer allowed. For example, the regular expression [z-a] is accepted in 4.0 (although it matches nothing), but throws an exception in 4.1. All invalid character classes now throw an exception.
For example, each of the following expressions runs afoul of one of these changes, and throws an exception in 4.1 but completes in 4.0:

xquery version "1.0-ml";
fn:replace("", "\s*", "x"),
fn:replace("http://marklogic.com", "\/", "X"),
fn:replace("abc", "[z-a]", "z")
(: Throws an exception in 4.1; returns the following in 4.0:
   x
   http:XXmarklogic.com
   abc :)

If you have any code that uses regular expressions like these, you should review the expressions and rewrite them as needed.

-Danny

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of G. Ken Holman
Sent: Wednesday, March 24, 2010 7:15 AM
To: [email protected]
Subject: Re: [MarkLogic Dev General] fn:tokenize change from v3 to v4?

At 2010-03-24 08:53 -0500, Strawn, M. Shane wrote:
>In v3, fn:tokenize would do this:
>
>fn:tokenize("word", "") ==> ("w", "o", "r", "d")
>
>...but in v4 that returns an error:
>
>[1.0-ml] XDMP-MATCHZERO: (err:FORX0003) fn:tokenize("word", "") --
>Pattern matches zero-length string
>
>With the 2nd param interpreted as a reg-ex pattern, I'm not sure why it
>ever worked, but it was handy.
>
>Any comments on why this is the case?

It is specified to do so ... from 7.6.4 fn:tokenize:

  If the supplied $pattern matches a zero-length string, that is, if
  fn:matches("", $pattern, $flags) returns true, then an error is
  raised: [err:FORX0003].

I'm guessing this is because there would be an infinite number of occurrences of an empty string between any two characters before "moving" to the next character in the search for non-matching substrings.

>How's this on speed/efficiency for replicating it? Not that I know how
>fast the tokenize version was.

Nor do I ... but anything accomplished by the processor would be faster than anything coded by hand (modulo any optimization and rewriting done by the processor).
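[Editor's note: for anyone updating code like Danny's examples, here is one possible set of 4.1-safe rewrites. This is a sketch based on standard XQuery regex semantics, not tested against any particular MarkLogic release:]

```xquery
xquery version "1.0-ml";

(: 1. A pattern that can match the empty string: require it to consume
   at least one character, e.g. \s+ instead of \s* :)
fn:replace("a  b", "\s+", "x"),    (: "axb" :)

(: 2. "/" is not a regex metacharacter, so it needs no escape;
   write it literally instead of the invalid \/ :)
fn:replace("http://marklogic.com", "/", "X"),  (: "http:XXmarklogic.com" :)

(: 3. Character-class ranges must run low-to-high: [a-z], not [z-a] :)
fn:replace("abc", "[a-z]", "z"),   (: "zzz" :)

(: The spec quote above also gives you a safety check: a pattern is
   unsafe for fn:replace/fn:tokenize iff it matches the empty string :)
fn:matches("", "\s*")              (: true, so \s* would trip XDMP-MATCHZERO :)
```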
So the objective would be to find a hand-written algorithm that compares well to other hand-written algorithms.

>for $pos in 1 to fn:string-length("word") return fn:substring("word",
>$pos, 1) ==> ("w", "o", "r", "d")

I think the following might be faster because you won't be indexing into the entire input string once for each character in the string, yet it uses the same number of function invocations:

T:\ftemp>type shane.xq
for $each in string-to-codepoints( "word" )
return codepoints-to-string( $each )

T:\ftemp>xquery shane.xq
<?xml version="1.0" encoding="UTF-8"?>w o r d
T:\ftemp>

I hope this helps.

. . . . . . . . . . . . Ken

--
XSLT/XQuery training: San Carlos, California 2010-04-26/30
Principles of XSLT for XQuery Writers: San Francisco, CA 2010-05-03
XSLT/XQuery training: Ottawa, Canada 2010-05-10/14
XSLT/XQuery/UBL/Code List training: Trondheim, Norway 2010-06-02/11
Vote for your XML training: http://www.CraneSoftwrights.com/q/i/
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/q/
G. Ken Holman mailto:[email protected]
Male Cancer Awareness Nov'07 http://www.CraneSoftwrights.com/q/bc
Legal business disclaimers: http://www.CraneSoftwrights.com/legal

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
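[Editor's note: putting the thread's two hand-written replacements for the old fn:tokenize("word", "") side by side. A sketch; both should yield ("w", "o", "r", "d") in any conformant XQuery processor:]

```xquery
xquery version "1.0-ml";

let $input := "word"

(: Shane's version: one fn:substring call per character, each of which
   indexes into the full input string :)
let $by-substring :=
  for $pos in 1 to fn:string-length($input)
  return fn:substring($input, $pos, 1)

(: Ken's version: decode to codepoints once, then re-encode each
   codepoint as a one-character string :)
let $by-codepoints :=
  for $cp in fn:string-to-codepoints($input)
  return fn:codepoints-to-string($cp)

return ($by-substring, $by-codepoints)
(: ==> ("w", "o", "r", "d", "w", "o", "r", "d") :)
```

A further advantage of the codepoints version is that it splits on codepoints rather than on fn:substring's positional indexing, which is the more natural unit when the input may contain characters outside the Basic Multilingual Plane.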