Re: Request for your Regular Expressions (Re: (XERCESJ-589) Bug with pattern restriction on long strings)

Michael Glavassevich Mon, 25 Jun 2007 05:43:36 -0700

Hi Geoff,

The W3C test suite [1] contains many regex tests, particularly this large 
bucket [2] of tests contributed last year. That should give you a pretty 
good selection, though beware that some of the tests are invalid. The known 
problems are documented in the W3C's Bugzilla here [3].


As for the code, one thing that may not be obvious is that it needs to be 
thread-safe. This is because the RegularExpression objects are cached in 
the schema grammar which could be shared with several parsers and 
validators. To avoid having many large synchronized blocks, the matching 
code keeps its state local to the call stack. Hoping that's the approach 
you've been taking.
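
To make that concrete, here is a minimal sketch of the idea. It is not the 
actual Xerces code: SharedPattern, its toy syntax ('.' matches any one 
character) and SharedPatternDemo are made up purely for illustration. The 
compiled form is immutable and everything that changes during a match is a 
local variable, so one cached instance can serve many threads without locks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a compiled pattern cached in a shared grammar. */
final class SharedPattern {
    // Written once in the constructor and never modified afterwards,
    // so concurrent readers need no synchronization.
    private final char[] ops;

    SharedPattern(String pattern) {
        this.ops = pattern.toCharArray();
    }

    /** Toy syntax: '.' matches any single character, anything else itself. */
    boolean matches(String input) {
        // All mutable match state (here just the index i) lives in locals
        // on the calling thread's stack; no field is written during a match.
        if (input.length() != ops.length) {
            return false;
        }
        for (int i = 0; i < ops.length; i++) {
            if (ops[i] != '.' && ops[i] != input.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}

class SharedPatternDemo {
    /** Uses one shared pattern from several threads; returns the number of wrong results. */
    static int countWrongAnswers(SharedPattern shared, int threads, int runs) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    int wrong = 0;
                    for (int i = 0; i < runs; i++) {
                        if (!shared.matches("abc")) wrong++; // should match
                        if (shared.matches("abd")) wrong++;  // should not match
                    }
                    return wrong;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        SharedPattern shared = new SharedPattern("a.c");
        // prints "wrong answers: 0"
        System.out.println("wrong answers: " + countWrongAnswers(shared, 4, 10000));
    }
}
```

If matches() instead kept its position in an instance field, two threads 
sharing the object could overwrite each other's progress, and the only fix 
would be a lock around the whole match. An explicit stack that replaces 
recursion can follow the same rule as long as it is allocated inside the 
call rather than stored on the shared object.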

Thanks.

[1] http://www.w3.org/XML/2004/xml-schema-test-suite/index.html#releases
[2] http://dev.w3.org/cvsweb/XML/xml-schema-test-suite/2004-01-14/xmlschema2006-11-06/msMeta/Regex_w3c.xml
[3] http://www.w3.org/Bugs/Public/buglist.cgi?query_format=specific&order=relevance+desc&bug_status=__open__&product=XML+Schema+Test+Suite&content=

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [EMAIL PROTECTED]

"Geoff M. Granum" <[EMAIL PROTECTED]> wrote on 
06/25/2007 04:15:27 AM:

> (If you don't care about the particulars, but have some regexes you can 
> contribute, jump to the code bit. Thanks)
> 
> I have two implementations to test; one is a (somewhat) naive linked 
> list stack manager, the other is (as yet) still recursive.
> 
> The former works, but I put it together as a proof of concept and don't 
> trust it much. Fifty-two return points in one method is a tad much. 
> Implemented with a raw java.util.Stack, it is ten times as slow as the 
> original, and with a private static LocalStack class built as a 
> LinkedList, it is twice as slow.
> 
> Though, 10K runs of the first thousand chars of the two example regex 
> patterns take ~1.2 and 2.6 seconds, respectively. So 0.12 ms and 0.26 ms 
> per run. I'm rather set against ANY performance decrement, or I'd have 
> just verified that code and moved on.
> 
> The latter implementation is a refactor of the method to a single point 
> of exit. THAT goal is working, now I have to make sure that I can add 
> values to an internal stack manager without blowing away any state -- 
> some of the CASE statements are a mite obtuse, and I don't like using 
> breaks much. Breaks also seem to affect the ability of the optimizer to 
> do its job, as the last CASE I modified (op.CLOSURE) gave a 10% 
> performance boost without it. Although I'm suspicious, as it's late and 
> now the stack overflows somewhat (ok, a lot) earlier than before. I did 
> add a number of variables, so it's possible I made no mistake in the 
> logic (I'd better not have!).
> 
> --- The request part ---
> 
> Regardless of the final form, I need to populate a test library:
> 
> I have a few regular expressions lying around, and I figure I'll parse 
> in a few of my XML files and modify the RegularExpression class to dump 
> anything it sees to a file... I still doubt I'd have more than 20, and 
> none of them shockingly complex.
> 
> So if you could send me your favorite regular expressions, along with a 
> couple of strings to match them against (some pass, some fail, but 
> indicate which), it would be a big help.
> 
> Even better, if you could format them like this sample:
> 
> testCases.add(new TestCase(
>    "Overall description",
>    "Your Regex Pattern",
>    new SubCase("A description", shouldPass, "matchString" ),
>    new SubCase("A description2", shouldPass, "matchString2" ),
>    ... more SubCases ...
> ));
> 
> I would be able to paste them straight into the unit test and run them. 
> The SubCase argument uses varArgs, so add as many as you want/will. Feel 
> free to add your 'contributed by:' to the overall description area for 
> credit... Though I'd remind you not to include a parsable (or any, lest 
> random-someone ask you for help later) e-mail address on this list, as 
> it is public and archived.
> 
> My own direct e-mail address is (my first name @ my last name).biz. And 
> if someone has written a parser for THAT, they can have it.
> 
> The more complex your tests the better, for the beat down. Tailored 
> regexes would be grand for focused testing (e.g. the simplest lookahead, 
> lookbehind, singleline, multiline, etc). But I figure that's asking for 
> real work.
> 
> Also, or instead, if you have a 'regular expression rich' schema and 
> conforming xml file that you can send (think 'might become public'), I 
> should be able to parse those out without much trouble.
> 
> And yes (obviously), my test library uses 1.5 features... I'll convert 
> it if the changes are approved for commit. Keeps me sane.
> 
> Of course the changes to RegularExpression are using JDK 1.3 as a 
> target, as that is the lowest I've available. My memory of the 
> differences between 1.2 and 1.3 is fuzzy, but I don't think anything I'm 
> using has changed since 1.0. My only real concern is that my JVM has a 
> better optimizer and could be hiding poor performance that I induce.
> 
> 
> Thanks much,
> -- 
> Geoff M. Granum
> 760-534-1636
> Portland, Oregon
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


