In JMeter development I've encountered the need to parse some specific HTML constructs REALLY FAST. To do this, I was using this regular expression:
<BASE(?=\s)[^\>]*\sHREF\s*=\s*"([^">]*)"
|<(?:IMG|SCRIPT)(?=\s)[^\>]*\sSRC\s*=\s*"([^">]*)"
|<APPLET(?=\s)[^\>]*\sCODE(?:BASE)?\s*=\s*"([^">]*)"
|<(?:EMBED|OBJECT)(?=\s)[^\>]*\s(?:SRC|CODEBASE)\s*=\s*"([^">]*)"
|<(?:BODY|TABLE|TR|TD)(?=\s)[^\>]*\sBACKGROUND\s*=\s*"([^">]*)"
|<INPUT(?=\s)(?:[^\>]*\s(?:SRC\s*=\s*"([^">]*)"|TYPE\s*=\s*"image")){2,}
|<LINK(?=\s)(?:[^\>]*\s(?:HREF\s*=\s*"([^">]*)"|REL\s*=\s*"stylesheet")){2,}
but my ORO-based parser could hardly compete with SourceForge's HtmlParser for the same job, which was shocking to me, because HtmlParser does much more work...
In a random attempt to improve the situation, I tried this:
<(?:
BASE(?=\s)[^\>]*\sHREF\s*=\s*"([^">]*)"
|(?:IMG|SCRIPT)(?=\s)[^\>]*\sSRC\s*=\s*"([^">]*)"
|APPLET(?=\s)[^\>]*\sCODE(?:BASE)?\s*=\s*"([^">]*)"
|(?:EMBED|OBJECT)(?=\s)[^\>]*\s(?:SRC|CODEBASE)\s*=\s*"([^">]*)"
|(?:BODY|TABLE|TR|TD)(?=\s)[^\>]*\sBACKGROUND\s*=\s*"([^">]*)"
|INPUT(?=\s)(?:[^\>]*\s(?:SRC\s*=\s*"([^">]*)"|TYPE\s*=\s*"image")){2,}
|LINK(?=\s)(?:[^\>]*\s(?:HREF\s*=\s*"([^">]*)"|REL\s*=\s*"stylesheet")){2,}
)
now my test runs REALLY FAST!
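To make the difference between the two forms concrete, here is a reduced sketch with only the BASE and IMG/SCRIPT alternatives. It uses java.util.regex rather than ORO, so it is an assumption that both engines behave alike here; the class and method names are made up for illustration. The only change between the two patterns is that the leading "<" is written once, in front of the alternation, instead of once per alternative. A plausible (unverified) explanation for the speed-up is that a backtracking engine can then reject any position that does not start with "<" after a single comparison instead of trying every alternative in turn.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixFactoring {
    // Original shape: every alternative begins with its own '<'.
    public static final Pattern UNFACTORED = Pattern.compile(
        "<BASE(?=\\s)[^>]*\\sHREF\\s*=\\s*\"([^\">]*)\""
        + "|<(?:IMG|SCRIPT)(?=\\s)[^>]*\\sSRC\\s*=\\s*\"([^\">]*)\"",
        Pattern.CASE_INSENSITIVE);

    // Factored shape: the common '<' is matched once, before the alternation.
    public static final Pattern FACTORED = Pattern.compile(
        "<(?:BASE(?=\\s)[^>]*\\sHREF\\s*=\\s*\"([^\">]*)\""
        + "|(?:IMG|SCRIPT)(?=\\s)[^>]*\\sSRC\\s*=\\s*\"([^\">]*)\")",
        Pattern.CASE_INSENSITIVE);

    // Collects whichever capture group participated in each match.
    public static List<String> extract(Pattern p, CharSequence html) {
        List<String> urls = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    urls.add(m.group(g));
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<html><base href=\"http://example.com/\">"
            + "<p>text</p><img alt=\"x\" src=\"a.png\">"
            + "<script src=\"b.js\"></script>";
        // Both forms should report the same URLs; only the work done differs.
        System.out.println(extract(UNFACTORED, html));
        System.out.println(extract(FACTORED, html));
    }
}
```

Timing the two patterns over the same large input would show whether the effect carries over to a given engine; the sketch only demonstrates that the two forms extract the same URLs.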
First question (out of sheer curiosity): why is the latter regexp faster than the former?
Second question: I would like to run the regexps against the HTML content as a byte array (byte[]) without having to convert it into a string. Can ORO do this?
On the reasons why I don't want to do the byte[]-to-String conversion:
1/ Memory efficiency.
2/ I don't need it: even if there were multi-byte characters in the input, they are not part of my problem.
3/ The conversion can cause problems if the input is wrong.
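For illustration, this is roughly what I mean by matching without conversion, sketched against java.util.regex (not ORO, whose support for this is exactly the open question). The class name and the firstSrc helper are hypothetical. The wrapper exposes each byte as one char (an ISO-8859-1 style view), so no String copy of the whole page is made and malformed multi-byte sequences cannot make a decoder fail; only the matched substrings are turned into Strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Read-only view of a byte[] as a CharSequence: one byte maps to one char.
public class ByteArrayCharSequence implements CharSequence {
    private final byte[] data;
    private final int start;
    private final int end;

    public ByteArrayCharSequence(byte[] data) {
        this(data, 0, data.length);
    }

    private ByteArrayCharSequence(byte[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Mask to avoid sign extension of bytes >= 0x80.
        return (char) (data[start + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        // Shares the underlying array; nothing is copied here.
        return new ByteArrayCharSequence(data, start + from, start + to);
    }

    @Override
    public String toString() {
        // Only the viewed range is copied, and only when a String is requested.
        return new String(data, start, end - start, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: returns the first SRC attribute value, or null.
    public static String firstSrc(byte[] html) {
        Pattern p = Pattern.compile("\\sSRC\\s*=\\s*\"([^\">]*)\"",
                                    Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(new ByteArrayCharSequence(html));
        return m.find() ? m.group(1) : null;
    }
}
```

Any multi-byte characters in the input show up as several unrelated chars in this view, which is acceptable under reason 2 above as long as the patterns only look for ASCII constructs.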
Third question: I've read that byte-based regexp engines use a type of state machine that is significantly faster than the ones used by char-based regexp engines. Is that correct? Can ORO take advantage of this? Could you recommend a regexp engine that can?
Thanks for your help,
Jordi.
