RE: [Jprogramming] Regex crashes in J6.01c

Henry Rich Tue, 26 Dec 2006 09:21:06 -0800

I am trying to verify that when I read a web page, I got all
the HTML before I process it.  The page might contain
something like a transaction-number, and if the connection
was lost, I want to go ahead and process the data if it
is valid, even if the overall transfer failed.  Also, some
sites just occasionally stop in the middle of sending data
and I need to catch that.


So my pattern does what I need: it matches a start tag
anywhere and makes sure that an end tag follows it.
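
A minimal sketch of that completeness check, written with Python's re module rather than J's regex library, so flag support and NUL handling may differ from the J build discussed in this thread. The name has_complete_html and the sample strings are hypothetical.

```python
import re

# Start tag anywhere, then any characters (newlines included), then an end tag.
# re.DOTALL plays the role of (?s); re.IGNORECASE plays the role of (?i).
COMPLETE_HTML = re.compile(r"<html>.*</html>", re.IGNORECASE | re.DOTALL)

def has_complete_html(page: str) -> bool:
    """Return True if the text contains <html> followed later by </html>."""
    return COMPLETE_HTML.search(page) is not None

# Hypothetical inputs: one complete transfer, one cut off mid-page.
full = "header junk\n<HTML>\n<body>txn 12345</body>\n</HTML>\n"
cut_off = "header junk\n<html>\n<body>txn 123"

print(has_complete_html(full))     # complete page
print(has_complete_html(cut_off))  # truncated page
```

The search is deliberately unanchored, so text before the start tag or after the end tag does not affect the result.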

As I noted in an earlier reply, I'm not trying to match
NUL; I was just trying to rig up a way to do multiline
matching (in other words, to emulate (?s).* ).
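
To illustrate the emulation, here is a small sketch in Python's re module (an assumption; it is not J's PCRE binding, and the sample text is made up). It compares a plain dot, the (?s) flag, and the negated-class workaround on a multiline string.

```python
import re

text = "<html>\nline one\nline two\n</html>"

# Without a dot-all flag, '.' stops at a newline, so this fails on multiline input.
plain_dot = re.search(r"(?i)<html>.*</html>", text)

# (?s) makes '.' match newlines as well.
dot_all = re.search(r"(?is)<html>.*</html>", text)

# The workaround: a negated class matches every character except the one listed
# (here NUL), newlines included, so it acts like (?s). on NUL-free text.
not_nul = re.search(r"(?i)<html>[^\0]*</html>", text)

# Another common spelling of "any character at all".
any_char = re.search(r"(?i)<html>[\s\S]*</html>", text)

print(plain_dot is None, dot_all is not None, not_nul is not None, any_char is not None)
```

On text that contains no NUL characters, the last three patterns match the same span; only the plain dot fails.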

Thanks very much for the quick, accurate, and well-informed
help.

Henry Rich 

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Dan Bron
> Sent: Tuesday, December 26, 2006 12:00 PM
> To: [email protected]
> Subject: Re: [Jprogramming] Regex crashes in J6.01c
> 
> I should also point out:
> 
>          stringhashtml 'embedded <html>...</html> html'
>       1
> 
> with either definition of  stringhashtml  (mine was written 
> to emulate yours).
> 
> I cite all these examples only so you're aware of what you're matching
> against.  If the string must start with <html> and end with </html>,
> you'd have to write:
> 
>          (?i)^<html>[^\0]*</html>$
> 
> instead (and it could still have [incorrectly] nested <html> tags).
> 
> Also, I don't see why you bother matching against the closing tag,
> because you don't use capturing parens (back references).  Do you want
> to ensure "well formed" HTML?  If so, do you parse the string later?
> 
> Or is it perhaps that you want to ensure that the HTML tags enclose
> SOMETHING, even if it's only a single character?  If so, you'd have to
> replace the '*' with a '+', i.e.:
> 
>          (?i)<html>[^\0]+</html>
> 
> (I noted the '*' in your original expression, but I also saw an empty
> character class '[]', which I didn't understand, but thought might be
> an attempt at "match something".)
> 
> If you don't care about the closing tag, and want only to ensure the
> opening tag is not followed by nulls, you can make the expression even
> simpler (faster):
> 
>          (?i)<html>[^\0]*
> 
> (if it has to match at the beginning of the string, add the  
> ^  as above).
> 
> One further, but important note:  apparently the regex library treats
> input strings as null-terminated.  This is either a bug in PCRE or J's
> interface to it.  So, expressions that guard against nulls are doomed.
> To wit:
> 
>          stringhashtml   'A'  ,  '<html>foo</html>'
>       1
>          stringhashtml   'A'  ,~ '<html>foo</html>'
>       1
>          stringhashtml ({.a.) ,  '<html>foo</html>'
>       0
>          stringhashtml ({.a.) ,~ '<html>foo</html>'
>       1
> 
> I only thought to check this because I ran into this bug years ago.
> Kirk Iverson fixed it in my local installation, but I have no idea
> what it was, and I no longer have access to it (besides, it was before
> the switch to the PCRE library).
> 
> -Dan
> 
> ----------------------------------------------------------------------
> For information about J forums see 
> http://www.jsoftware.com/forums.htm
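
The null-termination behavior described in the quoted message can be mimicked with a short Python sketch. This is a simulation under stated assumptions, not J's actual interface: it assumes stringhashtml does an unanchored search with (?i)<html>[^\0]*</html>, and it models a C-string interface by cutting the input at the first NUL. The function names are hypothetical.

```python
import re

PATTERN = re.compile(r"(?i)<html>[^\0]*</html>")
NUL = "\0"
HTML = "<html>foo</html>"

def matches_full(s: str) -> int:
    """Search the whole string, NULs included."""
    return int(PATTERN.search(s) is not None)

def matches_c_string(s: str) -> int:
    """Search only up to the first NUL, as a C-string interface would (assumed)."""
    return int(PATTERN.search(s.split(NUL, 1)[0]) is not None)

# Same four inputs as the quoted J session: 'A' or NUL, prepended or appended.
cases = ["A" + HTML, HTML + "A", NUL + HTML, HTML + NUL]
full_results = [matches_full(s) for s in cases]
c_string_results = [matches_c_string(s) for s in cases]

print(full_results)
print(c_string_results)
```

The truncating variant gives 1 1 0 1, the same results as the quoted session, while a search over the full string gives 1 1 1 1. That is consistent with the input being cut at the first NUL before the pattern is ever applied.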
