> they are irrelevant and unavailable to this conversion exercise.  I
> could make them compliant and ignore them.  But since this is more a
> learning exercise then anything...

Fair enough on both counts. :)



> The main features I do not grok is the what role the [^>] plays

That's for if you don't know (or don't want to limit) what attributes
a tag contains.

A tag cannot contain a > character - any necessary ones would be
escaped as &gt;

You could use a non-greedy wildcard like <tag.*?> but I use <tag[^>]*>
as it is more precise.
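
To illustrate the difference, here is a rough sketch in Python (my choice
for the example only; the same patterns should behave the same way in most
regex engines). The sample markup and variable names are made up.

```python
import re

# Hypothetical sample markup, invented for this illustration.
html = '<td class="a">one</td><td>two</td>'

# [^>]* matches any run of characters that are not ">", so it can never
# run past the end of the opening tag.
negated = re.findall(r'<td[^>]*>', html)

# .*? gives the same result here, but only because nothing after it in
# the pattern forces it to keep going.
lazy = re.findall(r'<td.*?>', html)

# Put something after the tag and the difference shows up: the lazy
# wildcard is allowed to expand across ">" until the rest matches.
lazy_two = re.search(r'<td.*?>two', html).group()
negated_two = re.search(r'<td[^>]*>two', html).group()

print(negated)      # ['<td class="a">', '<td>']
print(lazy)         # ['<td class="a">', '<td>']
print(lazy_two)     # <td class="a">one</td><td>two
print(negated_two)  # <td>two
```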



> how to interpret this part in front of the negative look behind; ?:[^/]/

Ah, you're mis-reading that slightly. There are a few parts at play
here, I'll attempt to explain them individually in a simpler
context...

(note, I'm adding spaces purely for readability - pretend there are no
spaces in any of the following examples)


( x | y ) is the standard "x OR y"  - the parentheses are necessary to
prevent the OR from applying to the whole of the expression.

However, using parentheses means that regex will capture the contents
for a backreference. This is not necessary here, so to tell it to
discard the contents, we put ?: inside the parens, giving (?: x | y )
p.s. this also works without the OR operator - just as (?: x )
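
A small sketch of grouping versus capturing, again in Python purely for
illustration (the sample text is made up):

```python
import re

text = "xz yz"

# Without parentheses the OR applies to the whole expression: "x" OR "yz".
whole = re.findall(r'x|yz', text)

# Plain parentheses limit the OR to x|y, and also capture what matched.
captured = re.search(r'(x|y)z', text).groups()

# (?: ... ) limits the OR in the same way but captures nothing.
uncaptured = re.search(r'(?:x|y)z', text).groups()
grouped = re.findall(r'(?:x|y)z', text)

print(whole)       # ['x', 'yz']
print(captured)    # ('x',)
print(uncaptured)  # ()
print(grouped)     # ['xz', 'yz']
```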

The first part of the OR is [^/] which means simply "not /" - putting a
caret (^) as the first character inside brackets negates them.
e.g. [^abc] means "a single character that is neither a, b nor c"
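
A quick sketch of negated character classes (Python used for illustration
only; the sample strings are made up):

```python
import re

# [^abc] matches one character that is none of a, b or c.
not_abc = re.findall(r'[^abc]', 'abcde')

# The caret only negates when it comes first; anywhere else inside the
# brackets it is just a literal ^ character.
literal_caret = re.findall(r'[a^]', 'a^b')

# [^/]+ grabs runs of characters up to (but not including) each slash.
between_slashes = re.findall(r'[^/]+', 'a/b/cd')

print(not_abc)          # ['d', 'e']
print(literal_caret)    # ['a', '^']
print(between_slashes)  # ['a', 'b', 'cd']
```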

Then, there's a negative lookahead which is (?! x ) and is the inverse
of a regular lookahead - i.e. it makes sure the contents of the parens
are NOT there.
As with all lookarounds, it is zero-width - it matches only a position
not actual characters. That is perhaps the key to understanding how
they work - that no characters are ever consumed by a lookaround, but
they still must match against the characters that follow the current
position.

Since a lookaround only matches a position, it needs a preceding
character in the pattern to actually proceed with the match.
For example x (?! y ) will match any x that is not followed by a y (but
it will match only the x and will continue checking the rest of the
pattern from the next character).
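
The zero-width behaviour can be seen in a short sketch (Python for
illustration; the sample text is made up):

```python
import re

text = "xy xz x"

# x(?!y) matches an x only when the next character is not y.
# The lookahead checks what follows but consumes nothing, so each
# match is just the single character x.
found = [(m.start(), m.group()) for m in re.finditer(r'x(?!y)', text)]

# Replacing the matches shows that the character after x is untouched.
replaced = re.sub(r'x(?!y)', 'X', text)

print(found)     # [(3, 'x'), (6, 'x')]
print(replaced)  # xy Xz X
```

Note that the final x matches too: with nothing after it, there is no y
for the lookahead to object to.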

Since I mentioned the non-capturing (?: x ) above, I'll point out that
this command is implicit in all lookarounds - they do not capture
their contents for backreferences.


So, to put all that together, what all this (?: [^/] |  / (?! td> ) )
is actually saying is:
Look for anything that is not a slash OR if you do find a slash only
accept it if it is not followed by the characters "td>", and when you
find either of these don't bother remembering it and just move on.

Or, put more simply, "if you find /td> in this section then stop trying to match"
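
Putting the group to work in a complete pattern might look like the sketch
below. Note the surrounding <td[^>]*> ... </td> pattern, the capturing
parens around the repeated group, and the sample markup are assumptions
made for this example, not the exact expression from earlier in the thread.

```python
import re

# Hypothetical sample markup, invented for this illustration.
html = '<td>a/b <br/> c</td><td class="x">second</td>'

# Assumed wrapper: an opening td tag, then any number of "not a slash,
# or a slash not followed by td>", then the closing tag.
cell = re.compile(r'<td[^>]*>((?:[^/]|/(?!td>))*)</td>')
cells = cell.findall(html)

# For comparison, a greedy wildcard runs straight through the first
# closing tag and swallows both cells in one match.
greedy = re.findall(r'<td[^>]*>(.*)</td>', html)

print(cells)   # ['a/b <br/> c', 'second']
print(greedy)  # ['a/b <br/> c</td><td class="x">second']
```

Slashes inside the cell (a/b, <br/>) are accepted because they are not
followed by td>, while the slash in </td> ends the repetition.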



Hopefully all of that makes sense? Feel free to ask if any part is unclear. :)




-- 
Peter Boughton
//hybridchill.com
//blog.bpsite.net


Archive: http://www.houseoffusion.com/groups/RegEx/message.cfm/messageid:1174