-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 23.02.2015 04:27, travis+ml-lang...@subspacefield.org wrote:
> Quote: This post is part of a series
> <http://danlec.com/blog/hacking-stackoverflow-com> describing the
> 33 security vulnerabilities I reported to
> stackoverflow.com <http://stackoverflow.com/> from 2009-2013.
> This particular exploit was reported and fixed in 2009.
> http://danlec.com/blog/hacking-stackoverflow-com-s-html-sanitizer
>
> Funny:
> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
>
> Accurate: I think the flaw here is that HTML is a Chomsky Type 2
> grammar (context free grammar)
> <http://en.wikipedia.org/wiki/Context-free_grammar> and
> RegEx is a Chomsky Type 3 grammar (regular grammar)
> <http://en.wikipedia.org/wiki/Regular_grammar>. Since a
> Type 2 grammar is fundamentally more complex than a Type 3 grammar
> (see the Chomsky hierarchy
> <http://en.wikipedia.org/wiki/Chomsky_hierarchy>), you
> can't possibly make this work. But many will try, some will claim
> success and others will find the fault and totally mess you up.
>

Hi,

thanks for this information.

While reading up on this topic I stumbled over this article:
http://blog.codinghorror.com/parsing-html-the-cthulhu-way/

It claims that Perl has "solved" this problem with "HTML::Sanitizer":
http://search.cpan.org/~nesting/HTML-Sanitizer-0.04/Sanitizer.pm (dead link)

That module doesn't seem to be around anymore, so I searched and found
this one:
http://search.cpan.org/~nigelm/HTML-Scrubber-0.11/lib/HTML/Scrubber.pm

Its documentation says this about how it parses the HTML:

"I wasn't satisfied with HTML::Sanitizer because it is based on
HTML::TreeBuilder, so I thought I'd write something similar that works
directly with HTML::Parser."

OK, let's look at "HTML::Parser":
http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.71/Parser.pm

It requires "HTML::Entities", so let's look there first:
http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.71/lib/HTML/Entities.pm

and there it is:

    sub encode_entities
    {
        return undef unless defined $_[0];
        my $ref;
        if (defined wantarray) {
            my $x = $_[0];
            $ref = \$x;     # copy
        } else {
            $ref = \$_[0];  # modify in-place
        }
        if (defined $_[1] and length $_[1]) {
            unless (exists $subst{$_[1]}) {
                # Because we can't compile regex we fake it with a cached sub
                my $chars = $_[1];
                $chars =~ s,(?<!\\)([]/]),\\$1,g;
                $chars =~ s,(?<!\\)\\\z,\\\\,;
                my $code = "sub {\$_[0] =~ s/([$chars])/\$char2entity{\$1} || num_entity(\$1)/ge; }";
                $subst{$_[1]} = eval $code;
                die( $@ . " while trying to turn range: \"$_[1]\"\n "
                  . "into code: $code\n "
                ) if $@;
            }
            &{$subst{$_[1]}}($$ref);
        } else {
            # Encode control chars, high bit chars and '<', '&', '>', ''' and '"'
            $$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;
        }
        $$ref;
    }

I don't really know much about Perl, but I think in the end this also
uses regex to try to parse HTML, doesn't it?

I asked myself how Python does this (if you've got a problem, it is
almost guaranteed that Python has a module which solves it).
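
One way to make the "regex versus parser" question concrete for the
Python 3 module is a small experiment. This is only a sketch of my own,
not code from any of the modules mentioned here; the names DepthTracker,
VOID_TAGS and check_nesting are made up for illustration. It assumes
that html.parser.HTMLParser only reports start tag and end tag events,
so anything that depends on nesting has to be tracked by the caller,
here with a plain list used as a stack:

```python
from html.parser import HTMLParser

# Elements that never take an end tag; they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DepthTracker(HTMLParser):
    """Hypothetical helper: records tag nesting with an explicit stack."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, innermost last
        self.max_depth = 0     # deepest nesting seen so far
        self.mismatches = []   # end tags that did not match the stack top

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.mismatches.append(tag)


def check_nesting(markup):
    """Return (max depth, tags left open, unmatched end tags)."""
    tracker = DepthTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.max_depth, tracker.stack, tracker.mismatches
```

For example, check_nesting("<div><p>x</div>") reports the stray </div>
and leaves both tags on the stack. The nesting bookkeeping lives in that
stack, whatever the tokenizer underneath uses to find the tags.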

It turns out Python's standard library does it like this:

python 2: https://hg.python.org/cpython/file/2.7/Lib/HTMLParser.py
python 3: https://hg.python.org/cpython/file/3.4/Lib/html/parser.py

I only had a quick look at the code, so I might be wrong, but it looks
to me as if all of these modules do use regex in the end to parse HTML.

What do you think? Maybe I should inform these programmers?

kind regards
Sven
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQGcBAEBAgAGBQJU7vjkAAoJEAq0kGAWDrql0swMAJ5sdbWGawjG/sO+suUxguD0
py/ASDhpeiESUtRhri9B5cZbNB/29t0v62MqXIt9Lez1a2v6A2NqK/6fyk4YgGKf
cjXRrGJaXERFE2HWK+G0MStZK3drY6T22YMFtIImdfQZ3ERx4gz+UwRI+z0aXD6i
1Ew9Nu6ftU2fmJoDek+SqEJm0Q2vviWoH0drzjMU55VfNhhfeOQLjCV+9ocXoAQo
apDG5v4jGV0AOW6vyCtwKfsxTaK7bcBwFwdHz+sGVQ6LFNTRV6K5yN/hxY0gw/s1
k5cw4/7SpmDOft1HLzdqOnBSxn0EjIj3MELyj0zbsCiler76ytiTDFjGWkrYsZce
LHC/OYiH6EoCoNb8cImR6XNw7/4mtHMwpT/PuD+bvYUMKSrUD/VL+tDhdeETUtMA
k7o2ZS3SHqRn1Dhzg1K/EVOQIOMnKm+768M0y4rwnH9xorKFE9wQopBJs+Kx2Igq
3LxzTPJST3xI1HQmnn2mcaBP+bfphQydJHKL4Vxfvw==
=W3nR
-----END PGP SIGNATURE-----
_______________________________________________
langsec-discuss mailing list
langsec-discuss@mail.langsec.org
https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss