wget css parsing

Ted Mielczarek Tue, 27 Jun 2006 07:46:11 -0700

Hello,

I have implemented a simple CSS parser in wget to handle things like @import rules and background-image: url(). It is mostly just a lexical scanner (implemented using flex) with a very dumb parser on top of it. The flex source comes directly from the CSS 2.1 spec: http://www.w3.org/TR/CSS21/grammar.html#q2, so it should handle almost anything (Unicode excluded). It definitely needs more testing: it works for my simple test cases, but I'm sure there are plenty of ways to break it. You will find my source tree, a diff, and an explanation of my changes here:
http://ted.mielczarek.org/code/wget-modified/
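
For readers who want a feel for what extracting url() targets from CSS involves, here is a minimal hand-written sketch. It is not the flex scanner from the patch: the function name next_css_url and its signature are hypothetical, it matches only the lowercase spelling "url(", and it ignores escapes, comments, Unicode, and the @import "file.css" string form, all of which a real CSS 2.1 tokenizer has to deal with.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the patch): find the next url(...) token at
   or after p, copy its target into out (at most outsz - 1 bytes), and return
   a pointer just past the closing ')'.  Returns NULL when no complete
   url(...) token remains. */
static const char *
next_css_url (const char *p, char *out, size_t outsz)
{
  assert (p != NULL && out != NULL && outsz > 0);

  const char *start = strstr (p, "url(");
  if (start == NULL)
    return NULL;
  start += 4;

  /* Skip leading whitespace inside the parentheses. */
  while (*start && isspace ((unsigned char) *start))
    start++;

  const char *end;
  if (*start == '"' || *start == '\'')
    {
      /* Quoted form: the target runs up to the matching quote. */
      char quote = *start++;
      end = strchr (start, quote);
      if (end == NULL)
        return NULL;
    }
  else
    {
      /* Unquoted form: the target runs up to ')' minus trailing whitespace. */
      end = strchr (start, ')');
      if (end == NULL)
        return NULL;
      while (end > start && isspace ((unsigned char) end[-1]))
        end--;
    }

  /* Copy the target, truncating if the caller's buffer is too small. */
  size_t len = (size_t) (end - start);
  if (len >= outsz)
    len = outsz - 1;
  memcpy (out, start, len);
  out[len] = '\0';

  /* Resume after the closing parenthesis, or at end of input if missing. */
  const char *close = strchr (end, ')');
  return close ? close + 1 : end + strlen (end);
}
```

Calling it in a loop, feeding the returned pointer back in as p, walks every url() in a stylesheet; each extracted target would still need to be resolved against the stylesheet's own URL before being queued for download.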

I made my changes against the 1.10 branch, because I intend to use this on a machine running Debian stable and I wanted to minimize other problems. If there is interest I will port the changes to trunk. The diff does not include the few new files I created, but they're in the src directory:

  css-tokens.h - an enum of CSS lexical tokens
  css-url.c    - analogous to html-url.c
  css-url.h    - its header file
  css.lex      - the flex source

I did have to hack the HTML parser a bit to make this work properly: I added a tag stack to keep track of opening tags so I could handle the contents of <style> tags.
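
The tag-stack idea can be sketched roughly as follows. This is an illustration under assumptions, not the code from the patch: the struct, the function names, and the fixed size limits are all hypothetical, and tag names are assumed to be lowercased already by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TAG_STACK_MAX 64   /* assumed nesting limit for this sketch */
#define TAG_NAME_MAX  16   /* assumed maximum tag-name length, incl. NUL */

/* Hypothetical stack of currently open tag names. */
struct tag_stack
{
  char names[TAG_STACK_MAX][TAG_NAME_MAX];
  int depth;
};

static void
tag_stack_init (struct tag_stack *s)
{
  assert (s != NULL);
  s->depth = 0;
}

/* Record an opening tag.  Tags beyond the nesting limit are silently
   dropped, and over-long names are truncated. */
static void
tag_stack_push (struct tag_stack *s, const char *name)
{
  assert (s != NULL && name != NULL);
  if (s->depth >= TAG_STACK_MAX)
    return;
  strncpy (s->names[s->depth], name, TAG_NAME_MAX - 1);
  s->names[s->depth][TAG_NAME_MAX - 1] = '\0';
  s->depth++;
}

/* Handle a closing tag: unwind to the nearest matching open tag, discarding
   any unclosed tags above it.  A close with no matching open is ignored. */
static void
tag_stack_pop (struct tag_stack *s, const char *name)
{
  assert (s != NULL && name != NULL);
  for (int i = s->depth - 1; i >= 0; i--)
    if (strcmp (s->names[i], name) == 0)
      {
        s->depth = i;
        return;
      }
}

/* True when the innermost open tag is <style>, i.e. when the text being
   read should be handed to the CSS parser. */
static bool
tag_stack_in_style (const struct tag_stack *s)
{
  assert (s != NULL);
  return s->depth > 0 && strcmp (s->names[s->depth - 1], "style") == 0;
}
```

An HTML parser would call tag_stack_push on each opening tag and tag_stack_pop on each closing tag, then check tag_stack_in_style when it meets text content. Unwinding to the nearest match on close is one way to tolerate sloppy markup with unclosed tags; a stricter parser might treat a mismatch as an error instead.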

There's probably some mess left over from debugging and whatnot, but I'd appreciate it if people would take a look at this, play with it, and give me some feedback. Please CC me on any replies as I'm not subscribed to this mailing list.

Regards,
-Ted
