[WWWOFFLE-Users] Modifying and censoring HTML

Paul A. Rombouts
Tue, 16 May 2006 22:13:40 -0700

One of the main reasons I keep using WWWOFFLE is that it is very useful
for avoiding ads. Unfortunately, the features offered by the DontGet and
ModifyHTML sections are insufficient to get rid of ads that are written
into the webpage itself, rather than referenced through external links
that can be blocked by a DontGet list. One example that I find
particularly annoying is the set of "Linux Reference Center" ads
sponsored by Microsoft on linuxtoday.com.

When I happened to learn about a Firefox extension called Greasemonkey,
which makes it possible to run user-written JavaScript code to change
webpage content, it occurred to me that it could also be used to remove
almost any advertisement. This turned out to work quite well. One
shortcoming is that ads are removed only after a page has loaded, so
they can often still be seen briefly.

The biggest shortcoming, though, was that I was forced to use Firefox to
enjoy this feature. And to my chagrin, when I upgraded to Firefox 1.5,
most of the Greasemonkey scripts that I had written were broken for some
reason.

One day, when I was looking at the source code in src/htmlmodify.l that
implements the add-cache-info feature, I had a really neat idea. I
could add a new feature to WWWOFFLE that would let me insert HTML code
from a local file into the page being modified. By putting my own
JavaScript inside this HTML code, I could get most of the possibilities
of Greasemonkey in any browser that supports JavaScript.
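
A minimal sketch of the insertion idea, written in Python for brevity
and not taken from the actual patch. The function name insert_snippet
and the choice to place the snippet just before the closing </body> tag
are assumptions made for illustration; where the real insert-file
option places the file contents is not stated here.

```python
import re


def insert_snippet(page_html: str, snippet: str) -> str:
    """Return page_html with snippet placed just before the last </body> tag.

    Hypothetical helper: if the page has no </body> tag, the snippet is
    appended to the end of the page instead.
    """
    matches = list(re.finditer(r"</body\s*>", page_html, flags=re.IGNORECASE))
    if not matches:
        return page_html + snippet
    pos = matches[-1].start()
    return page_html[:pos] + snippet + page_html[pos:]


# The snippet stands in for the contents of a local include file.
page = "<html><body><p>News</p></BODY></html>"
snippet = "<script>/* user script goes here */</script>"
print(insert_snippet(page, snippet))
```

Because the script arrives as part of the page itself, it runs in any
browser that executes JavaScript, without a browser extension.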

I have called the new option "insert-file", and it is included in the
patch that I publish on my WWWOFFLE webpage
http://www.phys.uu.nl/~rombouts/wwwoffle.html . I also have an example
of an include file that I use to get rid of the linuxtoday.com ads
that I mentioned above:
http://www.phys.uu.nl/~rombouts/wwwoffle/linuxtoday_adblocker.html.txt

The latter example script uses the XPath (http://www.w3.org/TR/xpath)
evaluator that is available in Mozilla-based browsers (at least in the
fairly recent versions). I find that XPath is a very useful language for
expressing which parts of an HTML tree I want to get rid of. That is why
I am seriously thinking about implementing a feature in WWWOFFLE that
allows you to censor HTML on the basis of XPath expressions. The feature
would allow you to add something like this to the ModifyHTML section:

<http://*example.com> censor-html = //[EMAIL PROTECTED]'ad_content' or @class='adbox']

This would, for example, get rid of everything starting with a <div
id="xyz" class="ad_content"> tag up to and including the corresponding
</div> end tag, before the content even reaches the browser.
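
To make the intended effect concrete, here is a rough sketch in Python
using the standard xml.etree.ElementTree module. It is not how the
feature would be implemented inside WWWOFFLE. The function name censor
is hypothetical, the input is assumed to be well-formed XML (real-world
HTML often is not), and the expressions assume that <div> elements are
selected by their class attribute. ElementTree supports only a subset
of XPath without the "or" operator, so each class gets its own
expression.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def censor(markup: str, xpaths: Iterable[str]) -> str:
    """Remove every element matched by any of the XPath expressions."""
    root = ET.fromstring(markup)
    # ElementTree elements do not know their parent, so build a lookup map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for xpath in xpaths:
        for element in root.findall(xpath):
            parent = parent_of.get(element)
            if parent is None:
                continue  # the root element itself is never removed
            # Text following the removed element (its "tail") must be kept,
            # so move it to the previous sibling or to the parent's text.
            siblings = list(parent)
            index = siblings.index(element)
            tail = element.tail or ""
            if tail:
                if index > 0:
                    previous = siblings[index - 1]
                    previous.tail = (previous.tail or "") + tail
                else:
                    parent.text = (parent.text or "") + tail
            parent.remove(element)
    return ET.tostring(root, encoding="unicode")


page = (
    '<html><body>'
    '<div id="xyz" class="ad_content"><p>Buy now</p></div>'
    '<p>Real news</p>'
    '<div class="adbox">Ad</div> trailing text'
    '</body></html>'
)
rules = [".//div[@class='ad_content']", ".//div[@class='adbox']"]
print(censor(page, rules))
```

Each matched element is dropped together with everything nested inside
it, while the text that follows it in the page is preserved.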

The "censor-html" feature is still only in the planning stage.
Nevertheless, I would be interested to know what people on this mailing
list think of this possibility.

--
Paul A. Rombouts