Greg Rundlett (freephile) wrote:
> The project page:
> It's an extension to MediaWiki that lets you "import a website or web page
> into your wiki".

  "It does this by first "normalizing" the content with HTMLTidy, and
  then "sanitizing" it with Purify and Regular Expressions. Then the
  content is "converted" from HTML to WikiText using Regular Expressions
  and a Parsoid service."

Amazing that such a conversion is even possible, given how problematic
most HTML is. In some ways this job is harder than what browsers do when
parsing HTML, as you aren't just rendering the result, but trying to
extract structure - or semantic meaning - from it.

Does HTMLTidy do a lot of the heavy lifting for you? Do you still end up
with a lot of situations where you have multiple HTML constructs that
map to a single wiki markup construct?
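
To make the many-to-one question concrete, here is a rough sketch of how several HTML tags can collapse onto one wiki markup delimiter. It is not how the extension works; the mapping table and the names INLINE_MARKUP, WikiTextSketch and html_to_wikitext are made up for illustration, and it only handles bold and italic.

```python
from html.parser import HTMLParser

# Hypothetical mapping, for illustration only: <b> and <strong> both become
# MediaWiki bold ('''), and <i> and <em> both become MediaWiki italic ('').
INLINE_MARKUP = {
    "b": "'''",
    "strong": "'''",
    "i": "''",
    "em": "''",
}


class WikiTextSketch(HTMLParser):
    """Collects text, replacing known inline tags with wiki delimiters."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <B> and <b> look the same here.
        if tag in INLINE_MARKUP:
            self.parts.append(INLINE_MARKUP[tag])

    def handle_endtag(self, tag):
        # Bold and italic use the same delimiter to open and to close.
        if tag in INLINE_MARKUP:
            self.parts.append(INLINE_MARKUP[tag])

    def handle_data(self, data):
        self.parts.append(data)


def html_to_wikitext(html):
    """Convert a small HTML fragment; unknown tags are silently dropped."""
    parser = WikiTextSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

The conversion is lossy in one direction: once <b> and <strong> have both become ''', the distinction between them cannot be recovered from the wiki text.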

How does it handle HTML generated or loaded by JS, as is quite common
now? (You might be able to work around that with one of the projects
that use an embedded and programmatically controlled web rendering
engine, like WebKit.)

What are the advantages to implementing this as a plugin rather than a
separate command line tool (which would then support other markup
formats, like Markdown)?

If you couldn't find an existing HTML to wiki markup converter, did you
look for something similar, like a converter to Markdown? A search for
this turns up hits, such as:

with an example:

  pandoc -f html -t markdown

which presumably retrieves content from the given source, specified to
be in HTML format, and outputs Markdown. (It also supports MediaWiki.)
If you use a tool that doesn't support MediaWiki directly, I imagine
that once the content is in Markdown, converting it to MediaWiki is
relatively easy.
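
For the pandoc route, a minimal sketch of scripting the conversion might look like the following. It assumes pandoc is installed and on the PATH; the helper names pandoc_command and convert_html are hypothetical, and the HTML is passed on standard input rather than fetched from a URL.

```python
import subprocess


def pandoc_command(source_format="html", target_format="mediawiki"):
    """Build the pandoc argument list; -f is the input format, -t the output."""
    return ["pandoc", "-f", source_format, "-t", target_format]


def convert_html(html, target_format="mediawiki"):
    """Pipe an HTML string through pandoc and return the converted text.

    Assumes a pandoc binary is available; raises CalledProcessError if
    pandoc exits with a non-zero status.
    """
    result = subprocess.run(
        pandoc_command(target_format=target_format),
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling convert_html(page, target_format="markdown") would give the Markdown variant instead, with page being whatever HTML string you already have in hand.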


Tom Metro
The Perl Shop, Newton, MA, USA
"Predictable On-demand Perl Consulting."
Discuss mailing list
