Greg Rundlett (freephile) wrote:
> The project page: http://www.mediawiki.org/wiki/Extension:Html2Wiki
>
> It's an extension to MediaWiki that lets you "import a website or web page
> into your wiki".
"It does this by first "normalizing" the content with HTMLTidy, and then "sanitizing" it with Purify and Regular Expressions. Then the content is "converted" from HTML to WikiText using Regular Expressions and a Parsoid service." Amazing that such a conversion is even possible, given how problematic most HTML is. In some ways this job is harder than what browsers do when parsing HTML, as you aren't just rendering the result, but trying to extract structure - or semantic meaning - from it. Does HTMLTidy do a lot of the heavy lifting for you? Do you still end up with a lot of situations where you have multiple HTML constructs that map to a single wiki markup construct? How does it handle HTML generated or loaded by JS, as is quite common now? (You might be able to work around that with one of the projects that use an embedded and programmatically controlled web rendering engine, like webkit.) What are the advantages to implementing this as a plugin rather than a separate command line tool (which would then support other markup formats, like Markdown)? If you couldn't find an existing HTML to wiki markup converter, did you look for something similar, like a converter to markdown? A search for this turns up hits, such as: http://johnmacfarlane.net/pandoc/README.html with an example: pandoc -f html -t markdown http://www.fsf.org which presumably retrieves content from http://www.fsf.org, specified to be in HTML format, and outputs Markdown. (It also supports MediaWiki format.) If using a tool that doesn't support MediaWiki directly, once in Markdown, I imagine the conversion to MediaWiki is relatively easy. -Tom -- Tom Metro The Perl Shop, Newton, MA, USA "Predictable On-demand Perl Consulting." http://www.theperlshop.com/ _______________________________________________ Discuss mailing list Discuss@blu.org http://lists.blu.org/mailman/listinfo/discuss