On 2026-03-12 11:18, Jonathan Dowland wrote:
> Hi, I've done an initial play with MoinLight, here are some notes.
>
> The fact it re-uses the actual parser from Moin is very attractive, as
> is the fact it can be run with Python 3. There are a couple of warnings
> generated when used with Python 3.13.5; these look like they're easy
> fixes, e.g.
>
>   /app/moinformat/parsers/moin.py:818: SyntaxWarning: invalid escape sequence '\s'
>     expect("::\s+"))), # :: ws...
For the record, it isn't using Moin's parser at all: it's a completely
rewritten parser because I wanted a document tree which Moin 1.9's
parser couldn't provide.
I will admit to not having tried Python 3.15, relying instead on
whatever Python version Debian stable currently provides, so random
string-escaping warnings are fairly unsurprising given the core
developers' habit of constantly tweaking such things. But in this case,
I imagine that I simply forgot to use a raw string.
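For reference, the fix is the usual one for regular expression patterns.
A minimal illustration (not the actual moinformat code):

```python
import re

# Without a raw string, "\s" is an invalid escape sequence: Python 3.12+
# emits a SyntaxWarning at compile time, although the pattern still
# behaves as intended for now:
#     expect("::\s+")        # warns
# With a raw string, the backslash reaches the re module untouched:
pattern = re.compile(r"::\s+")

print(bool(pattern.match("::   definition")))  # -> True
```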
> I threw it into a container to evaluate, that wasn't hard. I attach the
> Dockerfile I wrote.
>
> The format of wiki data dumps is a little awkward and needs
> pre-processing to be a suitable input. The format looks like (e.g.)
>
> [...]
>
> This isn't anything wrong with MoinLight, but means I need to
> post-process a wiki dump into something more easily consumable by it
> (or anything else). I copied out 200 random pages as a test source:
>
>   ls data/pages/ | sort --random-sort | head -200 | while read page; do
>       rev="$(cat "data/pages/$page/current")"
>       cat "data/pages/$page/revisions/$rev" > "extract/$page"
>   done
I did something similar when experimenting, but MoinLight can be made to
process the current revisions of a wiki data directory using the
appropriate input directory type. Its original purpose was to process a
simple directory hierarchy of pages, but obviously, Moin has a directory
structure which contains page revisions and current revision
information.
Unfortunately, I didn't have time to provide guidance on running the
moinconvert tool on the data dumps, but you can use the --input-dir-type
option with a value of moindirectory to have the tool effectively
provide a converted dump of the current state of a wiki in a publishable
form. Something like this should work:
  ./moinconvert --input-dir data/pages --input-dir-type moindirectory \
      --output-dir html_pages --wikiwords --macros --all
By default, HTML output is produced in a directory hierarchy for
publishing by a Web server, since this was the motivating use-case for
the tool. But you can also obtain a directory hierarchy with appropriate
links that can be viewed in the local filesystem by adding the following
option:
--document-index index.html
Using the moindirectory input type also allows attachments to be
incorporated into the output.
Of course, you can choose a different output format. Quite useful is the
pretty output type which produces document trees for all files:
--output-type pretty
This is like the --tree option which is really just for single
documents. You can then use Unix tools to fairly easily search the
output for things like macros. Unfortunately, I didn't have time to
provide guidance on that either, but I wrote some scripts to determine
macro usage, for instance.
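As a sketch of the kind of script involved: assuming macro nodes appear
in the pretty output with a recognisable label (the "Macro:" label and
the fixture pages below are my inventions for illustration; check the
actual output format), something like this tallies macro usage:

```shell
#!/bin/sh
# Hypothetical fixture: two pretty-printed document trees. The "Macro:"
# label is an assumption; substitute whatever the pretty output uses.
mkdir -p pretty_pages
cat > pretty_pages/FrontPage <<'EOF'
Macro: TableOfContents
Text: 'Welcome'
Macro: RecentChanges
EOF
cat > pretty_pages/SandBox <<'EOF'
Macro: TableOfContents
EOF

# Tally macro names across all pages, most frequent first.
grep -rh '^Macro:' pretty_pages/ | sort | uniq -c | sort -rn
```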
> Converting them was very fast (0m0.818s). I might try the whole corpus
> later.
The only real disadvantage of having the tool do the extraction of
revisions is that it isn't written to support parallel processing,
whereas you can write a shell pipeline to invoke the tool on individual
files with xargs and its multiprocessing option, thus making the
translation go a lot faster.
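For what it's worth, the xargs approach can be sketched like this; the
per-file converter below is a stand-in (upper-casing with tr), since I
haven't spelled out the exact single-file moinconvert invocation here:

```shell
#!/bin/sh
# Sketch of parallel per-file conversion with xargs -P. The converter is
# a placeholder; substitute the real single-file invocation and options.
mkdir -p extract html_pages
printf 'hello\n' > extract/FrontPage
printf 'world\n' > extract/SandBox

# -P 4: up to four processes at once; -I {}: one page name per command.
ls extract | xargs -P 4 -I {} sh -c \
    'tr a-z A-Z < "extract/{}" > "html_pages/{}"'
```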
> Quoting --help (--wikiwords):
>
>   unlike Moin, bare wikiwords do not produce links with this tool
>
> I think it would be *very* useful if they did. Wiki.debian.org relies
> heavily on WikiWords.
I'll probably extend the tool to support them, but such support won't be
enabled by default. I obviously acknowledge that wikis tend to rely on
them due to Moin's default behaviour.
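For the curious, bare-WikiWord detection under the classic CamelCase
convention can be sketched as follows; this regex is my illustration,
not MoinLight's (or Moin's) actual rule:

```python
import re

# Two or more capitalised runs back to back, e.g. FrontPage, SandBox.
# A single run (Python, Moin) is not matched. Purely illustrative: the
# real Moin rule also handles "!" escaping and other edge cases.
WIKIWORD = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")

text = "See FrontPage and SandBox; plain words like Python stay plain."
print(WIKIWORD.findall(text))  # -> ['FrontPage', 'SandBox']
```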
Paul
P.S. Wikiwords are an annoying hangover from the early days of wikis,
presumably introduced by Ward Cunningham himself. Not only do they
encourage bizarre style in documents if not accompanied by explicit link
labels, but inadvertent wikiwords - often things like product names -
obviously cause links to be introduced, and then people have
traditionally liked to create low-information pages for those links,
cluttering any given wiki with what I might charitably call "guff". It
is rather telling that Wikipedia doesn't support them, but I also moved
away from them for my own documentation purposes, not least because I
was fed up prefixing such words with "!" and generally polluting the
source text.
P.P.S. I don't have a great deal of time to support this tool at the
moment. I'll try to give advice, but I have more urgent matters to
attend to.