On 2026-03-12 11:18, Jonathan Dowland wrote:
> Hi, I've done an initial play with MoinLight; here are some notes.

> The fact it re-uses the actual parser from Moin is very attractive, as
> is the fact it can be run with Python 3. There are a couple of warnings
> generated when used with Python 3.13.5; these look like they're easy
> fixes. e.g.
>
> /app/moinformat/parsers/moin.py:818: SyntaxWarning: invalid escape sequence '\s'
>   expect("::\s+"))),                              # :: ws...

For the record, it isn't using Moin's parser at all: it's a completely rewritten parser because I wanted a document tree which Moin 1.9's parser couldn't provide.

I will admit to not having tried Python 3.13.5, relying instead on whatever Debian stable currently provides, so random string escaping warnings are fairly unsurprising given the developers' habit of constantly tweaking such things. But in this case, I imagine that I simply forgot to use a raw string: the pattern should be written as expect(r"::\s+").

> I threw it into a container to evaluate; that wasn't hard. I attach the
> Dockerfile I wrote.

> The format of wiki data dumps is a little awkward and needs
> pre-processing to be a suitable input. The format looks like (e.g.)
>
> [...]
>
> This isn't anything wrong with MoinLight, but means I need to
> post-process a wiki dump into something more easily consumable by it (or
> anything else). I copied out 200 random pages as a test source:
>
> ls data/pages/ | sort --random-sort | head -n 200 | while read page; do
>   rev="$(cat "data/pages/$page/current")"
>   cat "data/pages/$page/revisions/$rev" > "extract/$page"
> done

I did something similar when experimenting, but MoinLight can be made to process the current revisions of a wiki data directory using the appropriate input directory type. Its original purpose was to process a simple directory hierarchy of pages, whereas Moin's own directory structure contains page revisions and current revision information.

Unfortunately, I didn't have time to provide guidance on running the moinconvert tool on the data dumps, but you can use the --input-dir-type option with a value of moindirectory to have the tool effectively provide a converted dump of the current state of a wiki in a publishable form. Something like this should work:

./moinconvert --input-dir data/pages --input-dir-type moindirectory --output-dir html_pages --wikiwords --macros --all

By default, HTML output is produced in a directory hierarchy for publishing by a Web server, since this was the motivating use-case for the tool. But you can also obtain a directory hierarchy with appropriate links that can be viewed in the local filesystem by adding the following option:

  --document-index index.html

Using the moindirectory input type also allows attachments to be incorporated into the output.

Of course, you can choose a different output format. The pretty output type is quite useful, producing document trees for all files:

  --output-type pretty

This is like the --tree option, which is really just for single documents. You can then use Unix tools to search the output fairly easily for things like macros. Unfortunately, I didn't have time to provide guidance on that either, but I wrote some scripts to determine macro usage, for instance.
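A macro tally would look something like the sketch below. Note that the "Macro" node label is only a guess at what the pretty format emits, so inspect one output file first to confirm the actual spelling; the stand-in files created here just keep the sketch self-contained:

```shell
# Stand-in data imitating pretty-printed trees, since the real node label
# is an assumption -- check actual moinconvert output before relying on it.
mkdir -p pretty_pages
printf 'Macro: TableOfContents\nText: hello\n' > pretty_pages/PageOne
printf 'Macro: TableOfContents\nMacro: Include\n' > pretty_pages/PageTwo

# Count occurrences of each macro node, most frequent first.
grep -rh '^Macro:' pretty_pages | sort | uniq -c | sort -rn
```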

> Converting them was very fast (0m0.818s). I might try the whole corpus
> later.

The only real disadvantage of having the tool extract the revisions itself is that it isn't written to support parallel processing. However, you can write a shell pipeline that invokes the tool on individual files using xargs and its parallel processing option, making the translation go a lot faster.
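The pattern would look something like the sketch below. The tr command merely stands in for the real per-file ./moinconvert invocation (whose exact single-file options you would need to check against --help), so treat this as an illustration of the xargs fan-out rather than a ready-made command:

```shell
# Sketch of the xargs fan-out: process one page file per invocation, up to
# four in parallel (-P 4). The `tr a-z A-Z` command is only a stand-in for
# the real per-file ./moinconvert call, so the sketch runs as-is.
mkdir -p extract html_pages
printf 'hello\n' > extract/FrontPage
printf 'world\n' > extract/SideBar

ls extract | xargs -P 4 -I{} sh -c \
  'tr a-z A-Z < "extract/$1" > "html_pages/$1"' _ {}
```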

> Quoting --help (--wikiwords):
>
> unlike Moin, bare wikiwords do not produces links with this tool
>
> I think it would be *very* useful if they did. Wiki.debian.org relies
> heavily on WikiWords.

I'll probably extend the tool to support them, but such support won't be enabled by default. I obviously acknowledge that wikis tend to rely on them due to Moin's default behaviour.

Paul

P.S. Wikiwords are an annoying hangover from the early days of wikis, presumably introduced by Ward Cunningham himself. Not only do they encourage bizarre style in documents if not accompanied by explicit link labels, but inadvertent wikiwords - often things like product names - obviously cause links to be introduced, and then people have traditionally liked to create low-information pages for those links, cluttering any given wiki with what I might charitably call "guff". It is rather telling that Wikipedia doesn't support them, but I also moved away from them for my own documentation purposes, not least because I was fed up prefixing such words with "!" and generally polluting the source text.

P.P.S. I don't have a great deal of time to support this tool at the moment. I'll try and give advice, but I have more urgent matters to attend to.
