On 2026-03-12 11:18, Jonathan Dowland wrote:
> Hi, I've done an initial play with MoinLight, here are some notes.
>
> The fact it re-uses the actual parser from Moin is very attractive, as
> is the fact it can be run with Python 3. There are a couple of warnings
> generated when used with Python 3.13.5; these look like they're easy
> fixes, e.g.
>
>   /app/moinformat/parsers/moin.py:818: SyntaxWarning: invalid escape sequence '\s'
>     expect("::\s+"))), # :: ws...
For the record, it isn't using Moin's parser at all: it's a completely
rewritten parser because I wanted a document tree which Moin 1.9's
parser couldn't provide.
I will admit to not having tried Python 3.15, relying instead on
whatever Python version Debian stable currently provides, so random
string-escaping warnings are fairly unsurprising given the core
developers' habit of constantly tweaking such things. But in this case,
I imagine that I simply forgot to use a raw string.
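For reference, the fix is the usual one for regular expression patterns.
A minimal illustration (not the actual moinformat code):

```python
import re

# Without a raw string, "\s" is an invalid escape sequence: Python 3.12+
# emits a SyntaxWarning at compile time, although the pattern still
# behaves as intended for now:
#     expect("::\s+")        # warns
# With a raw string, the backslash reaches the re module untouched:
pattern = re.compile(r"::\s+")

print(bool(pattern.match("::   definition")))  # -> True
```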
> I threw it into a container to evaluate, that wasn't hard. I attach the
> Dockerfile I wrote.
>
> The format of wiki data dumps is a little awkward and needs
> pre-processing to be a suitable input. The format looks like (e.g.)
>
> [...]
>
> This isn't anything wrong with MoinLight, but means I need to
> post-process a wiki dump into something more easily consumable by it
> (or anything else). I copied out 200 random pages as a test source:
>
>   ls data/pages/ | sort --random-sort | head -200 | while read page; do
>       rev="$(cat "data/pages/$page/current")"
>       cat "data/pages/$page/revisions/$rev" > "extract/$page"
>   done
I did something similar when experimenting, but MoinLight can be made to
process the current revisions of a wiki data directory using the
appropriate input directory type. Its original purpose was to process a
simple directory hierarchy of pages, but obviously, Moin has a directory
structure which contains page revisions and current revision
information.
Unfortunately, I didn't have time to provide guidance on running the
moinconvert tool on the data dumps, but you can use the --input-dir-type
option with a value of moindirectory to have the tool effectively
provide a converted dump of the current state of a wiki in a publishable
form. Something like this should work:
  ./moinconvert --input-dir data/pages --input-dir-type moindirectory \
      --output-dir html_pages --wikiwords --macros --all
By default, HTML output is produced in a directory hierarchy for
publishing by a Web server, since this was the motivating use-case for
the tool. But you can also obtain a directory hierarchy with appropriate
links that can be viewed in the local filesystem by adding the following
option:
--document-index index.html
Using the moindirectory input type also allows attachments to be
incorporated into the output.
Of course, you can choose a different output format. Quite useful is the
pretty output type which produces document trees for all files:
--output-type pretty
This is like the --tree option which is really just for single
documents. You can then use Unix tools to fairly easily search the
output for things like macros. Unfortunately, I didn't have time to
provide guidance on that either, but I wrote some scripts to determine
macro usage, for instance.
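As a sketch of the kind of script involved: assuming macro nodes appear
in the pretty output with a recognisable label (the "Macro:" label and
the fixture pages below are my inventions for illustration; check the
actual output format), something like this tallies macro usage:

```shell
#!/bin/sh
# Hypothetical fixture: two pretty-printed document trees. The "Macro:"
# label is an assumption; substitute whatever the pretty output uses.
mkdir -p pretty_pages
cat > pretty_pages/FrontPage <<'EOF'
Macro: TableOfContents
Text: 'Welcome'
Macro: RecentChanges
EOF
cat > pretty_pages/SandBox <<'EOF'
Macro: TableOfContents
EOF

# Tally macro names across all pages, most frequent first.
grep -rh '^Macro:' pretty_pages/ | sort | uniq -c | sort -rn
```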
> Converting them was very fast (0m0.818s). I might try the whole corpus
> later.
The only real disadvantage of having the tool do the extraction of
revisions is that it isn't written to support parallel processing,
whereas you can write a shell pipeline to invoke the tool on individual
files with xargs and its multiprocessing option, thus making the
translation go a lot faster.
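For what it's worth, the xargs approach can be sketched like this; the
per-file converter below is a stand-in (upper-casing with tr), since I
haven't spelled out the exact single-file moinconvert invocation here:

```shell
#!/bin/sh
# Sketch of parallel per-file conversion with xargs -P. The converter is
# a placeholder; substitute the real single-file invocation and options.
mkdir -p extract html_pages
printf 'hello\n' > extract/FrontPage
printf 'world\n' > extract/SandBox

# -P 4: up to four processes at once; -I {}: one page name per command.
ls extract | xargs -P 4 -I {} sh -c \
    'tr a-z A-Z < "extract/{}" > "html_pages/{}"'
```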
> Quoting --help (--wikiwords):
>
>   unlike Moin, bare wikiwords do not produce links with this tool
>
> I think it would be *very* useful if they did. Wiki.debian.org relies
> heavily on WikiWords.
I'll probably extend the tool to support them, but such support won't be
enabled by default. I obviously acknowledge that wikis tend to rely on
them due to Moin's default behaviour.
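For the curious, bare-WikiWord detection under the classic CamelCase
convention can be sketched as follows; this regex is my illustration,
not MoinLight's (or Moin's) actual rule:

```python
import re

# Two or more capitalised runs back to back, e.g. FrontPage, SandBox.
# A single run (Python, Moin) is not matched. Purely illustrative: the
# real Moin rule also handles "!" escaping and other edge cases.
WIKIWORD = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")

text = "See FrontPage and SandBox; plain words like Python stay plain."
print(WIKIWORD.findall(text))  # -> ['FrontPage', 'SandBox']
```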
Paul
P.S. Wikiwords are an annoying hangover from the early days of wikis,
presumably introduced by Ward Cunningham himself. Not only do they
encourage bizarre style in documents if not accompanied by explicit link
labels, but inadvertent wikiwords - often things like product names -
obviously cause links to be introduced, and then people have
traditionally liked to create low-information pages for those links,
cluttering any given wiki with what I might charitably call "guff". It
is rather telling that Wikipedia doesn't support them, but I also moved
away from them for my own documentation purposes, not least because I
was fed up prefixing such words with "!" and generally polluting the
source text.
P.P.S. I don't have a great deal of time to support this tool at the
moment. I'll try to give advice, but I have more urgent matters to
attend to.