Hi!

I am building a wiki to document and collect information on the Palme case
(https://en.wikipedia.org/wiki/Assassination_of_Olof_Palme). The case was
closed two years ago and since then a lot of documents have been released. The
police investigation is one of the three largest in world history. The complete
material is around 60 000 documents consisting of around 1 000 000 pages, of
which we have roughly 5%.
My wiki, https://wpu.nu, collects these documents, OCRs them with Google
Cloud Vision, and publishes them using the Proofread Page extension. This is
done with a Python script running on the server, accessing the wiki via the
API. Some users are also writing "regular pages" and help me sort through the
material and proofread it. This part works very well for the most part.

The wiki is running on a bare metal server with an AMD Ryzen 5 3600 6-core
processor (12 logical cores) and 64 GB of RAM. MariaDB (10.3.34) is used for
the database. I have used Elasticsearch for SMW data and fulltext search, but
have been switching back and forth in my debugging efforts.

From the investigation we have almost 60 000 sections (namespace Uppslag),
22 000 chapters (namespace Avsnitt) and 6 000 documents (namespace Index). The
documents and the Index namespace are handled by the Proofread Page extension.
I have changed the PRP templates to suit the annotation and UI needs of wpu.
For instance, each Index page has a semantic attribute pointing out which
section it is attached to. Between all these pages there are semantic links
that represent relations between the sections. This can be, for instance, the
relation between a person and a specific gun, or an organisation or place.

Each namespace is rendered with its corresponding template, which in turn
includes several other templates. The templates render the UI but also contain
a lot of business logic that adds categories, sets semantic data, etc.

I will use an example to try to explain it better. This is an example of a
section page: https://wpu.nu/wiki/Uppslag:E13-00, which in its header shows
information such as date, document number and relations to other pages. Below
that is the meta-information from the semantic data of the corresponding Index
page, followed by the pages of that document.
The metadata of the page is entered using Page Forms and rendered using the
Uppslag_visning template. I use the Uppslag template to set a few variables
that are used a lot in the Uppslag_visning template. Uppslag_visning also sets
the page's semantic data and categories. A semantic query is used to check
whether there is a corresponding Index page; if so, a template is used to
render its metadata. Another semantic query is used to get the pages of the
Index and render them using template calls in the query.
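
To give a flavour of the kind of query involved (expressed here against the
API rather than as the #ask call in the template, and with a made-up property
name), the lookup of the corresponding Index page is roughly:

  # Ask SMW which Index page points at this section; the property name here
  # is made up for illustration, the real one differs
  curl -G 'https://wpu.nu/api.php' \
       --data-urlencode 'action=ask' \
       --data-urlencode 'format=json' \
       --data-urlencode 'query=[[Tillhör uppslag::Uppslag:E13-00]]|?Modification date'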

Oh, and the skin is custom. It is based on the Pivot skin but changed
extensively.

I have run into a few problems which have led me to question whether
MediaWiki + SMW are the right tools for the job, question my sanity and the
principle of cause and effect :) It is not one specific problem or bug as such.

Naturally, I often make changes to templates used by the 60 000 section pages.
This queues a lot of refreshLinks jobs in the job queue, initially taking a
few hours to clear. I run the jobs as a service and have experimented with the
options to runJobs.php to make good use of the server's resources. I also
optimized the templates to reduce the resources needed for each job (e.g.
using proxy templates to instantiate "variables" to reduce the number of
identical function calls, saving calculated data in semantic properties, etc).
This helped a little.
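
For reference, the job runner service boils down to a loop along these lines
(the install path and the flag values here are just examples; I have been
varying them):

  #!/bin/bash
  # Simplified version of the job runner loop; the service manager restarts
  # it if it dies.
  while true; do
      php /var/www/mediawiki/maintenance/runJobs.php \
          --maxjobs 100 \
          --procs 16
      sleep 2
  done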

I noticed that a large portion of the refreshLinks jobs failed with issues
locking the localisation cache table (l10n_cache). At that point I had the
runJobs.php --maxjobs parameter set quite high, like 100-500, and --procs
around 16 or 32. I lowered --maxjobs to around 5 and the problem seemed
solved. CPU utilization went down, and so did iowait. The job queue still took
a very long time to clear. Looking at the MySQL process list I found that a
lot of time was spent by jobs trying to delete the localisation cache.
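
To see what was going on I mostly watched the job queue and the database,
roughly like this:

  # Job counts per type, to watch the refreshLinks backlog drain (or not)
  php /var/www/mediawiki/maintenance/showJobs.php --group

  # What the job runners are actually doing in MariaDB
  mysql -e "SHOW FULL PROCESSLIST;"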

I switched the localisation cache store to 'array' (and tested 'files' as
well). This sped up the queue processing a bit but caused various errors.
Sometimes the localisation cache data was read as an int(1), and sometimes the
data seemed truncated. Looking at the source I found that the cache file reads
and writes were not protected by any mutex or lock. That allowed one job to
read the LC file while another was writing it, so a truncated file was read. I
implemented locks and exception handling in the LC code so that jobs can
recover should they read corrupted data. I also mounted the $IP/cache dir on a
ram disk. The jobs now went through without LC errors and a bit faster, but...
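
For reference, the ram-disk part is nothing fancy, roughly this (the path is
just an example):

  # tmpfs for the file-based localisation cache in $IP/cache
  mount -t tmpfs -o size=512M tmpfs /var/www/mediawiki/cache

  # Pre-build the cache once up front, so the job runners mostly only need
  # to read it
  php /var/www/mediawiki/maintenance/rebuildLocalisationCache.php --force --threads 4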

MediaWiki must be one of the most battle-tested software systems written by
man. How come they forgot to lock files that are used in a concurrent setting?
I must be doing something wrong here?

Should the jobs be run serially? Why is the LC cleared for each job when the
cache should already be valid? Maybe there is some kind of development and
deployment path that circumvents this problem?

I could have lived with the wiki lagging an hour or so, but I also experience
data inconsistency. For instance, sometimes the query for the Index finds an
index, but the query for Pages finds nothing, so the metadata is filled in on
the page but no document is shown. Sometimes when I purge a page the document
is shown, and if I purge again, it is gone. Data from other sections in the
same chapter is sometimes shown incorrectly, and various sequences of refresh
and purge may fix it. The entire wiki is plagued with this type of
inconsistency, making it a very unreliable source of information for its users.

Any help, tips and pointers would be greatly appreciated.

A first step should probably be to get to a consistent state. Unless there is
a better way, I was thinking of something along these lines (scripts as
shipped with MW 1.35 / SMW 3.2.3; flags mostly from memory):
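
  # Rebuild the link tables for all pages
  php /var/www/mediawiki/maintenance/refreshLinks.php

  # Rebuild the Semantic MediaWiki data store
  php /var/www/mediawiki/extensions/SemanticMediaWiki/maintenance/rebuildData.php -d 50 -v

  # Run the resulting jobs until the queue is empty
  php /var/www/mediawiki/maintenance/runJobs.php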

Best regards,
Simon

PS: I am probably running a refreshLinks or rebuildData when you visit the 
wiki, so the info found there might vary.

__________
Versions:
Ubuntu 20.04 64bit / Linux 5.4.0-107-generic
MediaWiki       1.35.4
Semantic MediaWiki      3.2.3
PHP     7.4.3 (fpm-fcgi)
MariaDB 10.3.34-MariaDB-0ubuntu0.20.04.1-log
ICU     66.1
Lua     5.1.5
Elasticsearch   6.8.23