Unfortunately there is not an easy solution to this problem, see below.

Leandro Fanzone writes:
Hello, I have an installation of pmwiki on a Fedora Core 4 server, and I decided to migrate it to Ubuntu 12.04. As I did not want to install pmwiki again, I just copied /var/www to the new machine and installed Apache + PHP. As a result, some pages that had titles with Spanish letters (á, ñ, etc.) cannot be accessed anymore. I see that the files do exist (albeit they have the special letters changed somehow) but when I try to open those pages pmwiki cannot find them. For example: a page called "Documentación" exists in the filesystem as "Documentaci?n", but pmwiki tries to access it as "DocumentaciN". It seems an encoding problem, apparently the contents are stored in Latin1 (ISO-8859-1), and in the filenames sometimes the special letters were changed with ? and sometimes they keep the Latin1 letter, but for some reason pmwiki does not generate the same filename as before to access them. I am completely lost, I don't know if this is a configuration problem of PHP, of Apache, of the LANG variable...

This is likely a problem of the filesystem encoding (charset). It is possible that the older server had a different filesystem encoding than the new one.

A charset (character set) is a set of rules defining the byte or bytes used to represent different letters, characters and symbols. Different charsets generally use the same bytes for the plain Roman/Latin letters (ASCII) and the most used punctuation symbols, but international letters like "ó" may be "tied" to different bytes in different charsets. If your filenames contain such characters, there is no guarantee that you'll be able to copy them without errors from one filesystem to another.

PmWiki (actually PHP) doesn't care much about the charset: it just processes the stream of bytes, whatever the charset.

So if your wiki content is in Latin1 and PmWiki creates a link to a page "Documentación", it will look for a filename which is the stream of bytes 68, 111, 99, 117, 109, 101, 110, 116, 97, 99, 105, 243, 110, where the "ó" character is byte number 243.

If in your directory there is no such filename, PmWiki will show a link as if the page doesn't exist.

The Unicode/UTF-8 charset defines "ó" as two consecutive bytes, 195 and 179, which are obviously not the same.
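To see the difference for yourself, here is a minimal shell sketch that prints the byte values of the same page name in both encodings. The helper name bytes_of is made up for this illustration; it only assumes a POSIX shell with printf and od. The names are built with octal escapes so the result does not depend on the encoding of your terminal.

```shell
# Hypothetical helper: print the decimal value of every byte in a string.
bytes_of() {
  # od prints the bytes as unsigned decimals; the unquoted substitution
  # splits them into words, and "$*" joins them with single spaces.
  set -- $(printf '%s' "$1" | od -An -v -tu1)
  echo "$*"
}

latin1_name=$(printf 'Documentaci\363n')     # "ó" as the single Latin1 byte 243
utf8_name=$(printf 'Documentaci\303\263n')   # "ó" as the UTF-8 pair 195 179

bytes_of "$latin1_name"   # 68 111 99 117 109 101 110 116 97 99 105 243 110
bytes_of "$utf8_name"     # 68 111 99 117 109 101 110 116 97 99 105 195 179 110
```

The two names look identical on screen when shown in the right charset, but as far as the filesystem is concerned they are two different filenames.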

When you copy files from one filesystem to another, there may be two cases. Either (A) your copy program is aware of the two charsets and recodes the actual letters to the correct bytes, or (B) it is not aware of the charsets, tries to copy the files and tells the new filesystem "this file is named this string of bytes: 68, 111, 99, 117, 109, 101, 110, 116, 97, 99, 105, 243, 110". That name (B1) may or (B2) may not be accepted by the new filesystem, e.g. because that stream of bytes is not valid UTF-8.

In case of (A) you'll be able to see the correct filenames when you browse your filesystem, but PmWiki may be unable to find the files as it expects different byte streams/positions.

In case of (B1) PmWiki should be able to find its filenames and it should work like before, but when you browse your filesystem, you may see weird characters.

In case of (B2) neither you, nor PmWiki see the correct filenames with international characters. It looks as if you are in this case.
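If you are not sure which case you are in, a rough check along these lines may help. classify_name is a hypothetical helper, not something PmWiki provides; the sketch assumes iconv is installed and that a name which is not valid UTF-8 is most likely still in a single-byte charset such as Latin1.

```shell
# Hypothetical helper: say how a filename is encoded, judging by its bytes.
classify_name() {
  # Anything outside printable ASCII (space to tilde) counts as non-ASCII.
  if printf '%s' "$1" | LC_ALL=C grep -q '[^ -~]'; then
    # iconv fails on byte sequences that are not valid UTF-8.
    if printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
      echo "utf8"
    else
      echo "other"   # non-ASCII but not UTF-8, e.g. Latin1 bytes kept as-is
    fi
  else
    echo "ascii"
  fi
}

# Possible use, run from the directory that contains wiki.d:
#   for f in wiki.d/*; do echo "$(classify_name "${f##*/}") ${f##*/}"; done
```

Names reported as "utf8" suggest case (A), names reported as "other" suggest case (B1), and names that come out as "ascii" but contain a literal "?" are probably the damaged ones from case (B2).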

Note, Pagelists/searches use a different approach than links. A link to a page asks if there is such a file, while a pagelist/search will list all files in the wiki.d directory and will try to process them - if a file is named "Documentaci?n", the "?" character is not allowed in a pagename so PmWiki tries to deduce an allowed pagename and it can list "DocumentaciN".

I think I can just change every filename to match pmwiki,

Try with 1-2 files first to see if it works, because you'll be in case (A) above and PmWiki may still not be able to locate them.

but on one hand that implies a lot of work, and on the other, the titles that has special characters are changed as well, which looks horrible.

What does "looks horrible" mean? If you rename a file to something that looks OK in the filesystem, PmWiki may be able to access it and will try to show these bytes in the Latin1 charset. If the filesystem charset is UTF-8, PmWiki will show "DocumentaciÃ³n" because the bytes 195 and 179 ("ó" in UTF-8) are the characters "Ã" and "³" in Latin1.
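This effect can be reproduced on the command line. The following is only a sketch, assuming iconv is available and that your terminal displays UTF-8: it takes the UTF-8 bytes of the name, pretends they are Latin1, and converts them for display.

```shell
# UTF-8 bytes of "Documentación" (ó = 195 179), written with octal escapes
utf8_name=$(printf 'Documentaci\303\263n')

# Read those bytes as if they were Latin1 and convert them to UTF-8 for
# display: each of the two bytes becomes a separate character.
misread=$(printf '%s' "$utf8_name" | iconv -f ISO-8859-1 -t UTF-8)

echo "$misread"   # on a UTF-8 terminal: DocumentaciÃ³n
```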

Some wiki admins restrict pagenames and filenames to ASCII characters, which are on the same byte positions in most charsets. Then the page is named "Documentacion" and there is a directive (:title Documentación:) in it so that it displays correctly. This is generally more migration-proof than allowing all international characters.

There is a recipe that converts all links to the correct plain letters, see
  http://www.pmwiki.org/wiki/Cookbook/ISO8859MakePageNamePatterns

If you want to go this way, you can write a small bash script on the old server (!!BACKUP. Your. Files. Before!!) that renames the files to ASCII characters, something like this:

 for filename in * ; do
   # Dry run: transliterate the Latin1 name to plain ASCII, rename nothing
   newfilename=$(printf '%s' "$filename" | \
     iconv -f ISO-8859-1 -t ASCII//TRANSLIT -c)
   echo "$filename -> $newfilename"
 done

This will just show you if and how your filenames would be renamed. If you are OK with this, change the script to actually rename the files.
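For the actual rename, one possible variant is sketched below. It is not a tested migration tool: the function name rename_to_ascii is made up, it assumes the names are in ISO-8859-1, and the exact transliteration (for example "o" versus "?" for "ó") depends on your iconv implementation and locale, so check the dry-run output before trusting it. It skips a file rather than overwriting an existing one.

```shell
# Hypothetical helper: rename every file in a directory to an ASCII-only name.
rename_to_ascii() {
  dir=$1
  for path in "$dir"/*; do
    [ -e "$path" ] || continue            # empty directory: nothing to do
    name=${path##*/}
    # Some iconv versions may exit non-zero when they substitute or drop a
    # character, so the status is ignored and the result is checked instead.
    newname=$(printf '%s' "$name" | \
      iconv -f ISO-8859-1 -t ASCII//TRANSLIT -c) || true
    [ -n "$newname" ] || continue         # conversion produced nothing: skip
    [ "$name" = "$newname" ] && continue  # already plain ASCII
    if [ -e "$dir/$newname" ]; then
      echo "skipped: $dir/$newname already exists" >&2
      continue
    fi
    mv -- "$path" "$dir/$newname"
  done
}

# Possible use, on a COPY of the page directory first:
#   rename_to_ascii wiki.d
```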

Then install the recipe ISO8859MakePageNamePatterns and test if the wiki works on the old server. If it does, place (:title Correct title:) in the pages where the accents were lost, and copy the wiki.d directory and local/config.php to the new server.

Another note: the encoding of the config.php file also matters. If your wiki is in ISO-8859-1, save your file in that encoding and not in, e.g., UTF-8. You must use a text editor that lets you select the encoding of the files. See http://www.pmwiki.org/wiki/PmWiki/LocalCustomizations#encoding .

Good luck,
Petko
