After spending more than a day analysing the code and trying to find a
way to arrive at a true utf-8 implementation for page names, I would
like to share some conclusions:
I discovered that there are numerous patches in many functions to
handle url encoding and decoding (what you called % encoding). Some
of these seem to undo what was done before, only for it to be done
again later. The dependencies are quite complex, and it often surprised
me to find yet another encoding or decoding patch in a function.
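
To illustrate why repeated encoding and decoding is fragile, here is a
minimal sketch using only PHP built-ins. It is not BoltWire code, and
the variable names are made up for the example. It shows that encoding
an already encoded name changes it again, so every function has to know
which form it is being handed:

```php
<?php
// Illustration only; none of this is BoltWire code.
$name = '許功蓋';                   // raw utf-8 page name

$once  = rawurlencode($name);       // %E8%A8%B1%E5%8A%9F%E8%93%8B
$twice = rawurlencode($once);       // each '%' is itself encoded as '%25'

// One decode undoes exactly one encode.
$decodedOnce  = rawurldecode($twice);        // equals $once, still encoded
$decodedTwice = rawurldecode($decodedOnce);  // equals $name again

echo $once, "\n", $twice, "\n", $decodedTwice, "\n";
```

If the functions agreed on one internal form, encoding and decoding
would probably only be needed where names enter and leave the code.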
I have come to the conclusion that it is far from easy to implement
true utf-8 page names (percent encoded utf-8 characters in page and
group names). I think it is premature to push BW to a version 3
release. I think it would be wiser to address the whole issue of
utf-8 in urls and page names first, and not in the present manner of
applying patches here and there. Only that way can we arrive at code
which is both robust and easier to maintain.
I have given a few suggestions to fix this and that in this thread. But
in my endeavours yesterday to go further and resolve the still
outstanding issues with utf-8 in page names, links and search
templates, I found myself fixing code not just in one function here and
there, but in multiple functions. Cleaning up the process in one place
led me to discover that it broke in several others, because they
depend on the existing patches.
This is a complex matter, and I do not have final solutions. But here
is a glimpse of my "research":
I suggested a fix to introduce urldecode in function BOLTdisplayFmt()
for {+pn} items. This was to fix the issue reported by Linly.
But to get clean output from that function, urldecode needs to be
applied also to $item, for {+p}.
Once that is done, the link markup function BOLTMlinks() has problems
with utf-8 characters.
Analysing this further reveals that function BOLTpageshortcuts() uses
urlencode on links (via BOLTutf2url()).
Eliminating this encoding, I get nice utf-8 characters in link markup,
but BOLTMlinks() cannot handle these.
Here I quote some code from the function, as I left it yesterday in test mode:
    echo "<br />".$link;
    $link = BOLTpageshortcuts($link);
    echo " ==> ".$link;
    # if (!preg_match('/^[-_a-zA-Z0-9\.%]+(\#|\&|$)/', $link)) return BOLTtranslate(BOLTinfoVar('site.messages', 'invalid_link', 'Invalid link.'));
    # if ($link != BOLTlowercase($x) && $label == $x) $label = BOLTurl2utf($link);
    # if (! BOLTexists($link) && strpos($link, '&') === false) $missingPage = $BOLTmissingMark;
    if ($label == '+') $label = BOLTvars("$link:title");
    echo "<br />".$link." ==> ".$label;
I commented out three lines, because they broke the utf-8 characters in
links.
1. The first restricts links to ASCII letters and digits plus
underscore, hyphen, dot and percent sign, so raw utf-8 characters are
rejected.
2. The second messes up utf-8 characters with BOLTlowercase($x), I
think (I have not analysed that function). If you have utf-8
characters you need to be extra careful with things like lowercase and
uppercase conversion; you can no longer assume that you have only
ASCII characters.
3. The third fails utf-8 characters, I assume because of strpos($link,
'&') or BOLTexists($link) (not analysed yet).
4. The fourth line (not commented out) is the one I was testing, as I
found that BOLTvars("$link:title") did not return the title for a page
name correctly.
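
Points 1 and 2 could perhaps be approached along the following lines.
This is only a sketch under assumptions: isValidLinkName() and
lowercaseName() are hypothetical names, not existing BoltWire
functions, and lowercaseName() assumes the mbstring extension is
available. The pattern is the original one with Unicode letters and
digits added and the /u modifier switched on:

```php
<?php
// Hypothetical stand-in for the commented-out preg_match() check.
function isValidLinkName(string $link): bool
{
    // \p{L} and \p{N} match any Unicode letter or digit. The /u modifier
    // makes PCRE treat pattern and subject as utf-8; with malformed
    // utf-8 preg_match() returns false, so the name is rejected.
    return preg_match('/^[-_\p{L}\p{N}.%]+(#|&|$)/u', $link) === 1;
}

// Hypothetical multi-byte aware lowercase (requires mbstring).
function lowercaseName(string $name): string
{
    // Works on characters, not bytes; scripts without case, such as
    // Chinese, pass through unchanged.
    return mb_strtolower($name, 'UTF-8');
}

var_dump(isValidLinkName('一.許功蓋'));      // raw utf-8 page name
var_dump(isValidLinkName('main.page#top'));  // ASCII name with anchor
var_dump(isValidLinkName('bad name'));       // space is still rejected
```

Whether raw utf-8 should be accepted at this point at all depends on
which form (encoded or decoded) the rest of BOLTMlinks() expects.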
For testing I mostly used two pages created with Chinese characters as
page names (urlencoded), and the search markup with the custom fmt=
that Linly mentioned. For one link, the added echo statements return
the following, as delivered by
[(search sort=lastmodified count=20 fmt="* [[{+p}|+]] [{+p1}]")]
一.許功蓋 ==> 一.許功蓋 (first and second echo in code above)
一.許功蓋 ==> 許��蓋 (third echo, label gets mangled)
Anyway, I got this far. I can check the utf-8 characters via echo
statements, see where in the process they get broken etc.
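
For anyone who wants to repeat this kind of tracing, a small helper
like the one below might make the echo output easier to read. It is a
hypothetical debugging aid, not part of BoltWire: it prints the value
together with its bytes in hex and says whether the bytes still form
valid utf-8 (using the fact that preg_match() with the /u modifier
fails on malformed input).

```php
<?php
// Hypothetical debugging helper; the name traceUtf8 is made up.
function traceUtf8(string $stage, string $value): string
{
    // preg_match('//u', ...) returns 1 for well-formed utf-8 and
    // false for malformed byte sequences.
    $status = preg_match('//u', $value) === 1 ? 'valid' : 'BROKEN';
    return sprintf('%s: %s [%s] %s', $stage, $value, $status, bin2hex($value));
}

echo traceUtf8('intact', '許功蓋'), "\n";
// substr() counts bytes, so cutting after 5 bytes splits the second
// character and leaves an invalid sequence.
echo traceUtf8('cut mid-character', substr('許功蓋', 0, 5)), "\n";
```

Calling it before and after each suspect function should show the first
step at which the bytes change.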
But clearly multiple problems with url encoding and decoding need to be
addressed in multiple functions. I have probably just scratched the tip
of the iceberg here, or hopefully it turns out not to be too big a
berg.... but the link markup function especially gives me a headache,
as there are so many exceptions in the code.
If you cannot see Chinese characters in your email or browser, you
will not know what I am talking about.
To develop the code for allowing utf-8 characters for page names you
need to see these.
Cheers, thanks for listening!
~Hans