Re: MHonArc and multi-byte characters in HTML

Earl Hood Thu, 04 Oct 2001 12:54:39 -0700

Since it appears there are still some open issues with multi-byte character
support, mainly with Japanese character support, I will not hold up
the 2.5.0 release waiting for a resolution.  Reason: MHonArc still
functions as it has done in the past with respect to the issue.  If
patches do happen to come in before I release 2.5.0, I will try to
include them.  If not, they will be applied to a future release.


I would definitely like to have MHonArc updated to do things properly
with respect to multi-byte characters at some time.  For the Japanese
character data issue currently raised, I'd like patches to address
the following:

. Existing, correct, functionality is not broken.

. If variability exists, then configurable options are provided to
  the user.  This is achievable for filters by adding argument options
  to control behavior.  Note, if there is a way to "do the right
  thing" automatically, then it should be considered.

. If any patches require the use of non-standard Perl modules (i.e.
  modules that are not included with the standard Perl distribution),
  then the functionality MUST be optional, and Perl must not abort
  because a module failed to load.  I will consider auto-including
  external modules with MHonArc if such inclusion can be managed
  easily.

. Since I do not read Japanese, I can do very little in verifying the
  correctness of any contributions related to Japanese text
  processing.  I'm hoping those qualified to verify the contributions
  will do so.
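
One common way to meet the optional-module requirement listed above is
to wrap the require in an eval and fall back to the existing behavior
when the load fails.  The following is only a minimal sketch: try_load,
convert_text, and the module name Hypothetical::Japanese::Codec are
placeholders invented for illustration, not part of MHonArc or of any
submitted patch.

```perl
use strict;
use warnings;

# Try to load a module by name.  Returns 1 on success and 0 on failure;
# a failed require is trapped by eval, so the program never aborts.
sub try_load {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Placeholder module name (hypothetical); it is expected to be absent.
my $have_codec = try_load('Hypothetical::Japanese::Codec');

# Hypothetical conversion hook: use the optional module only if it
# loaded, otherwise return the text untouched (existing behavior).
sub convert_text {
    my ($text) = @_;
    return $text unless $have_codec;
    return Hypothetical::Japanese::Codec::convert($text);
}

print "codec available: ", ($have_codec ? "yes" : "no"), "\n";
print convert_text("plain text"), "\n";
```

With the placeholder module missing, the script still runs to
completion and the text passes through unchanged.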

I'm unsure how to deal with the string clipping issue with respect to
resource variables: e.g. $SUBJECT:72$.  I see this as a fundamental issue
with Perl itself since there is no built-in string type that abstracts
this problem (like strings in Java) in a simple and efficient manner,
yet.  An approach that would ignore the problem but make sure nothing
bad happens is to change all default resource settings to not use
the clipping support in resource variables.  Any clipping must then
be explicitly specified, with the user advised of the problems that
multi-byte character encodings may cause.  I believe I will make
this kind of change to the default resource settings for v2.5.
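
To illustrate the clipping problem described above: cutting an encoded
string at a fixed byte count can split a multi-byte character in half.
The sketch below assumes a Perl recent enough to ship the Encode module
and assumes the data is UTF-8; neither assumption comes from this
message, and clip_chars is a hypothetical helper, not MHonArc code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: clip a UTF-8 encoded byte string to at most
# $max characters (not bytes) and return it re-encoded as UTF-8.
sub clip_chars {
    my ($bytes, $max) = @_;
    my $chars = decode('UTF-8', $bytes);      # bytes -> character string
    return encode('UTF-8', substr($chars, 0, $max));
}

# Three Japanese characters encoded as UTF-8: 3 characters, 9 bytes.
my $subject = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

# Byte-wise clip: keeps the first character plus one stray byte of
# the second character, which is no longer valid UTF-8.
my $byte_clip = substr($subject, 0, 4);

# Character-wise clip: keeps two whole characters (6 bytes).
my $char_clip = clip_chars($subject, 2);

printf "byte clip: %d bytes\n", length $byte_clip;
printf "char clip: %d bytes\n", length $char_clip;
```

The byte-wise result ends with a dangling lead byte, which is the kind
of damage a width-limited resource variable could cause when applied to
multi-byte text.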

When I get time, I'll recheck the status of UTF8 support in Perl.  A
major issue is having conversion support to and from UTF8 character
encodings.  Note, any change to Unicode in MHonArc could ripple through
the entire existing code base and may require a significant rewrite.

--ewh
