Hi Janne!

The purpose is simple. Currently, you have code like:

if( Character.isLetterOrDigit(ch) ||
    PUNCTUATION_CHARS_ALLOWED.indexOf(ch) != -1 )

This allows UTF-8 letters and digits but no UTF-8 punctuation. That is why I
allowed ch >= 0x80 in several places. Europeans know that when you cut and
paste MS-Word text into a form textarea, you get a lot of special UTF-8
punctuation. And if that text ends up in a WikiName, problems begin...
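
To make the difference concrete, here is a minimal sketch. The class and method names are made up for this mail and are not JSPWiki code; the punctuation string is the PUNCTUATION_CHARS_ALLOWED value from MarkupParser.java quoted further down.

```java
class PunctuationCheck {
    /** Original test: a letter, a digit, or one of the listed ASCII punctuation chars. */
    static boolean allowedBefore(char ch, String punctuationAllowed) {
        return Character.isLetterOrDigit(ch) || punctuationAllowed.indexOf(ch) != -1;
    }

    /** Relaxed test: additionally accept every char at or above 0x80. */
    static boolean allowedAfter(char ch, String punctuationAllowed) {
        return ch >= 0x80 || allowedBefore(ch, punctuationAllowed);
    }

    public static void main(String[] args) {
        String allowed = " ()&+,-=._$";
        char eAcute = '\u00E9';      // accented letter
        char apostrophe = '\u2019';  // typographic apostrophe, as pasted from MS-Word

        System.out.println(allowedBefore(eAcute, allowed));      // true: it is a letter
        System.out.println(allowedBefore(apostrophe, allowed));  // false: punctuation, not listed
        System.out.println(allowedAfter(apostrophe, allowed));   // true: code >= 0x80
    }
}
```

So accented letters already pass the original test; it is only the non-ASCII punctuation that gets dropped.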

The other point is that PUNCTUATION_CHARS_ALLOWED differs between
JSPWiki modules.

Finally, multiple spaces within a "long" name should be collapsed into one.
Same thing for multiple dots.
Trailing dots also generate file names like "xxxx..txt", which Windows does
not like so much.

This is the reason for the modifications I made.

I see from the different reactions that I have to go through JIRA... I will do 
it as soon as possible.

Have a nice week-end!

Christophe



Janne Jalkanen wrote:
I don't understand these at all either.  JSPWiki is and has been for
the past four years or so, fully UTF-8 compatible.   If you're not,
it's between you and your servlet container.

Camelcases are already a configurable property (and, I seem to recall,
we ship with camelcase OFF by default...)

/Janne

On Fri, Feb 15, 2008 at 08:11:14AM -0500, Andrew Jaquith wrote:
Christophe --

I do not understand the purpose of this e-mail. Are these bugs you are trying to correct? Or proposed enhancements? In either case, JIRA is the right place to file them.

Andrew

On Feb 15, 2008, at 3:50, Christophe Dupriez <[EMAIL PROTECTED]> wrote:
Hi again!

I spent a few days implementing:
http://www.destin.be/CAFE

It is "La Clé", a dictionary of literary devices whose entry headings
are terms with a lot of punctuation and accented letters.

First, all my CONGRATULATIONS for departing from the WikiName CamelCase
paradigm: at the Poison Centre, chemical names look really bad when
camelCased, and for "La Clé" it was simply not an option. It still seems to
be a (difficult) work in progress, so please find below my
contribution to debugging it.

Suggestion: backward compatibility (camelCasing) could be a configurable
property. This would avoid having complex code that maintains both
approaches in parallel. A wiki would then be either "traditional" or
"unrestricted" (for unrestricted names). A conversion program could
convert from one to the other (for those who need it): this program
would probably not be lossless when going from "unrestricted" to
"traditional".

So, for this conversion, I ran many tests, changed the data where
acceptable and (minimally) changed JSPWiki where I had to: I provide
herewith the source code for 2.6.1. The changes are very localized: with
WinMerge, one sees what is happening in seconds.

I still have problems with page renaming and with page names in forms, so
the corrections provided here are not sufficient for a release.

The final conversion of imported ASCII characters (within names) that I
implemented is:

' ': ONE space is kept (sequences of two or more spaces become only one).
Spaces at the beginning and end of the name are completely removed.

'.': ONE dot is kept (sequences of two or more dots become only one:
this is to protect Windows, which does not like ".." in file names).
Dots at the end of the name must be completely removed (this prevents
Windows from mishandling a file name containing "..txt").

'[': '(' : square brackets are the link markup delimiters...
']': ')'

'|': "=" : vertical bars delimit the parts of a link definition. They
are replaced by "=".
"'": 0xE2,0x80,0x99 : The ASCII quote is replaced by the UTF-8
apostrophe (U+2019, like the one MS-Word generates in French texts). Some
help for this will be necessary in the Wiki page editors.

':': "=" : this is the InterWiki prefix delimiter. I replace it by "="
for now, but I would prefer to have ":<space>" accepted at some point in
the future (some code is already provided for this).

'/': introduces an attachment, so it is better not to use it (for now: I
began to add support for accepting "/<space>" within a name).

'\': is systematically removed. Why?
'`' (0x60): is systematically removed. Why?
'~' (0x7E): is systematically removed. Why?
'!': is systematically removed. Why?
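
To make the rules above unambiguous, here is a minimal sketch of them as a single method. It is not the code I changed in JSPWiki: the class and method names are invented for this mail, '/' is simply left untouched, and the four characters listed last are dropped only to mirror the behaviour I observed.

```java
class PageNameNormalizer {
    /** Applies the conversion rules listed above to one imported name. */
    static String normalize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char ch = name.charAt(i);
            switch (ch) {
                case '[': ch = '('; break;        // link markup delimiters
                case ']': ch = ')'; break;
                case '|':                         // link part delimiter
                case ':': ch = '='; break;        // InterWiki prefix delimiter
                case '\'': ch = '\u2019'; break;  // ASCII quote -> typographic apostrophe
                case '\\': case '`': case '~': case '!':
                    continue;                     // mirrors the observed removal
                default: break;
            }
            // Keep only ONE space or dot out of a run of identical ones.
            int last = out.length() - 1;
            if ((ch == ' ' || ch == '.') && last >= 0 && out.charAt(last) == ch) {
                continue;
            }
            out.append(ch);
        }
        // Remove leading spaces, then trailing spaces and trailing dots.
        int start = 0;
        int end = out.length();
        while (start < end && out.charAt(start) == ' ') start++;
        while (end > start && (out.charAt(end - 1) == ' ' || out.charAt(end - 1) == '.')) end--;
        return out.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(normalize("  a  [b]|c:d..  "));  // a (b)=c=d
        System.out.println(normalize("xxxx.."));            // xxxx
    }
}
```

With this, a name ending in dots can no longer produce a "..txt" file name once the extension is appended.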

The main change I had to make to JSPWiki was to have it accept ALL
non-ASCII characters (code >= 0x80) in page names (not only the alphabetic
ones).

This occurs in:

1) In TranslatorReader.java, method cleanLink:
    for( int i = 0; i < clean.length(); i++ )
    {
        char ch = clean.charAt(i);

        if( !(ch >= 0x80 || // All non ASCII are allowed!!!
              Character.isLetterOrDigit(ch) ||
              PUNCTUATION_CHARS_ALLOWED.indexOf(ch) != -1 ))
        {
            clean.deleteCharAt(i);
            --i; // We just shortened this buffer.
        }
    }

2) In MarkupParser.java, method cleanLink:
    //
    // Check if it is allowed to use this char, and capitalize, if necessary.
    //
    if( ch >= 0x80 || // All non ASCII are allowed!!!
        Character.isLetterOrDigit( ch ) ||
        allowedChars.indexOf(ch) != -1 )
    {
        // Is a letter

        if( isWord ) ch = Character.toUpperCase( ch );
        clean.append( ch );
        isWord = false;
    }
    else
    {
        isWord = true;
    }


Two bugs were corrected when encoding UTF-8 in DefaultURLConstructor.java:
    public String parsePage( String context,
                             HttpServletRequest request,
                             String encoding )
        throws UnsupportedEncodingException
    {
        request.setCharacterEncoding( encoding );
        String pagereq = request.getParameter( "page" );

        if( context.equals(WikiContext.ATTACH) )
        {
            pagereq = parsePageFromURL( request, encoding );
        }
        // !!!! I am unsure whether the following line works when a page
        // name is edited within a POSTed form ???
        else pagereq = TextUtil.urlDecode( pagereq, encoding );
        log.debug("parsePage: "+encoding+":"+pagereq);

        return pagereq;
    }
!!! AND ALSO, below, in parsePage, I uncommented the line:
    name = TextUtil.urlDecode( name, encoding );

I notice a few discrepancies between the different classes working with
page names:
- PageRenamer.java:
    private static final String LONG_LINK_PATTERN =
        "\\[([\\w\\s]+\\|)?([\\w\\s\\+-/\\?&;@:=%\\#<>$\\.,\\(\\)'\\*]+)?\\]";
- MarkupParser.java:
    public static final String PUNCTUATION_CHARS_ALLOWED = " ()&+,-=._$";
!!! NOT FULLY IN LINE WITH PageRenamer.java, WHICH ALLOWS:
" ()&+,-=._$/?;@:%#<>'*" ( space is in \s and _ in \w )
!!! WHY DO WE FORBID CHARACTERS OTHER THAN "|", ":" (":<space>" should
be allowed) AND "]"?
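
The difference between the two definitions can be computed rather than read off by eye. A minimal sketch, assuming only the character class from the second group of LONG_LINK_PATTERN and the PUNCTUATION_CHARS_ALLOWED string quoted above; the class and method names are invented for illustration:

```java
import java.util.regex.Pattern;

class AllowedCharsDiff {
    // Character class copied from the second group of LONG_LINK_PATTERN.
    static final Pattern RENAMER_CLASS =
        Pattern.compile("[\\w\\s\\+-/\\?&;@:=%\\#<>$\\.,\\(\\)'\\*]");

    // Copied from MarkupParser.java.
    static final String PUNCTUATION_CHARS_ALLOWED = " ()&+,-=._$";

    /** Printable ASCII non-alphanumerics accepted by PageRenamer but not by MarkupParser. */
    static String onlyInRenamer() {
        StringBuilder extra = new StringBuilder();
        for (char ch = 0x20; ch <= 0x7E; ch++) {
            if (Character.isLetterOrDigit(ch)) continue;  // letters and digits pass in both
            boolean inRenamer = RENAMER_CLASS.matcher(String.valueOf(ch)).matches();
            boolean inParser = PUNCTUATION_CHARS_ALLOWED.indexOf(ch) != -1;
            if (inRenamer && !inParser) extra.append(ch);
        }
        return extra.toString();
    }

    public static void main(String[] args) {
        System.out.println(onlyInRenamer());  // #%'*/:;<>?@
    }
}
```

These are the same eleven characters as in the list above, just in ASCII order.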

I still notice some problems with renaming and with forms where UTF-8 is decoded as ISO 8859.
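
That symptom is what appears when bytes written as UTF-8 are read back with a single-byte ISO 8859 charset. A minimal sketch using only the standard java.net classes; ISO-8859-1 is an assumption here (I have not checked which ISO 8859 part is actually involved), and the class and method names are invented for illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class DecodeMismatch {
    /** Encodes a page name as UTF-8 for a URL, then decodes it with the given charset. */
    static String roundTrip(String pageName, String decodeCharset) {
        try {
            String encoded = URLEncoder.encode(pageName, "UTF-8");
            return URLDecoder.decode(encoded, decodeCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Unknown charset: " + decodeCharset, e);
        }
    }

    public static void main(String[] args) {
        String name = "La Cl\u00E9";  // "La Clé"

        // Same charset on both sides: the name survives.
        System.out.println(roundTrip(name, "UTF-8").equals(name));  // true

        // The two UTF-8 bytes of the accented letter (0xC3 0xA9) are read
        // as two separate ISO-8859-1 characters, U+00C3 and U+00A9.
        System.out.println(
            roundTrip(name, "ISO-8859-1").equals("La Cl\u00C3\u00A9"));  // true
    }
}
```

If a renamed page or a form field shows two odd characters where one accented letter was typed, this kind of mismatch is the first thing to check.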

- PageRenamer.java: I have problems when renaming references to "long" names: "Null link while trying to rename! Culprit text is ..."
in com.ecyrd.jspwiki.PageRenamer.replaceLongLinks(), line 330


That is all for today!
THANKS FOR ALL!

Christophe
