Re: Problems with new "long" links, UTF-8, allowed punctuation...

Christophe Dupriez Fri, 15 Feb 2008 09:44:39 -0800

Hi Janne!

The purpose is simple. Currently, you have code like:


    if( Character.isLetterOrDigit(ch) ||
        PUNCTUATION_CHARS_ALLOWED.indexOf(ch) != -1 )

This allows UTF-8 letters and digits but no UTF-8 punctuation, which is why
I allowed ch >= 0x80 in several places. Europeans know that if you cut and
paste MS-Word text into a form textarea, you get a lot of special UTF-8
punctuation. And if it ends up in a WikiName, the problems begin...
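
To make the point concrete, here is a minimal sketch (not JSPWiki code) that
compares the test quoted above with the relaxed ch >= 0x80 variant. The class
and method names are hypothetical; the allowed-punctuation string is the
MarkupParser value quoted further down in this thread.

```java
class PunctuationCheck {
    // Value of MarkupParser.PUNCTUATION_CHARS_ALLOWED as quoted in this thread.
    static final String PUNCTUATION_CHARS_ALLOWED = " ()&+,-=._$";

    // The existing test: letters, digits and a fixed set of ASCII punctuation.
    static boolean acceptedByOldTest(char ch) {
        return Character.isLetterOrDigit(ch)
            || PUNCTUATION_CHARS_ALLOWED.indexOf(ch) != -1;
    }

    // The relaxed test: additionally accept every non-ASCII character.
    static boolean acceptedByNewTest(char ch) {
        return ch >= 0x80 || acceptedByOldTest(ch);
    }

    public static void main(String[] args) {
        // e-acute, right single quote, left double quote, ellipsis, '!'
        char[] samples = { '\u00E9', '\u2019', '\u201C', '\u2026', '!' };
        for (char ch : samples) {
            System.out.printf("U+%04X old=%b new=%b%n",
                (int) ch, acceptedByOldTest(ch), acceptedByNewTest(ch));
        }
    }
}
```

Accented letters pass both tests, while typographic punctuation such as
U+2019 only passes the relaxed one. ASCII characters outside the allowed set
are still rejected by both.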

The other point is that PUNCTUATION_CHARS_ALLOWED differs between JSPWiki
modules.

Finally, multiple spaces within a "long" name should be collapsed into one,
and the same goes for multiple dots. Trailing dots also generate names like
"xxxx..txt", which Windows does not like much.

These are the reasons for the modifications I made.

I see from the different reactions that I have to go through JIRA... I will do 
it as soon as possible.

Have a nice week-end!

Christophe



Janne Jalkanen wrote:

I don't understand these at all either.  JSPWiki is and has been for
the past four years or so, fully UTF-8 compatible.   If you're not,
it's between you and your servlet container.

Camelcases are already a configurable property (and, I seem to recall,
we ship with camelcase OFF by default...)

/Janne

On Fri, Feb 15, 2008 at 08:11:14AM -0500, Andrew Jaquith wrote:

Christophe --
I do not understand the purpose of this e-mail. Are these bugs you are trying to correct? Or proposed enhancements? In either case, JIRA is the right place to file them.
Andrew
On Feb 15, 2008, at 3:50, Christophe Dupriez <[EMAIL PROTECTED]> wrote:
Hi again!
I spent a few days implementing:
http://www.destin.be/CAFE
It is "La Clé", a dictionary of literary devices whose entry headings
are terms with a lot of punctuation and accented letters.
First, all my CONGRATULATIONS for departing from the WikiNameCamelCase
paradigm: at the Poison Centre, chemical names look really bad when
camelCased, and for "La Clé" it was simply not an option. It still seems
to be a (difficult) work in progress, so please find below my
contribution to debugging it.
Suggestion: backward compatibility (camelCasing) could be a configurable
property. This would avoid having to maintain complex code for both
approaches in parallel. A wiki would then be either "traditional" or
"unrestricted" (for unrestricted names). A conversion program could
allow going from one to the other (for those who need it); this program
would probably not be lossless when going from "unrestricted" to
"traditional".

So, for this conversion, I ran many tests, changed the data where
acceptable and (minimally) changed JSPWiki where I had to. I provide
herewith the source code for 2.6.1. The changes are very localized: with
WinMerge, one sees what is happening in seconds.
I still have problems with page renaming and with page names in forms, so
these corrections are not sufficient for a release.
The final conversion of imported ASCII characters (within names) that I
implement is:
' ': ONE space is kept (sequences of two or more spaces become only one).
Spaces at the beginning and end of the name are completely removed.

'.': ONE dot is kept (sequences of two or more dots become only one:
this is to protect Windows, which does not like ".." in file names).
Dots at the end of the name must be completely removed (this prevents
Windows from mishandling a file name containing "..txt").

'[': '(' : square brackets are link markup delimiters...
']': ')'
'|': "=" : vertical bars delimit the parts of a link definition. They
are replaced by "="
"'": 0xE2,0x80,0x99 : the ASCII quote is replaced by the UTF-8
apostrophe (like the one MS-Word generates in French texts). Some help
for this will be necessary in the Wiki page editors.

':': "=" : this is the InterWiki prefix delimiter. I replace it by "="
for now but I would prefer to have ":<space>" accepted in the future...
(some code already provided for this)
'/': introduces an attachment and it is better not to use it (for now: I
began to add support to accept /<space> within a name)

'\': is systematically removed. Why?
'`' (0x60): is systematically removed. Why?
'~' (0x7E): is systematically removed. Why?
'!': is systematically removed. Why?
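
The conversion rules above can be sketched as a single function. This is an
illustration only, not the code from the modified 2.6.1 sources: the class
and method names are hypothetical, and stripping trailing spaces and dots in
one pass is an assumption about how the two "end of name" rules combine. The
'/' character and the four characters that are "systematically removed" are
deliberately left untouched.

```java
class PageNameNormalizer {
    // Hypothetical helper applying the conversions listed above to an
    // imported name.
    static String normalize(String name) {
        String s = name
            .replace('[', '(')        // square brackets are link delimiters
            .replace(']', ')')
            .replace('|', '=')        // separates the parts of a link
            .replace(':', '=')        // InterWiki prefix delimiter
            .replace('\'', '\u2019'); // ASCII quote -> typographic apostrophe

        s = s.replaceAll(" {2,}", " ");   // runs of spaces become one space
        s = s.replaceAll("\\.{2,}", "."); // runs of dots become one dot
        s = s.replaceAll("^ +", "");      // no spaces at the beginning
        s = s.replaceAll("[ .]+$", "");   // no spaces or dots at the end
        return s;
    }

    public static void main(String[] args) {
        System.out.println(normalize("  L'art  [de]  la..  guerre... "));
    }
}
```

U+2019 is the character whose UTF-8 encoding is the byte sequence
0xE2,0x80,0x99 mentioned in the list.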

The main change I had to make to JSPWiki was to make it accept ALL
non-ASCII characters (code >= 0x80) in page names (not only the alphabetic
ones).

This occurs in:

1) In TranslatorReader.java, method cleanLink:
    for( int i = 0; i < clean.length(); i++ )
    {
        char ch = clean.charAt(i);

        if( !(ch >= 0x80 || // All non ASCII are allowed!!!
              Character.isLetterOrDigit(ch) ||
              PUNCTUATION_CHARS_ALLOWED.indexOf(ch) != -1 ))
        {
            clean.deleteCharAt(i);
            --i; // We just shortened this buffer.
        }
    }

2) In MarkupParser.java, method cleanLink:
    //
    // Check if it is allowed to use this char, and capitalize, if necessary.
    //
    if( ch >= 0x80 || // All non ASCII are allowed!!!
        Character.isLetterOrDigit( ch ) ||
        allowedChars.indexOf(ch) != -1 )
    {
        // Is a letter

        if( isWord ) ch = Character.toUpperCase( ch );
        clean.append( ch );
        isWord = false;
    }
    else
    {
        isWord = true;
    }
Two bugs were corrected in the UTF-8 handling in DefaultURLConstructor.java:
    public String parsePage( String context,
                             HttpServletRequest request,
                             String encoding )
        throws UnsupportedEncodingException
    {
        request.setCharacterEncoding( encoding );
        String pagereq = request.getParameter( "page" );

        if( context.equals(WikiContext.ATTACH) )
        {
            pagereq = parsePageFromURL( request, encoding );
        }
        else pagereq = TextUtil.urlDecode( pagereq, encoding ); // !!!! I am
        // unsure if this works when editing a page name within a POSTed form ???
        log.debug("parsePage: "+encoding+":"+pagereq);

        return pagereq;
    }
!!! AND ALSO, below, in parsePage, I uncommented the line:
name = TextUtil.urlDecode( name, encoding );
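
As an illustration of why the encoding passed to the decoder matters, here is
a small sketch using java.net.URLDecoder as a stand-in for
TextUtil.urlDecode (an assumption: the two are not claimed to behave
identically). The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class PageNameDecoding {
    // Decode a percent-encoded page name with the given charset name.
    static String decode(String raw, String encoding) {
        try {
            return URLDecoder.decode(raw, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // "La Clé" as a browser would send it when the page is UTF-8.
        String raw = "La+Cl%C3%A9";
        System.out.println(decode(raw, "UTF-8"));      // the intended name
        System.out.println(decode(raw, "ISO-8859-1")); // two characters replace the "é"
    }
}
```

Decoding the same bytes as ISO-8859-1 turns each UTF-8 byte into its own
character, which is the kind of damage that shows up when a form is decoded
with the wrong charset.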
I notice a few discrepancies between the different classes working with
page names:
- PageRenamer.java:
    private static final String LONG_LINK_PATTERN =
        "\\[([\\w\\s]+\\|)?([\\w\\s\\+-/\\?&;@:=%\\#<>$\\.,\\(\\)'\\*]+)?\\]";
In MarkupParser.java:
public static final String PUNCTUATION_CHARS_ALLOWED = " ()&+,-=._$";
!!! NOT FULLY IN LINE WITH PageRenamer.java, WHICH ALLOWS:
" ()&+,-=._$/?;@:%#<>'*" (space is in \s and _ in \w)
!!! WHY DO WE FORBID CHARACTERS OTHER THAN "|", ":" (":<space>" should
be allowed) AND "]"?
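
The mismatch can be checked mechanically. The sketch below is an assumption-
laden illustration, not JSPWiki code: the character class is modelled on the
link-target part of LONG_LINK_PATTERN, written with escaped parentheses to
match the allowed list given above, and the class and method names are
hypothetical.

```java
import java.util.regex.Pattern;

class AllowedPunctuationDiff {
    // Character class modelled on the link-target part of LONG_LINK_PATTERN.
    // Note that "\\+-/" is a range covering + , - . and /
    static final Pattern RENAMER_CLASS =
        Pattern.compile("[\\w\\s\\+-/\\?&;@:=%\\#<>$\\.,\\(\\)'\\*]");

    // Value quoted from MarkupParser.java in this thread.
    static final String PUNCTUATION_CHARS_ALLOWED = " ()&+,-=._$";

    // Printable ASCII non-alphanumerics accepted by the renamer class
    // but absent from PUNCTUATION_CHARS_ALLOWED.
    static String onlyInRenamer() {
        StringBuilder sb = new StringBuilder();
        for (char ch = 0x20; ch < 0x7F; ch++) {
            if (Character.isLetterOrDigit(ch)) continue;
            boolean inRenamer = RENAMER_CLASS.matcher(String.valueOf(ch)).matches();
            boolean inParser = PUNCTUATION_CHARS_ALLOWED.indexOf(ch) != -1;
            if (inRenamer && !inParser) sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(onlyInRenamer());
    }
}
```

Under these assumptions the program reports the characters that the renamer
pattern accepts but the parser would strip, which is where renamed links and
parsed links can disagree.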
I still notice some problems with renaming and with forms where UTF-8 is decoded as ISO 8859.
- PageRenamer.java: I have problems with renaming references to "long" names: "Null link while trying to rename! Culprit text is ..."
in com.ecyrd.jspwiki.PageRenamer.replaceLongLinks(), line 330


That is all for today!
THANKS FOR EVERYTHING!

Christophe
