Hi again!
I spent a few days implementing:
http://www.destin.be/CAFE
It is "La Clé", a dictionnary of litterary devices with entry headin
gs
being terms with a lot of punctuation and accented letters.
First, all my CONGRATULATIONS for departing from the WikiName CamelCase
paradigm: at Poison Centre, the chemical names are really bad when
camelCased, and for "La Clé" it was simply not an option. It still seems
to be a (difficult) work in progress, so please find below my
contribution to debugging it.
Suggestion: backward compatibility (camelCasing) could be a configurable
property. This would prevent having to maintain complex code for both
approaches in parallel. A wiki would then be either "traditional" or
"unrestricted" (for unrestricted names). A conversion program could
allow going from one to the other (for those who need it); this program
would probably not be lossless when going from "unrestricted" to
"traditional".
So, for this conversion, I made many tests, changed the data where
acceptable, and (minimally) changed JSPWiki when I had to: I provide
herewith the source code for 2.6.1. The changes are very localized: with
WinMerge, one sees what is happening in seconds.
I still have problems with page renaming and with page names in forms,
so the corrections provided here are not sufficient for a release.
The final conversion of imported ASCII characters (within names) that I
implement is the following (a sketch of the whole conversion follows the
list):

' ' : ONE space is kept (sequences of two or more spaces become only one).
      Spaces at the beginning and the end of the name are removed
      completely.
'.' : ONE dot is kept (sequences of two or more dots become only one;
      this protects Windows, which does not like ".." in file names).
      Dots at the end of the name are removed completely (this prevents
      Windows from mishandling a file name containing "..txt").
'[' : becomes '(' : square brackets are the link markup delimiters...
']' : becomes ')'.
'|' : becomes '=' : vertical bars delimit the parts of a link definition.
"'" : becomes 0xE2,0x80,0x99 : the ASCII quote is replaced by the UTF-8
      apostrophe (like the one MS Word generates in French texts). Help
      for this will be necessary in the wiki page editors.
':' : becomes '=' : the colon is the InterWiki prefix delimiter. I
      replace it by '=' for now, but I would prefer to have ":<space>"
      accepted in some future... (some code is already provided for this).
'/' : introduces an attachment, so it is better not to use it (for now;
      I began adding support to accept "/<space>" within a name).
'\' : is systematically removed. Why?
'`' (0x60) : is systematically removed. Why?
'~' (0x7E) : is systematically removed. Why?
'!' : is systematically removed. Why?
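As a minimal sketch, the whole conversion above could look like this (my
own illustration, not the actual patch; the method name is mine, and '/'
is deliberately left alone, as described above):

    // Sketch of the imported-name conversion described above.
    public static String convertImportedName( String name )
    {
        name = name.trim();                        // no leading/trailing spaces
        name = name.replaceAll( " {2,}", " " );    // collapse runs of spaces
        name = name.replaceAll( "\\.{2,}", "." );  // collapse runs of dots
        name = name.replaceAll( "\\.+$", "" );     // no trailing dots (Windows)

        name = name.replace( '[', '(' ).replace( ']', ')' ); // markup delimiters
        name = name.replace( '|', '=' );           // link-definition separator
        name = name.replace( ':', '=' );           // InterWiki prefix delimiter
        name = name.replace( '\'', '\u2019' );     // ASCII quote -> apostrophe

        return name.replaceAll( "[\\\\`~!]", "" ); // remove \ ` ~ ! outright
    }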
The main change I had to make to JSPWiki was to have it accept ALL
non-ASCII characters (code >= 0x80) in page names (not only the
alphabetic ones). This happens in two places:
1) In TranslatorReader.java, method cleanLink:

    for( int i = 0; i < clean.length(); i++ )
    {
        char ch = clean.charAt(i);

        if( !(ch >= 0x80 ||   // All non-ASCII characters are allowed!!!
              Character.isLetterOrDigit(ch) ||
              PUNCTUATION_CHARS_ALLOWED.indexOf(ch) != -1) )
        {
            clean.deleteCharAt(i);
            --i; // We just shortened this buffer.
        }
    }
2) In MarkupParser.java, method cleanLink:

    //
    //  Check if it is allowed to use this char, and capitalize, if
    //  necessary.
    //
    if( ch >= 0x80 ||   // All non-ASCII characters are allowed!!!
        Character.isLetterOrDigit( ch ) ||
        allowedChars.indexOf( ch ) != -1 )
    {
        // Is a letter
        if( isWord ) ch = Character.toUpperCase( ch );
        clean.append( ch );
        isWord = false;
    }
    else
    {
        isWord = true;
    }
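To see why the ch >= 0x80 test matters, here is the character filter
from the patches above run on a sample heading containing the UTF-8
apostrophe introduced by the conversion (a standalone check of mine, not
part of the patch):

    public class CleanLinkDemo {
        public static void main( String[] args ) {
            final String PUNCTUATION_CHARS_ALLOWED = " ()&+,-=._$";

            // U+2019 is non-ASCII *punctuation*: isLetterOrDigit() is
            // false and it is not in the allowed list, so without the
            // ch >= 0x80 test it would silently disappear from the name.
            StringBuffer clean = new StringBuffer( "L\u2019Épithète homérique!" );

            for( int i = 0; i < clean.length(); i++ )
            {
                char ch = clean.charAt(i);

                if( !(ch >= 0x80 ||
                      Character.isLetterOrDigit(ch) ||
                      PUNCTUATION_CHARS_ALLOWED.indexOf(ch) != -1) )
                {
                    clean.deleteCharAt(i);
                    --i;   // we just shortened the buffer
                }
            }
            System.out.println( clean );   // '!' is dropped, the rest stays
        }
    }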
Two bugs were corrected when encoding UTF-8 in
DefaultURLConstructor.java (the changed line is marked with !!!!; I am
unsure whether it works when editing a page name within a POSTed
form???):

    public String parsePage( String context,
                             HttpServletRequest request,
                             String encoding )
        throws UnsupportedEncodingException
    {
        request.setCharacterEncoding( encoding );
        String pagereq = request.getParameter( "page" );

        if( context.equals(WikiContext.ATTACH) )
        {
            pagereq = parsePageFromURL( request, encoding );
        }
        else pagereq = TextUtil.urlDecode( pagereq, encoding );    // !!!!

        log.debug("parsePage: "+encoding+":"+pagereq);

        return pagereq;
    }
!!! AND ALSO, below in parsePage, I uncommented the line:

    name = TextUtil.urlDecode( name, encoding );
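A remark on the POSTed-form doubt above, from the Servlet API rather
than from JSPWiki: request.setCharacterEncoding() only affects the
request body (so POSTed forms) and must be called before the first
getParameter(); the query string of a GET is decoded by the container
itself (Tomcat uses ISO-8859-1 for the URI unless URIEncoding is set on
the connector). The extra urlDecode() can therefore run on an
already-decoded value. A tiny standalone illustration:

    import java.net.URLDecoder;

    public class DecodeTwice {
        public static void main( String[] args ) throws Exception {
            // A value that the container has already decoded once:
            String once = URLDecoder.decode( "La%20Cl%C3%A9", "UTF-8" ); // "La Clé"

            // A second decode is only harmless while the name contains
            // no '%' or '+'; a page named "100% vrai" would even throw
            // an IllegalArgumentException here.
            String twice = URLDecoder.decode( once, "UTF-8" );
            System.out.println( once + " / " + twice );
        }
    }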
I notice a few discrepancies between the different classes working with
page names:

- In PageRenamer.java:

    private static final String LONG_LINK_PATTERN =
    "\\[([\\w\\s]+\\|)?([\\w\\s\\+-/\\?&;@:=%\\#<>$\\.,\\(\\)'\\*]+)?\\]";

- In MarkupParser.java:

    public static final String PUNCTUATION_CHARS_ALLOWED = " ()&+,-=._$";

!!! This is NOT FULLY IN LINE WITH PageRenamer.java, which allows
" ()&+,-=._$/?;@:%#<>'*" (space is in \s and _ is in \w).
!!! WHY DO WE FORBID OTHER CHARACTERS THAN "|", ":" (":<space>" should
be allowed) and "]"?
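If there is no good reason, one possible alignment (a sketch of mine,
not tested against the renamer) is to forbid only the markup-significant
characters instead of maintaining a whitelist:

    // Sketch: inside [...], accept anything except the delimiters
    // themselves.
    private static final String LONG_LINK_PATTERN =
        "\\[([^\\[\\]|]+\\|)?([^\\[\\]|]+)?\\]";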
I still notice some problems with renaming and with forms where UTF-8 is
decoded as ISO 8859-1 (a small demonstration of this mis-decoding
follows below).
- In PageRenamer.java, I have problems with renaming "long" name
references: "Null link while trying to rename! Culprit text is ..."
in com.ecyrd.jspwiki.PageRenamer.replaceLongLinks(), line 330.
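For reference, this is what "UTF-8 decoded as ISO 8859-1" looks like;
the damage is still reversible as long as the mis-decoded string has not
been re-encoded or truncated (standalone illustration, my own code):

    public class Mojibake {
        public static void main( String[] args ) throws Exception {
            byte[] utf8  = "La Clé".getBytes( "UTF-8" );
            String wrong = new String( utf8, "ISO-8859-1" );  // "La ClÃ©"

            // Undo the mis-decode by re-reading the same bytes as UTF-8:
            String fixed = new String( wrong.getBytes( "ISO-8859-1" ),
                                       "UTF-8" );
            System.out.println( wrong + " -> " + fixed );
        }
    }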
That is all for today!
THANKS FOR ALL!
Christophe