On Apr 28, 2005, at 1:40 PM, Dave Pawson wrote:
I think that means the 'end user' shouldn't have to escape his content? If so I agree. logger.info("One < two"); etc.
We are thinking alike then. Your first message had said "Not the applications concern" and "The application should be responsible"; I'm guessing that you dropped a NOT from the second sentence.
DP said.
I'd say (contra to some 'laws' :-) that log4j make no assumptions about
downstream usage.
produce (and declare) utf-8 encoding. No more. it's impossible to
predict what people will do with a file?
There are two distinct issues here that I jumbled together.
If the current XMLLayout is used and the encoding of the appender is not set to UTF-8 or one of the UTF-16's, the resulting file may be non-wellformed or lose information.
Yes, KISS Principle perhaps? utf-8 or 16 as the two setup options?
The encoding is not an attribute of the layout but of WriterAppender, which FileAppender and others extend. The current XMLLayout is unsafe when attached to a WriterAppender whose encoding is not "UTF-8" or one of the "UTF-16" variants; however, there is no way for the layout to determine or change the encoding of the writer. The default encoding for the existing WriterAppender-derived classes should not be changed, since existing uses with other layouts (PatternLayout, for example) expect the output to be in the platform default encoding.
For example, the default encoding on Windows platforms is the current Windows code page (for example, Cp1252 for Western European languages). Unless the user explicitly overrides the default encoding on a file appender, the generated XML file will become corrupt if any non-ASCII character is output.
Which is why the output needs to be Unicode based?
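To make the corruption described above concrete, here is a minimal sketch using only the JDK. It is not log4j code: the class name EncodingMismatchDemo and the sample event text are made up for illustration, and ISO-8859-1 stands in for Cp1252 (both encode U+00E9 as the single byte 0xE9). It assumes the reader treats the file as UTF-8, which is what an XML parser assumes when no other encoding is declared.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of log4j.
final class EncodingMismatchDemo {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Made-up event text containing one character above U+007F.
        String event = "<message>caf\u00e9</message>";

        // What a writer using a single-byte platform encoding would emit.
        byte[] platformBytes = event.getBytes(StandardCharsets.ISO_8859_1);
        // What a writer explicitly set to UTF-8 would emit.
        byte[] utf8Bytes = event.getBytes(StandardCharsets.UTF_8);

        System.out.println("single-byte output readable as UTF-8: " + isValidUtf8(platformBytes));
        System.out.println("UTF-8 output readable as UTF-8: " + isValidUtf8(utf8Bytes));
    }
}
```

The single-byte output fails the check because 0xE9 is a UTF-8 lead byte that must be followed by continuation bytes, and the next byte in the file is an ordinary '<'.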
The scenarios that I described can be avoided if the person building the configuration is aware that they need to specify a Unicode encoding on the appender when they use an XMLLayout. We could make the JavaDoc for XMLLayout much more emphatic that this needs to be done; however, you would get nowhere close to 100% adherence to that recommendation. The problem can almost entirely be avoided by using character entities for non-US-ASCII characters, and the cost would be negligible unless you were primarily logging in non-European languages, in which case the log files would be larger than necessary. If that is really a concern, the use of entities could be made configurable.
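A rough sketch of what that escaping could look like, using only the JDK. This is not the XMLLayout implementation: the class NcrEscaper and its escape method are hypothetical names, and the sketch assumes the text is written as ordinary character data (character references are not recognized inside CDATA sections). It does not deal with unpaired surrogates or with control characters that are not legal in XML.

```java
import java.util.Locale;

// Hypothetical helper, not part of log4j.
final class NcrEscaper {

    /**
     * Escapes XML markup characters and writes every code point at or above
     * U+0080 as a hexadecimal numeric character reference, so the result
     * contains only US-ASCII characters.
     */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // joins a surrogate pair into one code point
            i += Character.charCount(cp);      // advance by one or two chars
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp >= 0x80) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            } else {
                out.append((char) cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("One < two"));
        System.out.println(escape("caf\u00e9 \u2022"));
    }
}
```

With something along these lines, the caller still writes logger.info("One < two") with no escaping of its own, and the bytes that reach the file are the same under any ASCII-compatible encoding.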
Are you talking about the XML serialization here?
log4j does not attempt to prevent or detect a mismatch between the encoding required by the XMLLayout and the encoding in use by the Writer.
Hopefully the other responses have clarified the issue for you. You could call it "XML serialization" in the sense that the combination of the WriterAppender and XMLLayout essentially converts an XML document into a stream of bytes; however, it does not involve a general-purpose XML serializer.
If the XMLLayout represented all
characters >= \u0080 as character entities, the catastrophic effects of
a mismatch would be reduced.
It would (or could) still be a guess, if the user uses funny keyboard shortcuts, explicit European keys etc to produce the messages. If the mandate is that any text passed to an XMLLayout output must be in utf-8 then the XML path becomes clear, and no guesses are needed. If the user messes up, they can expect funny glyphs in their output. I think that's a subset of the expectation of a majority of XML applications.
XMLLayout only works in terms of Java Strings and chars, which are defined as UTF-16 code units. The problems come when the string is converted to a byte stream or vice versa. The UI will take care of converting the keystrokes into characters; from log4j's perspective it doesn't matter whether a character was generated by someone pressing the "A" key or by an 'A' or a '\u0041' in the source file. In the same way, an XML processor is required to treat <foo>A</foo> and <foo>&#x41;</foo> identically.
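The equivalence can be checked with the XML parser that ships with the JDK. The following is only an illustration: the class CharRefEquivalence and its textOf helper are made-up names, and the character reference is written in hexadecimal form.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class, not part of log4j.
final class CharRefEquivalence {

    /** Parses a small XML document and returns the text of its root element. */
    static String textOf(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // The literal character and its numeric character reference
        // reach the application as the same text.
        String literal = textOf("<foo>A</foo>");
        String reference = textOf("<foo>&#x41;</foo>");
        System.out.println(literal.equals(reference));

        // The Java compiler likewise sees the same string either way.
        System.out.println("A".equals("\u0041"));
    }
}
```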
My feeling is that most configurations that use XMLLayout are vulnerable to this problem but are either running on platforms where the default encoding is UTF-8 or have not encountered messages containing characters between \u0080 and \u00FF.
I.e. it's not really predictable? Hence make it clear, hence predictable and hence a solid chain can be designed, input to output, based on utf-8|16
Can't do that. Existing users of the other layouts depend on the encoding being in the platform default encoding.
Since the character and the corresponding character entity are required
to be treated identically by an XML processor, using a character entity
seems to have no downside other than some increase in file size and
maybe some impact on performance.
Who would you expect to do the conversion from bullet to &#x2022;? The author, or the XMLLayout code? That's where I think it becomes too cumbersome?
It would be inside of XMLLayout.
The second issue is that users may want to attempt to view an XML log with a non-XML aware tool (Notepad, tail, etc) which could assume the current platform encoding instead of UTF-8.
That's an assumption, and the source of lots of trouble IMHO. If a user selects XML output, they should be able to read the encoding statement in the declaration and understand what it means.
If non-USASCII characters were expressed as character entities, these tools could still be used without potentially misinterpreting or corrupting the data.
No XML application is required to output NCRs, although they may if serializing in an encoding which doesn't support that character. Hence my simpler proposal above.
As said previously, the XMLLayout cannot easily affect the encoding used by the writer that it is attached to.
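On the point about non-XML-aware viewers, a small JDK-only sketch of why US-ASCII output is the safe common ground. The class ViewerGuessDemo and its method are hypothetical, and ISO-8859-1 stands in for whatever single-byte code page the viewing tool happens to assume.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of log4j.
final class ViewerGuessDemo {

    /**
     * Returns true if text written as UTF-8 still displays unchanged
     * in a tool that decodes the file as a single-byte encoding.
     */
    static boolean survivesWrongGuess(String text) {
        byte[] written = text.getBytes(StandardCharsets.UTF_8);            // what the appender wrote
        String viewed = new String(written, StandardCharsets.ISO_8859_1);  // what the tool displays
        return text.equals(viewed);
    }

    public static void main(String[] args) {
        // Text that was escaped to character references is pure ASCII.
        System.out.println(survivesWrongGuess("caf&#xE9;"));
        // Raw non-ASCII text turns into two unrelated characters per accent.
        System.out.println(survivesWrongGuess("caf\u00e9"));
    }
}
```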
Just in the last few hours there was a bug request filed to add a header/footer to the rolling file appender which would be very similar to that request.
If it's doable, the xml declaration, encoding, and document element are a valuable addition IMHO.
regards DaveP
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
