On Apr 28, 2005, at 1:40 PM, Dave Pawson wrote:


I think that means the 'end user' shouldn't have to escape his content? If so I agree. logger.info("One < two"); etc.

We are thinking alike then. Your first message said "Not the applications concern" and "The application should be responsible"; I'm guessing that you dropped a NOT from the second sentence.


DP said.
I'd say (contra to some 'laws' :-) that log4j make no assumptions about
downstream usage.
produce (and declare) utf-8 encoding. No more. it's impossible to
predict what people will do with a file?

There are two distinct issues here that I jumbled together.

If the current XMLLayout is used and the encoding of the appender is
not set to UTF-8 or one of the UTF-16's, the resulting file may be
non-wellformed or lose information.
Yes, KISS Principle perhaps? utf-8 or 16 as the two setup options?


The encoding is not an attribute of the layout but of the WriterAppender, which FileAppender and others extend. The current XMLLayout is unsafe when attached to a WriterAppender whose encoding is not "UTF-8" or one of the "UTF-16"s; however, there is no way for the layout to determine or change the encoding of the writer. The default encoding for the existing WriterAppender-derived classes should not be changed, since existing uses with other layouts (PatternLayout) expect the output to be in the platform default encoding.
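
To make the two failure modes concrete, here is a minimal sketch using only the JDK (no log4j classes). The class and method names are hypothetical, and ISO-8859-1 is used as a stand-in for a single-byte platform default such as Cp1252. It shows a byte sequence that a UTF-8 reader would reject, and a character that is silently replaced when the writer's encoding cannot represent it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of log4j.
public class EncodingMismatchDemo {

    /** Returns true if the bytes form a valid UTF-8 sequence (strict decoding). */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Simulates a writer whose encoding is a single-byte charset. */
    public static byte[] writeAsLatin1(String message) {
        // Unmappable characters are replaced with '?' by String.getBytes.
        return message.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Case 1: e-acute becomes the single byte 0xE9, which is malformed
        // when the file is later read as UTF-8.
        byte[] accented = writeAsLatin1("caf\u00E9 au lait");
        System.out.println("valid UTF-8: " + isValidUtf8(accented));

        // Case 2: a bullet (U+2022) has no ISO-8859-1 mapping, so the
        // original character is lost before it reaches the file.
        byte[] bullet = writeAsLatin1("\u2022");
        System.out.println("bullet written as byte: 0x"
                + Integer.toHexString(bullet[0] & 0xFF));
    }
}
```

The first case corresponds to a non-wellformed file, the second to lost information.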



 For example, the default encoding
on Windows platforms is the current Windows code page (for example,
Cp1252 for Western European languages).  Unless the user explicitly
overrides the default encoding on a file appender, the generated XML
file will become corrupt if any non-ASCII character is output.
Which is why the output needs to be Unicode based?


The scenarios that I described can be avoided if the person building the configuration is aware that they need to specify a Unicode encoding on the appender when they use an XMLLayout. We could make the JavaDoc for XMLLayout much more emphatic that this needs to be done; however, we would get nowhere close to 100% adherence to that recommendation. The problem can almost entirely be avoided by using character entities for non-US-ASCII characters, and the cost would be negligible unless you were primarily logging in non-European languages, in which case the log files would be larger than necessary. If that is really a concern, the use of entities could be made configurable.
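
As a rough illustration of the escaping described above, here is a sketch of what such a step could look like. The class and method names are hypothetical and this is not the existing XMLLayout code. It assumes hexadecimal numeric character references are used for every code point at or above U+0080, and it does not attempt to filter characters that are illegal in XML (most control characters, unpaired surrogates).

```java
import java.util.Locale;

// Hypothetical helper; a sketch of the idea, not log4j's implementation.
public class NcrEscaper {

    /** Escapes XML markup characters and writes non-ASCII as &#xHHHH; references. */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            // Walk by code point so surrogate pairs become a single reference.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:
                    if (cp >= 0x80) {
                        out.append("&#x")
                           .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                           .append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: One &lt; two &#x2022; caf&#xE9;
        System.out.println(escape("One < two \u2022 caf\u00E9"));
    }
}
```

The output contains only US-ASCII bytes in any ASCII-compatible encoding, so it survives whatever encoding the appender's writer happens to use.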



  log4j does not attempt to prevent or detect a
mismatch between the encoding required by the XMLLayout and the
encoding in use by the Writer.
Are you talking about the XML serialization here?

Hopefully the other responses have clarified the issue for you. You could call it "XML serialization", since the combination of the WriterAppender and XMLLayout essentially converts an XML document into a stream of bytes; however, it does not involve a general-purpose XML serializer.





If the XMLLayout represented all
characters >= \u0080 as character entities, the catastrophic effects of
a mismatch would be reduced.
It would (or could) still be a guess, if the user uses funny keyboard
shortcuts, explicit European keys etc to produce the messages.
If the mandate is that any text passed to an XMLLayout output must
be in utf-8 then the XML path becomes clear, and no guesses are needed.
If the user messes up, they can expect funny glyphs in their output.
  I think that's a subset of the expectation of a majority of XML
applications.

XMLLayout only works in terms of Java Strings and chars, which are defined as UTF-16 code units. The problems come when the string is converted to a byte stream or vice versa. The UI will take care of converting the keystrokes into characters; from log4j's perspective it doesn't matter whether a character was generated by someone pressing the "A" key or by an 'A' or a '\u0041' in the source file. In the same way, an XML processor is required to treat <foo>A</foo> and <foo>&#x41;</foo> identically.
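
A small check of that equivalence, using the JDK's built-in DOM parser. The class and method names are hypothetical; the parser calls are standard JAXP.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class.
public class CharRefEquivalence {

    /** Parses a small XML document and returns the text of its root element. */
    public static String textOf(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // The compiler sees the same char for both spellings.
        System.out.println('A' == '\u0041');

        // The parser hands the application the same text for both spellings.
        String literal = textOf("<foo>A</foo>");
        String reference = textOf("<foo>&#x41;</foo>");
        System.out.println(literal.equals(reference));
    }
}
```

Both lines print true: once the text is a Java String, how the character was originally spelled is no longer visible.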





  My feeling is that most configurations
that use XMLLayout are vulnerable to this problem but are either
running on platforms where the default encoding is UTF-8 or have not
encountered messages containing characters between \u0080 and \u00FF.
  I.e. it's not really predictable? Hence make it clear, hence
predictable and hence a solid chain can be designed, input to output,
based on utf-8|16

Can't do that. Existing users of the other layouts depend on the encoding being in the platform default encoding.




Since the character and the corresponding character entity are required
to be treated identically by an XML processor, using a character entity
seems to have no downside other than some increase in file size and
maybe some impact on performance.
Who would you expect to do the conversion from bullet to &#x2022;?
The author, or the XMLLayout code?
  That's where I think it becomes too cumbersome?

It would be inside of XMLLayout.


The second issue is that users may want to view an XML log with a non-XML-aware tool (Notepad, tail, etc.) which could assume the current platform encoding instead of UTF-8.
  That's an assumption, and the source of lots of trouble IMHO.
If a user selects XML output, they should be able to read the encoding
statement in the declaration and understand what it means.


 If non-USASCII characters
were expressed as character entities, these tools could still be used
without potentially misinterpreting or corrupting the data.
  No XML application is required to output ncr's although they may
if serializing in an encoding which doesn't support that character.
Hence my simpler proposal above.

As said previously, the XMLLayout cannot easily affect the encoding used by the writer that it is attached to.
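
To illustrate the earlier point about non-XML-aware tools, here is a minimal sketch (hypothetical class and method names, JDK only) of a viewer that reads UTF-8 bytes with a different charset. It assumes the viewer's charset is ASCII-compatible, which holds for Cp1252 and ISO-8859-1 but not for the UTF-16 family.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
public class ViewerEncodingDemo {

    /** Simulates a tool that reads UTF-8 bytes using some other charset. */
    public static String viewAs(String written, Charset viewerCharset) {
        byte[] onDisk = written.getBytes(StandardCharsets.UTF_8);
        return new String(onDisk, viewerCharset);
    }

    public static void main(String[] args) {
        String raw = "caf\u00E9";      // e-acute written directly
        String escaped = "caf&#xE9;";  // same text with a character reference

        // The raw form is garbled when viewed as ISO-8859-1 ...
        System.out.println(viewAs(raw, StandardCharsets.ISO_8859_1).equals(raw));
        // ... while the escaped form is pure ASCII and reads back unchanged.
        System.out.println(viewAs(escaped, StandardCharsets.ISO_8859_1).equals(escaped));
    }
}
```

The first comparison is false and the second is true. The escaped form is less pleasant to read, but it is not misinterpreted.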



Just in the last few hours there was a bug request filed to add a header/footer to the rolling file appender which would be very similar to that request.

If it's doable, the xml declaration, encoding, and document element are a valuable addition IMHO.

regards DaveP



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


