On Wed, 2005-04-27 at 15:14 -0500, Curt Arnold wrote:
> On Apr 27, 2005, at 1:22 PM, Dave Pawson wrote:
>
> > On Wed, 2005-04-27 at 13:03 -0500, Curt Arnold wrote:
> >> I'm not fond of the CDATA sections either. Since the XMLLayout is
> >> not
> >> aware of the encoding of the writer, it does not know when to create
> >> character entities.
> > Not the application's concern is my response.
> > The application should be responsible for escaping those characters
> > that XML has declared it doesn't like (< and &....
> > OK and [[ => [Z
> > Anything else the author should transpose into numerical character
> > entities.
> >
>
> The first two sentences appear to contradict each other.  The body of
> code calling logger.info() et al should not escape its messages so they
> are safe for a naive XML serialization.
I think that means the 'end user' shouldn't have to escape his content?
If so I agree.
logger.info("One < two"); etc.
> For example, the message may
> be routed to both a ConsoleAppender with a non-XML layout where a
> literal '<' would be appropriate and FileAppender with an XML layout
> where &lt; would be appropriate.
I'd suggest that it is the XMLLayout that escapes the three characters
needing to be escaped (<, > and &). That avoids the CDATA sections.
I hope that isn't contradictory?
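Roughly what I have in mind, as a minimal sketch only (not the real
XMLLayout code; the class and method names are made up for illustration):

    // Sketch: the layout, not the caller, replaces the markup-significant
    // characters, so no CDATA section is needed.
    public final class XmlEscape {
        static String escapeXml(String message) {
            StringBuffer buf = new StringBuffer(message.length());
            for (int i = 0; i < message.length(); i++) {
                char c = message.charAt(i);
                switch (c) {
                case '&': buf.append("&amp;"); break;
                case '<': buf.append("&lt;");  break;
                case '>': buf.append("&gt;");  break;
                default:  buf.append(c);
                }
            }
            return buf.toString();
        }
    }

That way logger.info("One < two") stays untouched at the call site, comes
out as "One &lt; two" in the XML file, and the console appender can still
print the literal '<'.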
>
DP said:
> > I'd say (contra to some 'laws' :-) that log4j should make no assumptions
> > about downstream usage.
> > Produce (and declare) UTF-8 encoding. No more. It's impossible to
> > predict what people will do with a file?
>
> There are two distinct issues here that I jumbled together.
>
> If the current XMLLayout is used and the encoding of the appender is
> not set to UTF-8 or one of the UTF-16's, the resulting file may be
> non-wellformed or lose information.
Yes, the KISS principle perhaps? UTF-8 or UTF-16 as the two setup options?
> For example, the default encoding
> on Windows platforms is the current Windows code page (for example,
> Cp1252 for Western European languages). Unless the user explicitly
> overrides the default encoding on a file appender, the generated XML
> file will become corrupt if any non-ASCII character is output.
Which is why the output needs to be Unicode based?
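For instance (just a sketch; the file name is made up, and programmatic
setup is only one route), explicitly forcing the appender onto UTF-8 would
sidestep the Cp1252 default:

    import org.apache.log4j.FileAppender;
    import org.apache.log4j.Logger;
    import org.apache.log4j.xml.XMLLayout;

    public class Utf8LogSetup {
        public static void main(String[] args) {
            FileAppender appender = new FileAppender();
            appender.setFile("app-log.xml");      // hypothetical file name
            appender.setLayout(new XMLLayout());
            appender.setEncoding("UTF-8");        // match what the XML consumer expects
            appender.activateOptions();           // (re)opens the file with that encoding
            Logger.getRootLogger().addAppender(appender);
            Logger.getRootLogger().info("café");  // written as UTF-8 bytes, not Cp1252
        }
    }

The same should be possible through the properties or XML configuration via
the appender's encoding parameter, if I remember the option name correctly.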
> log4j does not attempt to prevent or detect a
> mismatch between the encoding required by the XMLLayout and the
> encoding in use by the Writer.
Are you talking about the XML serialization here?
> If the XMLLayout represented all
> characters >= \u0080 as character entities, the catastrophic effects of
> a mismatch would be reduced.
It would (or could) still be a guess, if the user uses funny keyboard
shortcuts, explicit European keys etc. to produce the messages.
If the mandate is that any text passed to an XMLLayout output must
be in UTF-8, then the XML path becomes clear and no guesses are needed.
If the user messes up, they can expect funny glyphs in their output.
I think that's a subset of the expectation of a majority of XML
applications.
> My feeling is that most configurations
> that use XMLLayout are vulnerable to this problem but are either
> running on platforms where the default encoding is UTF-8 or have not
> encountered messages containing characters between \u0080 and \u00FF.
I.e. it's not really predictable? Hence make it clear, hence
predictable, and hence a solid chain can be designed, input to output,
based on UTF-8|16.
> Since the character and the corresponding character entity are required
> to be treated identically by an XML processor, using a character entity
> seems to have no downside other than some increase in file size and
> maybe some impact on performance.
Who would you expect to do the conversion from a literal bullet to &#8226;?
The author, or the XMLLayout code?
That's where I think it becomes too cumbersome?
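If the layout did it, it might be no more than something like this (a sketch
only, not a proposal for the actual XMLLayout source; surrogate pairs are
ignored for brevity, and the class/method names are hypothetical):

    // Hypothetical helper, not part of log4j: turn any character >= \u0080
    // into a numeric character reference, so a bullet typed by the author
    // comes out as &#8226; whatever the Writer's encoding happens to be.
    public final class EntityEscape {
        static String toCharacterEntities(String message) {
            StringBuffer buf = new StringBuffer(message.length());
            for (int i = 0; i < message.length(); i++) {
                char c = message.charAt(i);
                if (c >= 0x80) {
                    buf.append("&#").append((int) c).append(';');
                } else {
                    buf.append(c);
                }
            }
            return buf.toString();
        }
    }

But that's exactly the extra machinery I'd rather avoid by mandating UTF-8
output in the first place.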
>
> The second issue is that users may want to attempt to view an XML log
> with a non-XML aware tool (Notepad, tail, etc) which could assume the
> current platform encoding instead of UTF-8.
That's an assumption, and the source of lots of trouble IMHO.
If a user selects XML output, they should be able to read the encoding
statement in the declaration and understand what it means.
> If non-USASCII characters
> were expressed as character entities, these tools could still be used
> without potentially misinterpreting or corrupting the data.
> No XML application is required to output NCRs, although one may
> if serializing in an encoding which doesn't support that character.
Hence my simpler proposal above.
>
> Just in the last few hours there was a bug request filed to add a
> header/footer to the rolling file appender which would be very similar
> to that request.
If it's doable, the XML declaration, encoding, and document element
are a valuable addition IMHO.
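Something along these lines would do it, I imagine (a rough sketch; it leans
on the getHeader()/getFooter() hooks that log4j Layouts already have, but the
element name and namespace URI here are only illustrative):

    import org.apache.log4j.xml.XMLLayout;

    // Sketch of an XML-producing layout that supplies the declaration and a
    // wrapping document element via the existing header/footer hooks.
    public class WrappedXMLLayout extends XMLLayout {
        public String getHeader() {
            return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                 + "<log4j:eventSet xmlns:log4j=\"http://jakarta.apache.org/log4j/\">\n";
        }
        public String getFooter() {
            return "</log4j:eventSet>\n";
        }
    }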
regards DaveP