Hi,
Ceki's suggestion sounds good.
Just a comment re the XML side of things. The bulk of the bloat with XML is
often the tag names, attribute names and namespace qualifiers. The beauty of
XML is that it doesn't matter what these are; what is important is their
relationship to each other. Long meaningful names are most efficient for
humans. Short meaningless names are most efficient for computers.
For computer-to-computer communication, a substantial reduction in data stream
size can be achieved by using 'tag substitution' (or tag encoding). A tag
conversion table is created that maps the human-readable tag and/or attribute
names to very short machine-readable names.
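As a rough illustration of tag substitution, here is a minimal sketch in Python (chosen only for brevity; log4j itself is Java). The names `TABLE`, `substitute` and the regex approach are assumptions made for this example, not part of log4j. The table entries are the ones from the conversion table listed at the end of this message. The sketch assumes double-quoted attribute values that contain no `<` or `>`, and it leaves CDATA sections untouched.

```python
import re

# Long name -> short name, taken from the conversion table in this message.
TABLE = {
    "log4j:event": "e",
    "log4j:message": "m",
    "category": "c",
    "timestamp": "d",
    "priority": "p",
    "thread": "t",
}

# Hypothetical helper patterns for this sketch only.
CDATA = re.compile(r"(<!\[CDATA\[.*?\]\]>)", re.DOTALL)
TAG = re.compile(r"<(/?)([\w:.-]+)([^<>]*)>")
ATTR = re.compile(r"([\w:.-]+)(\s*=\s*\"[^\"]*\")")


def _rename_tag(match, table):
    """Rename one start or end tag and the attribute names inside it."""
    slash, name, rest = match.groups()
    rest = ATTR.sub(
        lambda a: table.get(a.group(1), a.group(1)) + a.group(2), rest
    )
    return "<" + slash + table.get(name, name) + rest + ">"


def substitute(xml_text, table):
    """Replace element and attribute names according to `table`."""
    # Splitting on a capturing group keeps the CDATA sections at odd indexes,
    # so only the even-indexed parts (markup outside CDATA) are rewritten.
    parts = CDATA.split(xml_text)
    for i in range(0, len(parts), 2):
        parts[i] = TAG.sub(lambda m: _rename_tag(m, table), parts[i])
    return "".join(parts)


# The reverse mapping expands short names back to the readable ones.
INVERSE = {short: long_name for long_name, short in TABLE.items()}
```

Calling `substitute(text, TABLE)` shortens the names and `substitute(text, INVERSE)` restores them, so the same routine could serve either end of the connection.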
This does require sending the conversion table one time when establishing the
connection, but for large data streams the savings can be significant. It is
also not much different from sending a DTD so that the sender can validate the
XML files to be sent.
Note that only the server needs to perform the conversion since the client can
simply output the abbreviations directly.
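To illustrate the point about the client writing abbreviations directly, here is a small Python sketch (for illustration only; log4j is Java). The function `format_event` and its parameters are hypothetical. It emits the short names e, m, c, d, p and t from the conversion table in this message, so no lookup is needed on the sending side.

```python
from xml.sax.saxutils import quoteattr


def format_event(category, timestamp_ms, priority, thread, message):
    """Render one log event using the short names directly."""
    # quoteattr escapes the value and wraps it in quotes.
    attrs = " ".join(
        f"{name}={quoteattr(str(value))}"
        for name, value in (
            ("c", category),
            ("d", timestamp_ms),
            ("p", priority),
            ("t", thread),
        )
    )
    # "]]>" cannot appear inside a CDATA section, so split it across two.
    body = message.replace("]]>", "]]]]><![CDATA[>")
    return f"<e {attrs}>\n<m><![CDATA[{body}]]></m>\n</e>"
```

Only the receiving side would then need the table, to expand the names again if a human-readable form is wanted.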
Properly designed, it should be possible to send the conversion table as an
XSLT file that can then be used to perform the name conversion automatically.
The simple example below, with only two messages, shows a reduction from 327
bytes to 185 bytes, or 43%. That is a savings of 142 bytes, which is already
greater than the 60 or so bytes needed for the conversion table itself.
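The arithmetic behind those figures can be checked with a few lines of Python. The function name `savings` is made up for this sketch, and the 60-byte table size is the approximate figure quoted above, not a measured value.

```python
def savings(original_bytes, encoded_bytes, table_bytes):
    """Return (bytes saved, percent saved, net saving after the table)."""
    saved = original_bytes - encoded_bytes
    percent = 100 * saved / original_bytes
    net = saved - table_bytes
    return saved, percent, net


# Figures from this message: 327 bytes of XML, 185 bytes abbreviated,
# and roughly 60 bytes for the one-time conversion table.
saved, percent, net = savings(327, 185, 60)
```

With these inputs `saved` is 142, `percent` rounds to 43, and `net` is 82, so the table has paid for itself after the first two messages.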
// 120 bytes - standard log output
2001-06-04 13:38:28,664 WARN [main] XMLSample - Message 1
2001-06-04 13:38:28,664 ERROR [main] XMLSample - Message 2
// 327 bytes - xml log output
<log4j:event category="XMLSample" timestamp="991418283544" priority="WARN" thread="main">
<log4j:message><![CDATA[Message 1]]></log4j:message>
</log4j:event>
<log4j:event category="XMLSample" timestamp="991418283554" priority="ERROR" thread="main">
<log4j:message><![CDATA[Message 2]]></log4j:message>
</log4j:event>
// 239 bytes - xml with abbreviated element names
<e category="XMLSample" timestamp="991418283544" priority="WARN" thread="main">
<m><![CDATA[Message 1]]></m>
</e>
<e category="XMLSample" timestamp="991418283554" priority="ERROR" thread="main">
<m><![CDATA[Message 2]]></m>
</e>
// 185 bytes - xml with abbreviated element and attribute names
<e c="XMLSample" d="991418283544" p="WARN" t="main">
<m><![CDATA[Message 1]]></m>
</e>
<e c="XMLSample" d="991418283554" p="ERROR" t="main">
<m><![CDATA[Message 2]]></m>
</e>
// xml name conversion table
c - category
d - timestamp
e - log4j:event
m - log4j:message
p - priority
t - thread
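A receiver could load a table in the `short - long` layout shown above with something like the following Python sketch. The function name `parse_table` and the choice to reject conflicting entries are assumptions for illustration; the message does not specify a wire format for the table.

```python
def parse_table(text):
    """Parse 'short - long' lines into a short -> long name mapping."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        short, sep, long_name = line.partition(" - ")
        if not sep:
            raise ValueError(f"malformed table line: {line!r}")
        short, long_name = short.strip(), long_name.strip()
        # A short name must not map to two different long names.
        if table.get(short, long_name) != long_name:
            raise ValueError(f"conflicting entries for {short!r}")
        table[short] = long_name
    return table


TABLE_TEXT = """\
c - category
d - timestamp
e - log4j:event
m - log4j:message
p - priority
t - thread
"""

SHORT_TO_LONG = parse_table(TABLE_TEXT)
```

Repeated identical lines are accepted, while two different long names for the same short name raise an error, since that would make the expansion ambiguous.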
Since Log4j uses a very small number of 'names', this approach might be worth
looking into.
Just a thought.
Rick