On 18 Sep 2013, at 17:10, Rainer Gerhards <[email protected]> wrote:

> On Wed, Sep 18, 2013 at 5:05 PM, Axel Rau <[email protected]> wrote:
> 
>> 
>> On 18 Sep 2013, at 12:32, Risto Vaarandi <[email protected]> wrote:
>> 
>>> hi folks,
>>> 
>>> I've been using the omelasticsearch output module for quite some time,
>>> and I am happy with it. However, there is one issue I haven't been able to
>>> tackle. Since I am writing data to Elasticsearch from a wide variety of
>>> sources, I occasionally run into syslog messages which contain some
>>> ISO-8859 characters. Unfortunately, when trying to write them into
>>> Elasticsearch as-is, you get back the following error:
>>> 
>>> org.elasticsearch.index.mapper.MapperParsingException: failed to parse [@message]
>>> ...
>>> ...
>>> ...
>>> Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Invalid UTF-8 start byte 0x99
>>
>> I have the same problem when writing UTF-8 encoded message text to
>> PostgreSQL, which refuses invalid UTF-8 sequences.
>> I have several event sources producing UTF-8 text. Occasionally, things
>> like encoding errors in e-mail headers produce syslog events with invalid
>> UTF-8 sequences, leading to transactions being rolled back (which is
>> annoying, especially with a reliable queuing setup).
>> Instead of fixing various programs or rsyslog input and output modules,
>> we should have one central place to (optionally) filter or correct
>> illegal UTF-8 sequences.
>> 
> 
> Ack .. but to do it right, I think we need to know which encoding was used
> in the first place. Well, ok, to get started something that simply
> "fixes"/discards invalid UTF-8 may work decently enough and in any case
> better than what we currently have ;)
Ack to both.
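For illustration, the simple "fix or discard" variant could look roughly like
the sketch below. It is written in Python only for brevity and is not the
experimental C code; the function name utf8fix and its replace parameter are
hypothetical, not an existing rsyslog function.

```python
def utf8fix(raw: bytes, replace: bool = False) -> str:
    """Return text that is guaranteed to be valid UTF-8 when re-encoded.

    Hypothetical helper: invalid byte sequences are either dropped
    (default) or replaced with U+FFFD, the Unicode replacement character.
    """
    policy = "replace" if replace else "ignore"
    return raw.decode("utf-8", errors=policy)


# 0x99 can never start a UTF-8 sequence, so it is the offending byte here.
broken = b"price 10\x99 EUR"
print(utf8fix(broken))                # invalid byte dropped
print(utf8fix(broken, replace=True))  # invalid byte shown as U+FFFD
```

Dropping keeps the output compact, while replacing leaves a visible trace
that something was lost; which one is preferable probably depends on the
destination.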
> 
> 
>> 
>> Axel
>> PS: I have some experimental code handy, which should do the job.
>> 
> 
> Any place to look at it? ;)
https://www.chaos1.de/downloads/utf-fix-2.c
> I think a script function (like utf8fix())
Indeed. (-;
> would probably be a good solution that is quick enough to implement. A very
> basic fix function would probably simply remove those invalid sequences,
> but if more elaborate fixing is possible, I am very interested in it
> (again, I think the ultimate solution must be a conversion based on a known
> charset).
Yes, but there are cases where it is not known.
Example: mail clients that produce headers in their local encoding without
using the correct escapes.
A correctly escaped header would be:
        Subject: =?iso-8859-1?Q?Oh_Du_wundersch=F6ner_M=E4rz?=
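To make the difference concrete, here is a small Python sketch using the
standard library's email.header.decode_header. The escaped header names its
charset, so it converts cleanly; the same subject sent as raw ISO-8859-1
bytes is not valid UTF-8, and the charset has to be guessed. The fallback to
iso-8859-1 is only an assumption for this example, not a general rule.

```python
from email.header import decode_header

escaped = "=?iso-8859-1?Q?Oh_Du_wundersch=F6ner_M=E4rz?="

# decode_header() yields (bytes, charset) pairs for encoded words.
parts = decode_header(escaped)
subject = "".join(
    chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
    for chunk, charset in parts
)
print(subject)

# The same subject without escapes: raw ISO-8859-1 bytes, charset unknown.
raw = "Oh Du wunderschöner März".encode("iso-8859-1")
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    # Assumption for this sketch only: guess ISO-8859-1 as the source charset.
    text = raw.decode("iso-8859-1")
print(text)
```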

> Let's keep the discussion flowing :-)
> 
> Rainer


Axel
---
PGP-Key:29E99DD6  ☀ +49 151 2300 9283  ☀ computing @ chaos claudius

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog