On 18.09.2013 at 17:10, Rainer Gerhards <[email protected]> wrote:

> On Wed, Sep 18, 2013 at 5:05 PM, Axel Rau <[email protected]> wrote:
>
>> On 18.09.2013 at 12:32, Risto Vaarandi <[email protected]> wrote:
>>
>>> hi folks,
>>>
>>> I've been using the omelasticsearch output module for quite some time,
>>> and I am happy with it. However, there is one issue I haven't been able
>>> to tackle. Since I am writing data to Elasticsearch from a wide variety
>>> of sources, I am occasionally running into syslog messages which contain
>>> some iso8859 characters. Unfortunately, when trying to write them into
>>> Elasticsearch as-is, you get back the following error:
>>>
>>> org.elasticsearch.index.mapper.MapperParsingException: failed to parse [@message]
>>> ...
>>> ...
>>> ...
>>> Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Invalid UTF-8 start byte 0x99
>>
>> I have that same problem while writing UTF-8 encoded message text to
>> PostgreSQL, which refuses invalid UTF-8 sequences.
>> I have several event sources producing UTF-8 text. Occasionally things
>> like encoding errors in e-mail headers produce syslog events with wrong
>> UTF-8 sequences, leading to transactions being rolled back (which is
>> annoying, especially with a reliable queuing setup).
>> Instead of fixing various programs, input or output modules of rsyslog,
>> we should have one central place where we can (optionally) filter/correct
>> illegal UTF-8 sequences.
>
> Ack .. but to do it right, I think we need to know which encoding was used
> in the first place. Well, ok, to get started something that simply
> "fixes"/discards invalid UTF-8 may work decently enough and in any case
> better than what we currently have ;)

Ack to both.

>> Axel
>> PS: I have some experimental code handy, which should do the job.
>
> Any place to look at it? ;)

https://www.chaos1.de/downloads/utf-fix-2.c

> I think a script function (like utf8fix())

Indeed. (-;

> would probably be a good solution that is fast enough to implement. A very
> basic fix function would probably simply remove those invalid sequences,
> but if more elaborate fixing is possible, I am very interested in it
> (again, I think the ultimate solution must be a conversion based on a
> known charset).

Yes, but there are cases where it is not known.
Example: mail clients producing headers in a local encoding without using
the correct escapes. (A correct example would be:
Subject: =?iso-8859-1?Q?Oh_Du_wundersch=F6ner_M=E4rz?=
)

> Let's keep the discussion flowing :-)
>
> Rainer

Axel
---
PGP-Key:29E99DD6 ☀ +49 151 2300 9283 ☀ computing @ chaos claudius
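
To make the utf8fix() idea from this thread a bit more concrete, here is a minimal Python sketch. It is not the code from utf-fix-2.c and not an rsyslog function. The names utf8fix and to_utf8, and their parameters, are assumptions made up for this illustration. It covers the two cases discussed: converting from a known charset when one is available, and otherwise simply replacing or dropping the invalid bytes.

```python
from typing import Optional


def utf8fix(raw: bytes, mode: str = "replace") -> bytes:
    """Return raw as valid UTF-8 (hypothetical helper, not an rsyslog API).

    mode="replace": invalid bytes are replaced with U+FFFD.
    mode="ignore":  invalid bytes are dropped.
    """
    if mode not in ("replace", "ignore"):
        raise ValueError("mode must be 'replace' or 'ignore'")
    # Decoding with an error handler repairs the text; re-encoding
    # yields bytes that a strict UTF-8 consumer will accept.
    return raw.decode("utf-8", errors=mode).encode("utf-8")


def to_utf8(raw: bytes, charset: Optional[str] = None,
            mode: str = "replace") -> bytes:
    """Pass valid UTF-8 through; otherwise convert or fix (assumed policy)."""
    try:
        raw.decode("utf-8")
        return raw  # already valid, leave untouched
    except UnicodeDecodeError:
        pass
    if charset is not None:
        # Known source charset: a real conversion loses nothing.
        return raw.decode(charset).encode("utf-8")
    # Unknown charset: fall back to the basic fix.
    return utf8fix(raw, mode)
```

For example, to_utf8(b"M\xe4rz", "iso-8859-1") returns the UTF-8 encoding of "März", whereas utf8fix(b"M\xe4rz", "ignore") can only return b"Mrz", which is why a conversion from a known charset is preferable whenever the charset is available.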

