nicoloboschi opened a new pull request, #15431:
URL: https://github.com/apache/pulsar/pull/15431

   ### Motivation
   
   If the message value contains non-printable characters, the sink logs the 
following error:
   
   ```
   2022-04-22T22:37:29.673384094Z 22:37:29.668 [tenant/ns/topic] ERROR org.apache.pulsar.io.elasticsearch.ElasticSearchSink - Malformed document messageId=73895:0:1
   2022-04-22T22:37:29.673415795Z com.fasterxml.jackson.core.JsonParseException: Illegal character ((CTRL-CHAR, code 0)): only regular white space (\r, \n, \t) is allowed between tokens
   2022-04-22T22:37:29.673424695Z  at [Source: (String)"\u0000\u0000\u0002�\u000C\u0002\u001Epick_start_time\u0000\u0002\u00081211\u0002\u000ESafeway\u0002�����_\u0000\u0000\u0002\u00021\u0002\u0018445184429011"; line: 1, column: 2]
   ```
   
   Even if `malformedDocAction` is set to `IGNORE`, the message is 
redelivered. With KEY_SHARED subscriptions, this leads to a stuck 
subscription.
   
   The issue is that [the JSON format does not accept these characters 
unescaped](https://datatracker.ietf.org/doc/html/rfc8259#section-7). 
   
   >All Unicode characters may be placed within the
      quotation marks, except for the characters that MUST be escaped:
      quotation mark, reverse solidus, and the control characters (U+0000
      through U+001F).
   
   Since these characters are usually useless, it is better to drop them all 
instead of escaping them. Escaping is not simple because it depends on where 
the JSON is malformed: inside a key or a value the characters can be escaped, 
but between tokens they cannot.
   
   ### Modifications
   - New option `stripNonPrintableCharacters` (default `true`, so the default 
behaviour changes) that removes non-printable characters from the output JSON. 
It applies only to the document, not to the `_id`, because Elasticsearch does 
not care whether the `_id` is valid JSON. The stripping is done with a regular 
expression because, unfortunately, the Jackson mapper does not support this out 
of the box.
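
A minimal sketch of the regex-based stripping described above. The class name, 
the method name, and the `\p{Cntrl}` pattern (U+0000 through U+001F plus 
U+007F) are assumptions made for illustration; the pattern actually used by 
the sink may differ.

```java
import java.util.regex.Pattern;

class StripNonPrintable {
    // Assumption: "non-printable" is taken to mean the POSIX control class,
    // i.e. U+0000-U+001F and U+007F. The real sink may use another pattern.
    private static final Pattern NON_PRINTABLE = Pattern.compile("\\p{Cntrl}");

    // Hypothetical helper: removes every control character from the
    // serialized JSON document before it is sent to the index.
    static String strip(String json) {
        return NON_PRINTABLE.matcher(json).replaceAll("");
    }

    public static void main(String[] args) {
        // Control characters both between tokens and inside a value.
        String raw = "\u0000\u0002{\"store\":\"Safe\u0002way\"}";
        System.out.println(strip(raw)); // prints {"store":"Safeway"}
    }
}
```

Dropping the characters works the same way wherever they appear, which avoids 
having to decide whether a given position is inside a string or between tokens.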
   
   
   - [x] `doc` 
   

