nicoloboschi opened a new pull request, #18668:
URL: https://github.com/apache/pulsar/pull/18668

   ### Motivation
   #15428 introduced a new flag for hashing the document id. This is a good 
way to avoid hitting Elasticsearch's hard limit of 512 bytes for document ids. 
   
   In a production environment this flag is unlikely to be enabled without 
trouble. If the sink was working before without that flag, all the doc ids are 
formatted as JSON. With the same key, an update or deletion would be triggered 
on the ES side. If the user enables the hashing option, those ids won't match 
anymore and the sink will end up creating new records.
   
   In some cases this is not a viable option. Since the hashing algorithm is 
actually useful only for working around the ES limit, it makes sense to have an 
option to hash only if there would otherwise be key length errors on the ES 
side. 
   In this way, in a live cluster the hashing can be turned on with 100% 
compatibility with the existing documents.
   - Records whose key is <= 512 bytes will behave the same way as before
   - For records whose key is > 512 bytes, it's guaranteed the record is not 
present on Elasticsearch. With this new option enabled, the record will be 
inserted with a hashed (base64 format) key.
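
   The two cases above can be sketched as follows. This is a minimal 
illustration, not the sink's actual code: the class and method names are 
hypothetical, and the choice of SHA-256 with unpadded standard Base64 is an 
assumption about the encoding.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical sketch of conditional id hashing; not the connector's real class.
final class ConditionalIdHashingSketch {
    // Elasticsearch rejects document ids longer than 512 bytes.
    static final int MAX_ID_BYTES = 512;

    /** Returns the raw id when it fits the limit, otherwise a Base64 SHA-256 digest. */
    static String documentId(String rawId, boolean conditionalIdHashing) {
        byte[] raw = rawId.getBytes(StandardCharsets.UTF_8);
        if (conditionalIdHashing && raw.length <= MAX_ID_BYTES) {
            // Unchanged id, so documents written before hashing was enabled still match.
            return rawId;
        }
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(raw);
            // Assumption: unpadded standard Base64; the real encoding may differ.
            return Base64.getEncoder().withoutPadding().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is unavailable", e);
        }
    }

    public static void main(String[] args) {
        String shortId = "{\"id\":\"42\"}";
        String longId = "{\"id\":\"" + "x".repeat(600) + "\"}";
        System.out.println(documentId(shortId, true)); // raw JSON id is kept
        System.out.println(documentId(longId, true));  // hashed id
    }
}
```

   With `conditionalIdHashing` set to false in this sketch, every id is hashed 
regardless of its length, which matches the behaviour described for #15428.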
   
   A collision between a hashed key and a non-hashed key is not possible:
   - hashed keys are represented in base64 (no { } characters allowed)
   - non-hashed keys are JSON
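
   The argument above can be checked with a small sketch. It assumes that a 
non-hashed key is a JSON object (so it contains braces) and that hashed keys use 
the standard Base64 alphabet; `KeyFormatSketch` and `hasJsonBraces` are 
hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper showing why the two key formats cannot overlap.
final class KeyFormatSketch {
    /** True when the id contains a brace, which Base64 output never contains. */
    static boolean hasJsonBraces(String id) {
        return id.indexOf('{') >= 0 || id.indexOf('}') >= 0;
    }

    public static void main(String[] args) {
        String rawKey = "{\"id\":\"42\"}";
        // Stand-in for a digest: any byte array encodes to the same Base64 alphabet.
        String encoded = Base64.getEncoder()
                .encodeToString(rawKey.getBytes(StandardCharsets.UTF_8));
        System.out.println(hasJsonBraces(rawKey));
        System.out.println(hasJsonBraces(encoded));
    }
}
```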
   
   
   To sum up, if a live environment hits the issue, these are the suggested 
steps to get the sink working again and unblock the subscription without losing 
messages:
   1. Stop the sink
   2. Change the following config properties:
   - `canonicalKeyFields=true`
   - `idHashingAlgorithm=SHA256`
   - `conditionalIdHashing=true`
   Note that this change may alter the key format and therefore create new 
records, but it is required for the hashing option to work correctly.
   3. Restart the sink
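
   For reference, the three properties from step 2 can be written as a plain 
key/value map. This is only a sketch: `SinkConfigSketch` and 
`recoveryOverrides` are hypothetical names, and how the map is handed to the 
sink (admin API, CLI or config file) is left out on purpose.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for the overrides listed in step 2.
final class SinkConfigSketch {
    /** Builds the property overrides to apply before restarting the sink. */
    static Map<String, Object> recoveryOverrides() {
        Map<String, Object> configs = new LinkedHashMap<>();
        configs.put("canonicalKeyFields", true);
        configs.put("idHashingAlgorithm", "SHA256");
        configs.put("conditionalIdHashing", true);
        return configs;
    }

    public static void main(String[] args) {
        // Print each override as key=value, in insertion order.
        recoveryOverrides().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```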
   
   ### Modifications
   
   * New option `conditionalIdHashing`, which defaults to false. When it is 
enabled and `idHashingAlgorithm` is set, hashing is skipped if the raw key is 
<= 512 bytes.
   
   ### Documentation
   
   <!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
   
   - [x] `doc` <!-- Your PR contains doc changes. Please attach the local 
preview screenshots (run `sh start.sh` at `pulsar/site2/website`) to your PR 
description, or else your PR might not get merged. -->
   - [ ] `doc-required` <!-- Your PR changes impact docs and you will update 
later -->
   - [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
   - [ ] `doc-complete` <!-- Docs have been already added -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
