BruceWong96 edited a comment on pull request #18293: URL: https://github.com/apache/flink/pull/18293#issuecomment-1009895174
> Thanks for the contribution. The change should work for your case and the code itself looks OK overall. It looks like you want to fix the HBase bug by adding a work-around into the connector, and I have the following concerns:
>
> 1. An extra customized config option is exposed to users, which means an internal bug affects users' behaviour.
> 2. `HBaseConnectorOptions` has been extended for specific processing logic, beyond the neutral HBase lookup options, HBase configuration options, and Flink operator options. This smells like the start of hacking.
> 3. HBase should be the one that controls the versions. With the change in this PR, there are two roles that can do the same job, a violation of the single responsibility principle. It must be used very carefully to avoid data loss in production. I can already imagine a user complaining about unexpected data loss. I understand that the default deleteMode tries to reduce such incidents, and thanks for considering that, but it is now possible for e.g. a human mistake at the SQL level to cause data loss, and once that happens it is expensive to recover in a production environment.
> 4. This change does not cover the multiple-column-family scenario, since the scope of the version setting has been enlarged from column family to table. Modifying the deleteMode for one CF might end up causing data loss in another CF; not modifying it leaves the deletion issue for the related CF. This could be solved by splitting the table, which again means an internal bug affects users' behaviour, and there would be extra effort to maintain two or more tables that conceptually belong together and keep their data in sync.
> 5. Shotgun surgery: too many classes have been touched to build a work-around for a simple piece of functionality.
> 6. Classes defined for the common case got involved in the specific deletion case, i.e. `HBaseWriteOptions`, `HBaseMutationConverter`, etc.
>
> Looking forward to your thoughts.

Thank you for your reply.

1. Users need to understand the usage of the custom configuration before using it.
2. In changelog mode, `op = 'd'` means deleting data, even before this PR.
3. In changelog mode, Flink SQL calls the HBase client to modify data, which does not violate the single responsibility principle. Flink SQL could also delete data before this PR. If the user selects `'sink.delete.mode' = 'all-versions'`, the data is not accidentally lost but deleted intentionally. Before this PR, only the latest version was deleted.
4. In our tests, the deleted CF was the one specified in the Flink SQL DML statement, and other CFs were not affected. The detailed test results can be found in JIRA.
5. I think avoiding data loss without providing a correct way to delete data is not a good approach. What we need is for the HBase connector to support inserting, deleting, updating, and querying data in changelog mode. Currently, deletion is incomplete, which leads to incorrect data.

Looking forward to your thoughts.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
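For readers following the discussion, here is a sketch of how the proposed option might appear in a sink table definition. The option key `'sink.delete.mode' = 'all-versions'` is quoted from this PR; the table name, schema, and connection properties below are purely illustrative. In HBase client terms, the distinction under discussion roughly corresponds to `Delete.addColumn` (removes only the most recent cell version, the pre-PR behaviour) versus `Delete.addColumns` (removes all versions).

```sql
-- Hypothetical Flink SQL sink table; only 'sink.delete.mode' comes from this PR,
-- everything else (names, schema, connector version) is illustrative.
CREATE TABLE hbase_sink (
  rowkey STRING,
  cf1 ROW<col1 STRING, col2 BIGINT>,
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-2.2',
  'table-name' = 'my_table',
  'zookeeper.quorum' = 'localhost:2181',
  -- proposed option: on op='d', delete all cell versions
  -- instead of only the latest one
  'sink.delete.mode' = 'all-versions'
);
```

With the default mode, a changelog `DELETE` would keep older cell versions recoverable; with `'all-versions'`, the row's cells in the targeted column family are removed entirely, which is the trade-off debated in the review above.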
