BruceWong96 edited a comment on pull request #18293: URL: https://github.com/apache/flink/pull/18293#issuecomment-1009895174
> Thanks for the contribution. The change should work for your case and the code itself looks OK overall. It looks like you want to fix the HBase bug by adding a work-around into the connector, and I have the following concerns:
>
> 1. An extra customized config option is exposed to users, which means an internal bug affects users' behaviour.
> 2. `HBaseConnectorOptions` has been extended for specific processing logic, beyond the neutral HBase lookup options, HBase configuration options, and Flink operator options. This smells like the start of hacking.
> 3. HBase should be the one that controls the versions. With the change in this PR, there are two roles that can do the same job, a violation of the single responsibility principle. It must be used very carefully to avoid data loss in production. I can already imagine a user complaining about unexpected data loss. I understand that the default deleteMode tries to reduce such incidents, and thanks for considering that, but it is now possible for e.g. a human mistake at the SQL level to cause data loss, and once that happens it is expensive to recover in a production environment.
> 4. This change does not cover the multiple-column-family scenario, since the scope of the version setting has been enlarged from column family to table. Modifying the deleteMode for one CF might end up causing data loss in another CF; not modifying it leaves the deletion issue for the related CF. This could be solved by splitting the table, which again means an internal bug affects users' behaviour, and there would be extra effort to maintain two or more tables that conceptually belong together and keep their data in sync.
> 5. Shotgun surgery: too many classes have been touched to build a work-around for a simple piece of functionality.
> 6. Classes defined for the common case got involved in the specific deletion case, i.e. `HBaseWriteOptions`, `HBaseMutationConverter`, etc.
>
> Looking forward to your thoughts.

Thank you for your reply.

1. Users need to understand the usage of the custom configuration before using it.
2. In changelog mode, `op = 'd'` means deleting data, even before this PR.
3. In changelog mode, Flink SQL calls the HBase client to modify data, which does not violate the single responsibility principle. Flink SQL could also delete data before this PR. If the user selects `'sink.delete.mode' = 'all-versions'`, the data is not accidentally lost but deleted intentionally. Before this PR, only the latest version was deleted.
4. In our tests, the deleted CF was the one specified in the Flink SQL DML statement, and other CFs were not affected. The detailed test results can be found in JIRA.
5. I think avoiding data loss without providing a correct way to delete data is not a good approach. What we need is for the HBase connector to support inserting, deleting, updating, and querying data in changelog mode. Currently, deletion is incomplete, which leads to incorrect data.

Looking forward to your thoughts.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
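For readers following the discussion, here is a sketch of how the proposed option might appear in a sink table definition. The option key `'sink.delete.mode' = 'all-versions'` is quoted from this PR; the table name, schema, and connection properties below are purely illustrative. In HBase client terms, the distinction under discussion roughly corresponds to `Delete.addColumn` (removes only the most recent cell version, the pre-PR behaviour) versus `Delete.addColumns` (removes all versions).

```sql
-- Hypothetical Flink SQL sink table; only 'sink.delete.mode' comes from this PR,
-- everything else (names, schema, connector version) is illustrative.
CREATE TABLE hbase_sink (
  rowkey STRING,
  cf1 ROW<col1 STRING, col2 BIGINT>,
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-2.2',
  'table-name' = 'my_table',
  'zookeeper.quorum' = 'localhost:2181',
  -- proposed option: on op='d', delete all cell versions
  -- instead of only the latest one
  'sink.delete.mode' = 'all-versions'
);
```

With the default mode, a changelog `DELETE` would keep older cell versions recoverable; with `'all-versions'`, the row's cells in the targeted column family are removed entirely, which is the trade-off debated in the review above.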
