Apache9 commented on pull request #4055: URL: https://github.com/apache/hbase/pull/4055#issuecomment-1021952609
> Thanks for the ping, Duo. I'm still curious about what you saw.
>
> > I've tested the to read 1 byte, 2 bytes... And finally when I reached ~50 bytes, the parse succeeded without error
>
> You have a filelist which you noticed giving a this TableNotFoundException when it was parsed. When you tried parsing it by writing some custom code, you saw that with different lengths provided, the protobuf parser gave different errors?
>
> You are assuming that, eventually, we may have a case where HBase may write out a SFT file with a bad size (maybe HBase or HDFS bug), and this should protect us in that case? The simple crc at the head of the file sounds reasonable to prevent that causing bigger problems.

I mean, for example, after serialization a protobuf message should be ~300 bytes, but with only the leading ~50 bytes you can still deserialize it successfully, without any error; the fields of the resulting message will just differ from the ones you expect. This is a very serious problem. The old code assumes that if the bytes of a protobuf message are incomplete, we will always get an InvalidProtocolBufferException, but this is not true...

And it does not need to be a bug in HDFS or HBase. If we crash while writing the content to HDFS, it is possible that we write a partial file, right?

So here I added a length at the beginning of the file. If the file is shorter than that length, i.e., we hit an EOFException, we can say that the file is incomplete and ignore it. The trailing crc is used to verify that the content is as expected. If the crc mismatches, we throw an IOException and fail the region opening process. In this case, we need to manually check what the actual problem is and try to fix it, for example by regenerating the tracker file from the current store files under the data directory.
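To make the length-plus-crc idea concrete, here is a minimal sketch of such a framing. It is not the code in this pull request: the class name `FramedFile`, the methods `write` and `read`, the exact field order and widths, and the use of in-memory byte arrays instead of HDFS streams are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.CRC32;

// Hypothetical helper, not the actual HBase implementation.
// Assumed layout: [4-byte length][payload bytes][4-byte CRC32 of payload].
final class FramedFile {

  private static int checksum(byte[] payload) {
    CRC32 crc = new CRC32();
    crc.update(payload, 0, payload.length);
    return (int) crc.getValue();
  }

  // Serializes the payload (e.g. an encoded protobuf message) with the framing.
  static byte[] write(byte[] payload) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bos)) {
      out.writeInt(payload.length);
      out.write(payload);
      out.writeInt(checksum(payload));
    }
    return bos.toByteArray();
  }

  // Returns the payload, or null when the file is incomplete (a partial write,
  // which the caller may ignore). Throws IOException when the file is complete
  // but its content does not match the trailing checksum.
  static byte[] read(byte[] file) throws IOException {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
      int length = in.readInt();
      if (length < 0) {
        throw new IOException("Invalid length " + length);
      }
      if (length > file.length) {
        return null; // declared length cannot fit: treat as truncated
      }
      byte[] payload = new byte[length];
      in.readFully(payload);      // EOFException if the payload is cut short
      int expected = in.readInt(); // EOFException if the checksum is cut short
      if (checksum(payload) != expected) {
        throw new IOException("Checksum mismatch, file content is corrupt");
      }
      return payload;
    } catch (EOFException e) {
      return null; // not enough bytes: incomplete file
    }
  }
}
```

The point of the two checks is that they map to two different reactions: a short file is expected after a crash and can be skipped, while a complete file with a bad checksum is unexpected and should fail loudly so that someone investigates.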
