bakaid commented on issue #596: MINIFICPP-925 - Fix TailFile hang on long lines URL: https://github.com/apache/nifi-minifi-cpp/pull/596#issuecomment-505155688 @phrocker thanks for reviewing this PR. First of all, I would be more than happy to split this into multiple commits (or PRs), and I am open to postponing merging some of them after 0.7.0, if it is necessary or safer. Let me explain the rationale behind this PR, and why are these changes included. First, I created a fix for the buggy behaviour of the delimiting mode of TailFile. As I started to write tests for the fixed behaviour, I used the existing tests as a base, that build a TailFile - LogAttribute pipeline, with LogAttribute configured to log the payload, and using the logged payload to check the delimited content. Unfortunately, new tests that used long lines (required to test the fixed behaviour of lines longer than 4095 bytes) did not work. I found that was because LogAttribute was buggy. First of all, even though it self-imposed a 1 MB limit for FlowFiles to log, in reality it could not log nearly as much, because our formatted logging uses a 1024 byte long buffer to format the log message. Therefore, the content logging is truncated. Hence this fix using the non-formatted logging path of a single c_str, which does not have such limits (and could potentially be a bad idea because of the possibility of a later change which would interpret even standalone strings as printf format strings in the log pipeline, opening the possibility of a printf format string injection, but I will follow up on that later). The other issue I found was that while LogAttribute tried to hex encode the payload, it failed to do so. It tried to use message << std::hex << callback.buffer_[i]; which is ineffective, because uint8_t will be promoted to a char, and will be printed as a char, instead of interpreted as an integer to be printed in its hexadecimal form. The proper way to hex encode a string using ostreams is os << std::hex << std::setfill('0') << std::setw(2) << (int) b; Which can be found in multiple instances of the code, but unfortunately not here. But I think that the original intent is the correct one. FlowFiles are potentially non-textual, therefore they should be encoded before logging them to avoid logging non-printable characters. (Or it could be a configuration option, or we could detect whether a FlowFile is textual, but as a generic case, hex encoding them is the most straightforward way). As an added bonus the payload is encoded, therefore there is no possibility of a format string injection attack. As to why I did not simply duplicate the usual hex encoding code to this instance: I thought about how we could avoid such mistakes in the future, and I thought the best way would be to a create a single, tested hex encoder and decoder utility and use it everywhere. As it was a quick thing to do, I thought it would be worth to do it. If we do not want to include this in 0.7.0, that is OK, because it is not necessary: I can rewrite the tests to use PutFile and avoid using LogAttribute entirely. However, I do not think it is superfluous: it is fixing a bug, and making it possible to avoid other bugs in the future.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
