bakaid commented on issue #596: MINIFICPP-925 - Fix TailFile hang on long lines
URL: https://github.com/apache/nifi-minifi-cpp/pull/596#issuecomment-505155688
 
 
   @phrocker thanks for reviewing this PR.
   First of all, I would be more than happy to split this into multiple commits 
(or PRs), and I am open to postponing merging some of them after 0.7.0, if it 
is necessary or safer.
   Let me explain the rationale behind this PR, and why are these changes 
included.
   First, I created a fix for the buggy behaviour of the delimiting mode of 
TailFile. As I started to write tests for the fixed behaviour, I used the 
existing tests as a base, that build a TailFile - LogAttribute pipeline, with 
LogAttribute configured to log the payload, and using the logged payload to 
check the delimited content.
   Unfortunately, new tests that used long lines (required to test the fixed 
behaviour of lines longer than 4095 bytes) did not work.
   I found that was because LogAttribute was buggy.
   First of all, even though it self-imposed a 1 MB limit for FlowFiles to log, 
in reality it could not log nearly as much, because our formatted logging uses 
a 1024 byte long buffer to format the log message. Therefore, the content 
logging is truncated. Hence this fix using the non-formatted logging path of a 
single c_str, which does not have such limits (and could potentially be a bad 
idea because of the possibility of a later change which would interpret even 
standalone strings as printf format strings in the log pipeline, opening the 
possibility of a printf format string injection, but I will follow up on that 
later).
   The other issue I found was that while LogAttribute tried to hex encode the 
payload, it failed to do so.
   It tried to use
   message << std::hex << callback.buffer_[i];
   which is ineffective, because uint8_t will be promoted to a char, and will 
be printed as a char, instead of interpreted as an integer to be printed in its 
hexadecimal form.
   The proper way to hex encode a string using ostreams is
   os << std::hex << std::setfill('0') << std::setw(2) << (int) b;
   Which can be found in multiple instances of the code, but unfortunately not 
here.
   But I think that the original intent is the correct one. FlowFiles are 
potentially non-textual, therefore they should be encoded before logging them 
to avoid logging non-printable characters. (Or it could be a configuration 
option, or we could detect whether a FlowFile is textual, but as a generic 
case, hex encoding them is the most straightforward way).
   As an added bonus the payload is encoded, therefore there is no possibility 
of a format string injection attack.
   As to why I did not simply duplicate the usual hex encoding code to this 
instance: I thought about how we could avoid such mistakes in the future, and I 
thought the best way would be to a create a single, tested hex encoder and 
decoder utility and use it everywhere. As it was a quick thing to do, I thought 
it would be worth to do it.
   If we do not want to include this in 0.7.0, that is OK, because it is not 
necessary: I can rewrite the tests to use PutFile and avoid using LogAttribute 
entirely. However, I do not think it is superfluous: it is fixing a bug, and 
making it possible to avoid other bugs in the future.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to