Tianyi Wang has posted comments on this change. Change subject: IMPALA-5250: Unify decompressor output_length semantics ......................................................................
Patch Set 1: (18 comments) http://gerrit.cloudera.org:8080/#/c/8030/1//COMMIT_MSG Commit Message: Line 9: This patch makes the semantics output_length parameter in > the semantics of the output_length Done Line 10: Codec::ProcessBlock to be the same. In existing code different > Codec::ProcessBlock() to be the same across all codecs. In the existing cod Done Line 13: to actual decompressed length, but it does not set it to actual > to the actual decompressed length Done Line 16: be exactly the same as actual decompressed length, otherwise > as the actual decompressed length Done Line 19: actual decompressed length and will set it to actual decompressed > the actual decompressed length Done Line 21: This inconsistency leads to a bug where the error message is > This inconsistency lead to a bug where .... Do you mean changing it to plural? Line 24: decompressors can handle oversized output buffer correctly. > can handle an oversized output buffer correctly. Done Line 25: Lz4Decompressor will use the "safe" version of decompression function > will use the "safe" instead of the "fast" decompression function Done Line 26: instead of the "fast" version, for the latter is insecure with corrupted > We use LZ4 for compressing exchange data, so we check the perf impact of th Is there an individual benchmark? There is one for compression in be/src/experiments/compression-test.cc. Line 27: data and requires decompressed length to be known. > and requires the decompressed length to be known Done http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress-test.cc File be/src/util/decompress-test.cc: Line 153: // Test the behavior when the decompressor is given too little / too much space > "." at end of sentence Done Line 155: // correct output size when the space is enough, and does not write beyond the output > the correct output size Done Line 157: void DecompressOverUnderSizedOutputBuffer(Codec *compressor, Codec *decompressor, > Out style is "Codec* compressor" Done Line 196: } > I think we should test more directly the scenario mentioned in the JIRA whe I'm not sure if my interpretation of JIRA is correct. In my understanding: 1. The parquet file is corrupted so either current_page_header_.uncompressed_page_size is greater or current_page_header_.compressed_page_size is less than what it should be. 2. Either way we allocated a larger buffer and the decompressor decompressed the (probably incomplete) data successfully. 3. The part of the buffer not written by the decompressor is uninitilized. Because the decompressor returned OK, somehow we read this uninitilized part later and trigger undeterministic errors. The problem here is in 2: The decompressor should report how much data it decompressed even if the decompression is successful. Then ReadDataPage() can check whether it is the same as current_page_header_.uncompressed_page_size. So I think we should test whether output_len is set properly with an successful decompression. I don't understand "set to 0" part. When should it be set to 0? http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress.cc File be/src/util/decompress.cc: Line 426: // If size_only is false, output buffer size must be at least output_len. *output_len is > Mention what output_len is set to if a non-OK status is returned. Done. It is set to 0 in this case, while other decompressor may leave it unmodified or set it to an arbitrary intermediate value. I think the complication here is caused by the design that 'output_len' and 'output' are output parameters. When a parameter is both input and output, and could be passed as reference further into callees of callee, there could be all kinds of bugs forgetting to set the correct value here and there. I think if output and output_len is in the return value, creating such bugs is much more difficult, and the code will be much easier to follow since the data flow is clear and explicit. I think we should have a Status<T> type for return values. It represents either a value or an error status, and then we can avoid output parameters in new codes. Line 427: // also updated with actual output size. > the actual output size Done Line 575: // LZ4_decompress_fast will cause a segmentation fault if passed a nullptr output. > Comment still relevant? Removed. Line 579: if (ret < 0) { > single line Done -- To view, visit http://gerrit.cloudera.org:8080/8030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tianyi Wang <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Tianyi Wang <[email protected]> Gerrit-HasComments: Yes
