[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Tianyi Wang (Code Review) Tue, 12 Sep 2017 13:34:31 -0700

Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................



Patch Set 1:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/8030/1//COMMIT_MSG
Commit Message:

Line 9: This patch makes the semantics output_length parameter in
> the semantics of the output_length
Done


Line 10: Codec::ProcessBlock to be the same. In existing code different
> Codec::ProcessBlock() to be the same across all codecs. In the existing cod
Done


Line 13:    to actual decompressed length, but it does not set it to actual
> to the actual decompressed length
Done


Line 16:    be exactly the same as actual decompressed length, otherwise
> as the actual decompressed length
Done


Line 19:    actual decompressed length and will set it to actual decompressed
> the actual decompressed length
Done


Line 21: This inconsistency leads to a bug where the error message is
> This inconsistency lead to a bug where ....
Do you mean changing it to plural?


Line 24: decompressors can handle oversized output buffer correctly.
> can handle an oversized output buffer correctly.
Done


Line 25: Lz4Decompressor will use the "safe" version of decompression function
> will use the "safe" instead of the "fast" decompression function
Done


Line 26: instead of the "fast" version, for the latter is insecure with 
corrupted
> We use LZ4 for compressing exchange data, so we check the perf impact of th
Is there an individual benchmark?
There is one for compression in be/src/experiments/compression-test.cc.


Line 27: data and requires decompressed length to be known.
> and requires the decompressed length to be known
Done


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress-test.cc
File be/src/util/decompress-test.cc:

Line 153:   // Test the behavior when the decompressor is given too little / 
too much space
> "." at end of sentence
Done


Line 155:   // correct output size when the space is enough, and does not write 
beyond the output
> the correct output size
Done


Line 157:   void DecompressOverUnderSizedOutputBuffer(Codec *compressor, Codec 
*decompressor,
> Out style is "Codec* compressor"
Done


Line 196:     }
> I think we should test more directly the scenario mentioned in the JIRA whe
I'm not sure if my interpretation of JIRA is correct. In my understanding:
1. The parquet file is corrupted so either 
current_page_header_.uncompressed_page_size is greater or 
current_page_header_.compressed_page_size is less than what it should be. 
2. Either way we allocated a larger buffer and the decompressor decompressed 
the (probably incomplete) data successfully. 
3. The part of the buffer not written by the decompressor is uninitilized. 
Because the decompressor returned OK, somehow we read this uninitilized part 
later and trigger undeterministic errors.
The problem here is in 2: The decompressor should report how much data it 
decompressed even if the decompression is successful. Then ReadDataPage() can 
check whether it is the same as current_page_header_.uncompressed_page_size.
So I think we should test whether output_len is set properly with an successful 
decompression. I don't understand "set to 0" part. When should it be set to 0?


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress.cc
File be/src/util/decompress.cc:

Line 426: // If size_only is false, output buffer size must be at least 
output_len. *output_len is
> Mention what output_len is set to if a non-OK status is returned.
Done. It is set to 0 in this case, while other decompressor may leave it 
unmodified or set it to an arbitrary intermediate value.
I think the complication here is caused by the design that 'output_len' and 
'output' are output parameters. When a parameter is both input and output, and 
could be passed as reference further into callees of callee, there could be all 
kinds of bugs forgetting to set the correct value here and there. I think if 
output and output_len is in the return value, creating such bugs is much more 
difficult, and the code will be much easier to follow since the data flow is 
clear and explicit.

I think we should have a Status<T> type for return values. It represents either 
a value or an error status, and then we can avoid output parameters in new 
codes.


Line 427: // also updated with actual output size.
> the actual output size
Done


Line 575:   // LZ4_decompress_fast will cause a segmentation fault if passed a 
nullptr output.
> Comment still relevant?
Removed.


Line 579:   if (ret < 0) {
> single line
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Tianyi Wang <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Reply via email to