Hello Stefan,

On 26/11/22 11:33 pm, Stefan Bodewig wrote:
On 2022-11-19, Stefan Bodewig wrote:

while running all tests locally I realized
https://github.com/apache/ant/pull/194 introduces a bug when tars have
an encoding set. You can see this with UntarTest failing. A multibyte
encoding like UTF16 may contain NUL bytes inside of the "normal" part of
the name. I'll have to think about a way to solve this, but we should
not release Ant with this regression.
https://github.com/apache/ant/commit/e04fbe7ffa4f549bdc00bdfce688fb587120eedd
fixes tthe problem, at least for our test.

It makes parsing tar archives a bit slower, but that likely won't matter
much for single-byte encodings (tar isn't meant to be used with
multi-byte encodings). Also I don't think reading tars is what a normal
build does for the majority of its time.

Anyway, I'd appreciate a review of the code I've written there.

What I wanted to do is to ask the encoding how it would represent a NUL
and look search for that when finding the string terminator - as opposed
to looking for a single NUL byte.

Our testcase used UTF-16 BE with a BOM, so encoding "\0" ends up with
two bytes BOM + 0x00 0x00 - while I really only wanted to 0x00
0x00. Next attempt was to encode an empty string to see whether there is
a common prefix, but an empty string is encoded as 0 bytes, no BOM. :-)

So finally I compared "X" to "X\0" and stripped what seems to be
"X". I'm pretty sure this breaks for certain encodings, but not worse
than it has worked before.

And then I sprinkled some memoization on top.

TBM I could also live with reverting the whole commit, saying "don't use
multi-byte encodings in tars" and be done. The original test we had
worked by accident, if the old test had used UTF16-LE instead and a 49
character file name it would have failed as we'd try to decode a byte
array with an odd number of bytes.

Finding A solution was a nice puzzle, but that doesn't mean we have to
use it.

I had a look at that commit and the PR discussion where the initial fix was made. I don't have enough knowledge of this area to provide relevant inputs, but from what you have explained here and in the PR and looking at the code it looks fine to me. You (and the original contributor) note that there are still issues that this change won't solve, but I think that is OK for now. I say that because the current state/fix addresses an actual issue that was reported[1]. I've verified that the current code in master with your changes does indeed allow extracting that .tar.gz fine (unlike Ant 1.10.12). Plus our testsuite continues to pass. So that's a good thing.

[1] https://github.com/ibmruntimes/Semeru-Runtimes/issues/15

-Jaikiran


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@ant.apache.org
For additional commands, e-mail: dev-h...@ant.apache.org

Reply via email to