Glavo commented on PR #378:
URL: https://github.com/apache/commons-compress/pull/378#issuecomment-1501863937
> No tests. Code coverage is down. No proof of performance change one way or
another. Non-trivial additional complexity.
The core intention of this PR is to reduce memory allocation.
As mentioned earlier, currently we allocate a LinkedList for each entry in
ZipFile, resulting in an additional 32 (LinkedList) + 24 (LinkedList$Node)
bytes of memory overhead. This PR no longer assigns them.
As for the side effect - performance improvement, this is the result of a
simple JMH benchmark:
<details><summary>Benchmark source code</summary>
<p>
```java
public class Compress {
private static final Path jarPath = Path.of(/* path of
commons-compress-1.23.0.jar */);
private static final ZipFile globalFile ;
private static final String[] entryNames;
static {
try {
globalFile = new ZipFile(jarPath);
} catch (IOException e) {
throw new RuntimeException(e);
}
entryNames = Collections.list(globalFile.getEntries())
.stream()
.map(ZipArchiveEntry::getName)
.toArray(String[]::new);
}
@Benchmark
public void testOpen() throws IOException {
try (ZipFile file = new ZipFile(jarPath)) {
}
}
@Benchmark
public void testGetEntry(Blackhole blackhole) throws IOException {
for (String entryName : entryNames) {
blackhole.consume(globalFile.getEntry(entryName));
}
}
@Benchmark
public void testGetEntries(Blackhole blackhole) throws IOException {
for (String entryName : entryNames) {
blackhole.consume(globalFile.getEntries(entryName));
}
}
@Benchmark
public void testGetEntriesInPhysicalOrder(Blackhole blackhole) throws
IOException {
for (String entryName : entryNames) {
blackhole.consume(globalFile.getEntriesInPhysicalOrder(entryName));
}
}
}
```
</p>
</details>
commons-compress 1.23.0:
```
Benchmark Mode Cnt Score Error
Units
Compress.testGetEntries thrpt 3 657266.611 ± 1690.346
ops/s
Compress.testGetEntriesInPhysicalOrder thrpt 3 121514.308 ± 1001.179
ops/s
Compress.testGetEntry thrpt 3 504127.618 ± 1988.320
ops/s
Compress.testOpen thrpt 3 663.604 ± 24.088
ops/s
```
This PR:
```
Benchmark Mode Cnt Score Error
Units
Compress.testGetEntries thrpt 3 465237.072 ± 37283.323
ops/s
Compress.testGetEntriesInPhysicalOrder thrpt 3 459856.583 ± 13165.000
ops/s
Compress.testGetEntry thrpt 3 661743.860 ± 1055.489
ops/s
Compress.testOpen thrpt 3 665.040 ± 23.971
ops/s
```
For the same jar file (commons-compress-2.13.0.jar):
* Little change in performance opening zip files;
* `getEntry` is 30% faster;
* `getEntriesInPhysicalOrder(String)` is 280% faster.
The only problem is `getEntries(String)`, which is 30% slower, because each
call needs to create a new singleton list. But the old implementation was
unsafe because it would return the internal mutable list that could be modified
by the user simply by casting, thus breaking the `ZipFile`, the implementation
in this PR fixes this issue.
> "For extremely rare situations, this PR may have slight negative effects,"
>
> Which are?
>
> "But I think this should not be a problem."
>
> Because?
Because this situation is too rare.
As far as I know, most tools (7zip, jar commands, etc.) don't create zip
files with entries with duplicate names. Such unusual zip files are generally
only possible to create using some low-level api, so such files themselves are
very rare, I can't even easily find examples in the real world.
And, even for this unusual file, it's debatable whether this PR has a
positive or negative effect.
If the majority of items do not have the same name, this PR still has a
positive effect.
And if there are a large number of entries with the same name, the effect of
this PR may be different under different conditions. But I don't think it's a
good case to look into, as I can't imagine a real world use case for it.
If you guys know of a real use case for duplicate entries, I'll look into
that further. But for now, I don't see much point in caring too much about this
situation, since regular zip files are much, much more numerous.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]