[GitHub] [commons-compress] Glavo commented on pull request #378: Optimize the nameMap of ZipFile

via GitHub Mon, 10 Apr 2023 07:11:53 -0700


Glavo commented on PR #378:
URL: https://github.com/apache/commons-compress/pull/378#issuecomment-1501863937


   > No tests. Code coverage is down. No proof of performance change one way or 
another. Non-trivial additional complexity.
   
   The core intention of this PR is to reduce memory allocation.
   
   As mentioned earlier, currently we allocate a LinkedList for each entry in 
ZipFile, resulting in an additional 32 (LinkedList) + 24 (LinkedList$Node) 
bytes of memory overhead. This PR no longer assigns them.
   
   As for the side effect - performance improvement, this is the result of a 
simple JMH benchmark:
   
   <details><summary>Benchmark source code</summary>
   <p>
   
   ```java
   public class Compress {
   
       private static final Path jarPath = Path.of(/* path of 
commons-compress-1.23.0.jar */);
   
       private static final ZipFile globalFile ;
       private static final String[] entryNames;
   
       static {
           try {
               globalFile = new ZipFile(jarPath);
           } catch (IOException e) {
               throw new RuntimeException(e);
           }
   
           entryNames = Collections.list(globalFile.getEntries())
                   .stream()
                   .map(ZipArchiveEntry::getName)
                   .toArray(String[]::new);
       }
   
       @Benchmark
       public void testOpen() throws IOException {
           try (ZipFile file = new ZipFile(jarPath)) {
           }
       }
   
       @Benchmark
       public void testGetEntry(Blackhole blackhole) throws IOException {
           for (String entryName : entryNames) {
               blackhole.consume(globalFile.getEntry(entryName));
           }
       }
   
       @Benchmark
       public void testGetEntries(Blackhole blackhole) throws IOException {
           for (String entryName : entryNames) {
               blackhole.consume(globalFile.getEntries(entryName));
           }
       }
   
       @Benchmark
       public void testGetEntriesInPhysicalOrder(Blackhole blackhole) throws 
IOException {
           for (String entryName : entryNames) {
               
blackhole.consume(globalFile.getEntriesInPhysicalOrder(entryName));
           }
       }
   }
   
   ```
   
   </p>
   </details>
   
   commons-compress 1.23.0:
   
   ```
   Benchmark                                Mode  Cnt       Score      Error  
Units
   Compress.testGetEntries                 thrpt    3  657266.611 ± 1690.346  
ops/s
   Compress.testGetEntriesInPhysicalOrder  thrpt    3  121514.308 ± 1001.179  
ops/s
   Compress.testGetEntry                   thrpt    3  504127.618 ± 1988.320  
ops/s
   Compress.testOpen                       thrpt    3     663.604 ±   24.088  
ops/s
   ```
   
   This PR:
   
   ```
   Benchmark                                Mode  Cnt       Score       Error  
Units
   Compress.testGetEntries                 thrpt    3  465237.072 ± 37283.323  
ops/s
   Compress.testGetEntriesInPhysicalOrder  thrpt    3  459856.583 ± 13165.000  
ops/s
   Compress.testGetEntry                   thrpt    3  661743.860 ±  1055.489  
ops/s
   Compress.testOpen                       thrpt    3     665.040 ±    23.971  
ops/s
   ```
   
   For the same jar file (commons-compress-2.13.0.jar):
   
   *  Little change in performance opening zip files;
   * `getEntry` is 30% faster;
   * `getEntriesInPhysicalOrder(String)` is 280% faster.
   
   The only problem is `getEntries(String)`, which is 30% slower, because each 
call needs to create a new singleton list. But the old implementation was 
unsafe because it would return the internal mutable list that could be modified 
by the user simply by casting, thus breaking the `ZipFile`, the implementation 
in this PR fixes this issue.
   
   > "For extremely rare situations, this PR may have slight negative effects,"
   > 
   > Which are?
   >
   > "But I think this should not be a problem."
   > 
   > Because?
   
   Because this situation is too rare.
   
   As far as I know, most tools (7zip, jar commands, etc.) don't create zip 
files with entries with duplicate names. Such unusual zip files are generally 
only possible to create using some low-level api, so such files themselves are 
very rare, I can't even easily find examples in the real world.
   
   And, even for this unusual file, it's debatable whether this PR has a 
positive or negative effect. 
   
   If the majority of items do not have the same name, this PR still has a 
positive effect. 
   
   And if there are a large number of entries with the same name, the effect of 
this PR may be different under different conditions. But I don't think it's a 
good case to look into, as I can't imagine a real world use case for it.
   
   If you guys know of a real use case for duplicate entries, I'll look into 
that further. But for now, I don't see much point in caring too much about this 
situation, since regular zip files are much, much more numerous.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [commons-compress] Glavo commented on pull request #378: Optimize the nameMap of ZipFile

Reply via email to