[
https://issues.apache.org/jira/browse/COMPRESS-540?focusedWorklogId=514248&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-514248
]
ASF GitHub Bot logged work on COMPRESS-540:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 19/Nov/20 17:59
Start Date: 19/Nov/20 17:59
Worklog Time Spent: 10m
Work Description: theobisproject commented on pull request #113:
URL: https://github.com/apache/commons-compress/pull/113#issuecomment-730541671
The benchmark was added to the
https://github.com/bodewig/commons-compress-benchmarks repository for easier
setup.
```java
package de.samaflost.commons_compress.tar;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.archivers.tar.TarFile;
import org.apache.commons.compress.utils.IOUtils;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.util.NullOutputStream;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@Fork(1)
@OutputTimeUnit(TimeUnit.SECONDS)
@Measurement(iterations = 10)
@Warmup(iterations = 10)
public class ReadLargeTarBenchmark {

    private InputStream getLargeTarStream() throws IOException {
        return Files.newInputStream(getLargeTarPath());
    }

    private Path getLargeTarPath() {
        // placeholder: path to the benchmark tar archive
        return Paths.get("path-to-file");
    }

    private void readEntry(final InputStream inputStream) throws IOException {
        IOUtils.copy(inputStream, new NullOutputStream(), 8192);
    }

    @Benchmark
    public void readAllEntries_tarFile() throws IOException {
        try (TarFile tarFile = new TarFile(getLargeTarPath())) {
            for (TarArchiveEntry entry : tarFile.getEntries()) {
                try (InputStream entryInput = tarFile.getInputStream(entry)) {
                    readEntry(entryInput);
                }
            }
        }
    }

    @Benchmark
    public void readAllEntries_tarStream() throws IOException {
        try (TarArchiveInputStream tarIn = new TarArchiveInputStream(getLargeTarStream())) {
            while (tarIn.getNextTarEntry() != null) {
                readEntry(tarIn);
            }
        }
    }

    @Benchmark
    public void readFirstEntry_tarFile() throws IOException {
        try (TarFile tarFile = new TarFile(getLargeTarPath())) {
            try (InputStream entryInput = tarFile.getInputStream(tarFile.getEntries().get(0))) {
                readEntry(entryInput);
            }
        }
    }

    @Benchmark
    public void readFirstEntry_tarStream() throws IOException {
        try (TarArchiveInputStream tarIn = new TarArchiveInputStream(getLargeTarStream())) {
            tarIn.getNextTarEntry();
            readEntry(tarIn);
        }
    }

    @Benchmark
    public void readSecondEntry_tarFile() throws IOException {
        try (TarFile tarFile = new TarFile(getLargeTarPath())) {
            try (InputStream entryInput = tarFile.getInputStream(tarFile.getEntries().get(1))) {
                readEntry(entryInput);
            }
        }
    }

    @Benchmark
    public void readSecondEntry_tarStream() throws IOException {
        try (TarArchiveInputStream tarIn = new TarArchiveInputStream(getLargeTarStream())) {
            // skip the first entry, then read the second
            tarIn.getNextTarEntry();
            tarIn.getNextTarEntry();
            readEntry(tarIn);
        }
    }

    @Benchmark
    public void readLastEntry_tarFile() throws IOException {
        try (TarFile tarFile = new TarFile(getLargeTarPath())) {
            try (InputStream entryInput = tarFile.getInputStream(
                    tarFile.getEntries().get(tarFile.getEntries().size() - 1))) {
                readEntry(entryInput);
            }
        }
    }

    @Benchmark
    public void readLastEntry_tarStream() throws IOException {
        try (TarArchiveInputStream tarIn = new TarArchiveInputStream(getLargeTarStream())) {
            // the test archive contains exactly three entries, so the third is the last
            tarIn.getNextTarEntry();
            tarIn.getNextTarEntry();
            tarIn.getNextTarEntry();
            readEntry(tarIn);
        }
    }
}
```
These are the results on my machine (Windows 10, Java 11.0.9, file on an SSD)
with a test archive containing three copies of the Ubuntu 20.04 Desktop image.
```
Benchmark                                        Mode  Cnt  Score   Error  Units
ReadLargeTarBenchmark.readAllEntries_tarFile    thrpt   10  0,228 ± 0,003  ops/s
ReadLargeTarBenchmark.readAllEntries_tarStream  thrpt   10  0,256 ± 0,002  ops/s
ReadLargeTarBenchmark.readFirstEntry_tarFile    thrpt   10  0,680 ± 0,010  ops/s
ReadLargeTarBenchmark.readFirstEntry_tarStream  thrpt   10  0,766 ± 0,006  ops/s
ReadLargeTarBenchmark.readLastEntry_tarFile     thrpt   10  0,689 ± 0,006  ops/s
ReadLargeTarBenchmark.readLastEntry_tarStream   thrpt   10  0,127 ± 0,001  ops/s
ReadLargeTarBenchmark.readSecondEntry_tarFile   thrpt   10  0,690 ± 0,012  ops/s
ReadLargeTarBenchmark.readSecondEntry_tarStream thrpt   10  0,218 ± 0,003  ops/s
```
I think this demonstrates the strengths and weaknesses of the implementation.

**Strengths**
- Random-access performance is roughly constant regardless of the entry's position in the archive
- Extracting all entries comes close to the performance of `TarArchiveInputStream`

**Weaknesses**
- Extracting only the first entry is slower because of the extra up-front work done for random access
The performance gap should be smaller for smaller archives. Feel free to
experiment with different tar contents and numbers of files and report back
your results.
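For intuition about where the first-entry overhead comes from: the random-access approach has to walk every 512-byte header once to learn where each entry's data begins before it can hand out any input stream. The following is a toy, stdlib-only sketch of that indexing idea, not the actual `TarFile` code; the in-memory archive builder is a deliberately minimal assumption (it fills in only the name and size fields and omits checksums, which the toy scanner does not verify).

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of the up-front indexing pass behind tar random access:
// scan all 512-byte headers once, recording each entry's data offset and size
// so later reads can seek straight to the data.
public class TarOffsetScan {
    static final int BLOCK = 512;

    // Build a minimal, uncompressed tar in memory (name + octal size only;
    // checksums omitted because the toy scanner below does not verify them).
    static byte[] fakeTar(Map<String, byte[]> entries) {
        int total = BLOCK * 2; // two trailing zero blocks end the archive
        for (byte[] data : entries.values()) {
            total += BLOCK + ((data.length + BLOCK - 1) / BLOCK) * BLOCK;
        }
        byte[] tar = new byte[total];
        int pos = 0;
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(name, 0, tar, pos, name.length);       // name field at offset 0
            byte[] size = String.format("%011o", e.getValue().length)
                    .getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(size, 0, tar, pos + 124, size.length); // octal size at offset 124
            pos += BLOCK;                                           // data follows the header
            System.arraycopy(e.getValue(), 0, tar, pos, e.getValue().length);
            pos += ((e.getValue().length + BLOCK - 1) / BLOCK) * BLOCK; // pad to block size
        }
        return tar;
    }

    // One sequential pass over the headers: name -> { dataOffset, size }.
    static Map<String, long[]> scanOffsets(byte[] tar) {
        Map<String, long[]> index = new LinkedHashMap<>();
        int pos = 0;
        while (pos + BLOCK <= tar.length && tar[pos] != 0) { // a zero block ends the archive
            int nameEnd = pos;
            while (tar[nameEnd] != 0) nameEnd++;
            String name = new String(tar, pos, nameEnd - pos, StandardCharsets.US_ASCII);
            long size = Long.parseLong(
                    new String(tar, pos + 124, 11, StandardCharsets.US_ASCII).trim(), 8);
            index.put(name, new long[] { pos + BLOCK, size });
            pos += BLOCK + (int) ((size + BLOCK - 1) / BLOCK) * BLOCK; // skip padded data
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        entries.put("a.txt", "hello".getBytes(StandardCharsets.US_ASCII));
        entries.put("b.txt", new byte[600]);
        Map<String, long[]> index = scanOffsets(fakeTar(entries));
        // a.txt data starts right after its header; b.txt after a.txt's padded data
        System.out.println(index.get("a.txt")[0]); // 512
        System.out.println(index.get("b.txt")[0]); // 1536
    }
}
```

The index costs one full header scan before the first byte of data is returned, which matches the benchmark: reading only the first entry pays for the scan without amortizing it, while repeated or positional reads get it back.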
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 514248)
Time Spent: 4h 20m (was: 4h 10m)
> Random access on Tar archive
> ----------------------------
>
> Key: COMPRESS-540
> URL: https://issues.apache.org/jira/browse/COMPRESS-540
> Project: Commons Compress
> Issue Type: Improvement
> Reporter: Robin Schimpf
> Priority: Major
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> The TarArchiveInputStream only provides sequential access. If only a small
> number of files from the archive is needed, a large amount of data in the
> input stream has to be skipped.
> Therefore I was working on an implementation that provides random access to
> tar files, equivalent to the ZipFile API. The basic idea behind the
> implementation is the following:
> * Random access is backed by a SeekableByteChannel
> * Read all headers of the tar file and record the position of each entry's
> data
> * Users can request an input stream for any entry in the archive multiple
> times
--
This message was sent by Atlassian Jira
(v8.3.4#803005)