[
https://issues.apache.org/jira/browse/COMPRESS-540?focusedWorklogId=528527&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-528527
]
ASF GitHub Bot logged work on COMPRESS-540:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 26/Dec/20 12:03
Start Date: 26/Dec/20 12:03
Worklog Time Spent: 10m
Work Description: theobisproject commented on a change in pull request
#113:
URL: https://github.com/apache/commons-compress/pull/113#discussion_r548976007
##########
File path: src/main/java/org/apache/commons/compress/archivers/tar/TarFile.java
##########
@@ -0,0 +1,736 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ */
+package org.apache.commons.compress.archivers.tar;
+
+import java.io.ByteArrayOutputStream;
+import java.io.Closeable;
+import java.io.File;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.nio.channels.SeekableByteChannel;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+
+import org.apache.commons.compress.archivers.zip.ZipEncoding;
+import org.apache.commons.compress.archivers.zip.ZipEncodingHelper;
+import org.apache.commons.compress.utils.ArchiveUtils;
+import org.apache.commons.compress.utils.BoundedInputStream;
+import org.apache.commons.compress.utils.BoundedNIOInputStream;
+import org.apache.commons.compress.utils.BoundedSeekableByteChannelInputStream;
+import org.apache.commons.compress.utils.SeekableInMemoryByteChannel;
+
+/**
+ * The TarFile provides random access to UNIX archives.
+ * @since 1.21
+ */
+public class TarFile implements Closeable {
+
+ private static final int SMALL_BUFFER_SIZE = 256;
+
+ private final byte[] smallBuf = new byte[SMALL_BUFFER_SIZE];
+
+ private final SeekableByteChannel archive;
+
+ /**
+ * The encoding of the tar file
+ */
+ private final ZipEncoding zipEncoding;
+
+ private final LinkedList<TarArchiveEntry> entries = new LinkedList<>();
+
+ private final int blockSize;
+
+ private final boolean lenient;
+
+ private final int recordSize;
+
+ private final ByteBuffer recordBuffer;
+
+ // the global sparse headers, this is only used in PAX Format 0.X
+ private final List<TarArchiveStructSparse> globalSparseHeaders = new ArrayList<>();
+
+ private boolean hasHitEOF;
+
+ /**
+ * The meta-data about the current entry
+ */
+ private TarArchiveEntry currEntry;
+
+ // the global PAX header
+ private Map<String, String> globalPaxHeaders = new HashMap<>();
+
+ private final Map<String, List<InputStream>> sparseInputStreams = new HashMap<>();
+
+ /**
+ * Constructor for TarFile.
+ *
+ * @param content the content to use
+ * @throws IOException when reading the tar archive fails
+ */
+ public TarFile(final byte[] content) throws IOException {
+ this(new SeekableInMemoryByteChannel(content), TarConstants.DEFAULT_BLKSIZE, TarConstants.DEFAULT_RCDSIZE, null, false);
+ }
+
+ /**
+ * Constructor for TarFile.
+ *
+ * @param content the content to use
+ * @param encoding the encoding to use
+ * @throws IOException when reading the tar archive fails
+ */
+ public TarFile(final byte[] content, final String encoding) throws IOException {
+ this(new SeekableInMemoryByteChannel(content), TarConstants.DEFAULT_BLKSIZE, TarConstants.DEFAULT_RCDSIZE, encoding, false);
+ }
+
+ /**
+ * Constructor for TarFile.
+ *
+ * @param content the content to use
+ * @param lenient when set to true illegal values for group/userid, mode, device numbers and timestamp will be
+ * ignored and the fields set to {@link TarArchiveEntry#UNKNOWN}. When set to false such illegal fields cause an
+ * exception instead.
+ * @throws IOException when reading the tar archive fails
+ */
+ public TarFile(final byte[] content, final boolean lenient) throws IOException {
+ this(new SeekableInMemoryByteChannel(content), TarConstants.DEFAULT_BLKSIZE, TarConstants.DEFAULT_RCDSIZE, null, lenient);
+ }
+
+ /**
+ * Constructor for TarFile.
+ *
+ * @param archive the file of the archive to use
+ * @throws IOException when reading the tar archive fails
+ */
+ public TarFile(final File archive) throws IOException {
+ this(archive.toPath());
+ }
+
+ /**
+ * Constructor for TarFile.
+ *
+ * @param archive the file of the archive to use
+ * @param encoding the encoding to use
+ * @throws IOException when reading the tar archive fails
+ */
+ public TarFile(final File archive, final String encoding) throws IOException {
+ this(archive.toPath(), encoding);
+ }
+
+ /**
+ * Constructor for TarFile.
+ *
+ * @param archive the file of the archive to use
+ * @param lenient when set to true illegal values for group/userid, mode, device numbers and timestamp will be
+ * ignored and the fields set to {@link TarArchiveEntry#UNKNOWN}. When set to false such illegal fields cause an
+ * exception instead.
+ * @throws IOException when reading the tar archive fails
+ */
+ public TarFile(final File archive, final boolean lenient) throws IOException {
+ this(archive.toPath(), lenient);
+ }
+
+ /**
+ * Constructor for TarFile.
+ *
+ * @param archivePath the path of the archive to use
+ * @throws IOException when reading the tar archive fails
+ */
+ public TarFile(final Path archivePath) throws IOException {
+ this(Files.newByteChannel(archivePath), TarConstants.DEFAULT_BLKSIZE, TarConstants.DEFAULT_RCDSIZE, null, false);
+ }
+
+ /**
+ * Constructor for TarFile.
+ *
+ * @param archivePath the path of the archive to use
+ * @param encoding the encoding to use
+ * @throws IOException when reading the tar archive fails
+ */
+ public TarFile(final Path archivePath, final String encoding) throws IOException {
+ this(Files.newByteChannel(archivePath), TarConstants.DEFAULT_BLKSIZE, TarConstants.DEFAULT_RCDSIZE, encoding, false);
+ }
+
+ /**
+ * Constructor for TarFile.
+ *
+ * @param archivePath the path of the archive to use
+ * @param lenient when set to true illegal values for group/userid, mode, device numbers and timestamp will be
+ * ignored and the fields set to {@link TarArchiveEntry#UNKNOWN}. When set to false such illegal fields cause an
+ * exception instead.
+ * @throws IOException when reading the tar archive fails
+ */
+ public TarFile(final Path archivePath, final boolean lenient) throws IOException {
+ this(Files.newByteChannel(archivePath), TarConstants.DEFAULT_BLKSIZE, TarConstants.DEFAULT_RCDSIZE, null, lenient);
+ }
+
+ /**
+ * Constructor for TarFile.
+ *
+ * @param archive the seekable byte channel to use
+ * @param blockSize the block size to use
+ * @param recordSize the record size to use
+ * @param encoding the encoding to use
+ * @param lenient when set to true illegal values for group/userid, mode, device numbers and timestamp will be
+ * ignored and the fields set to {@link TarArchiveEntry#UNKNOWN}. When set to false such illegal fields cause an
+ * exception instead.
+ * @throws IOException when reading the tar archive fails
+ */
+ public TarFile(final SeekableByteChannel archive, final int blockSize, final int recordSize, final String encoding, final boolean lenient) throws IOException {
+ this.archive = archive;
+ this.hasHitEOF = false;
+ this.zipEncoding = ZipEncodingHelper.getZipEncoding(encoding);
+ this.recordSize = recordSize;
+ this.recordBuffer = ByteBuffer.allocate(this.recordSize);
+ this.blockSize = blockSize;
+ this.lenient = lenient;
+
+ TarArchiveEntry entry;
+ while ((entry = getNextTarEntry()) != null) {
+ entries.add(entry);
+ }
+ }
+
+ /**
+ * Get the next entry in this tar archive. This will skip
+ * to the end of the current entry, if there is one,
+ * place the position of the channel at the header of the
+ * next entry, read the header, instantiate a new
+ * TarArchiveEntry from the header bytes and return that entry.
+ * If there are no more entries in the archive, null will
+ * be returned to indicate that the end of the archive has
+ * been reached.
+ *
+ * @return The next TarArchiveEntry in the archive, or null if there is no next entry.
+ * @throws IOException when reading the next TarArchiveEntry fails
+ */
+ private TarArchiveEntry getNextTarEntry() throws IOException {
+ if (isAtEOF()) {
+ return null;
+ }
+
+ if (currEntry != null) {
+ // Skip to the end of the entry
+ archive.position(currEntry.getDataOffset() + currEntry.getSize());
+ throwExceptionIfPositionIsNotInArchive();
+
+ skipRecordPadding();
+ }
+
+ final ByteBuffer headerBuf = getRecord();
+ if (null == headerBuf) {
+ /* hit EOF */
+ currEntry = null;
+ return null;
+ }
+
+ try {
+ currEntry = new TarArchiveEntry(headerBuf.array(), zipEncoding, lenient, archive.position());
+ } catch (final IllegalArgumentException e) {
+ throw new IOException("Error detected parsing the header", e);
+ }
+
+ if (currEntry.isGNULongLinkEntry()) {
+ final byte[] longLinkData = getLongNameData();
+ if (longLinkData == null) {
+ // Bugzilla: 40334
+ // Malformed tar file - long link entry name not followed by
+ // entry
+ return null;
+ }
+ currEntry.setLinkName(zipEncoding.decode(longLinkData));
+ }
+
+ if (currEntry.isGNULongNameEntry()) {
+ final byte[] longNameData = getLongNameData();
+ if (longNameData == null) {
+ // Bugzilla: 40334
+ // Malformed tar file - long entry name not followed by
+ // entry
+ return null;
+ }
+
+ // COMPRESS-509 : the name of directories should end with '/'
+ final String name = zipEncoding.decode(longNameData);
+ currEntry.setName(name);
+ if (currEntry.isDirectory() && !name.endsWith("/")) {
+ currEntry.setName(name + "/");
+ }
+ }
+
+ if (currEntry.isGlobalPaxHeader()) { // Process Global Pax headers
+ readGlobalPaxHeaders();
+ }
+
+ try {
+ if (currEntry.isPaxHeader()) { // Process Pax headers
+ paxHeaders();
+ } else if (!globalPaxHeaders.isEmpty()) {
+ applyPaxHeadersToCurrentEntry(globalPaxHeaders, globalSparseHeaders);
+ }
+ } catch (NumberFormatException e) {
+ throw new IOException("Error detected parsing the pax header", e);
+ }
+
+ if (currEntry.isOldGNUSparse()) { // Process sparse files
+ readOldGNUSparse();
+ }
+
+ return currEntry;
+ }
+
+ /**
+ * Adds the sparse chunks from the current entry to the list of sparse chunks,
+ * including any additional sparse entries following the current entry.
+ *
+ * @throws IOException when reading the sparse entry fails
+ */
+ private void readOldGNUSparse() throws IOException {
+ if (currEntry.isExtended()) {
+ TarArchiveSparseEntry entry;
+ do {
+ final ByteBuffer headerBuf = getRecord();
+ if (headerBuf == null) {
+ currEntry = null;
+ break;
+ }
+ entry = new TarArchiveSparseEntry(headerBuf.array());
+ currEntry.getSparseHeaders().addAll(entry.getSparseHeaders());
+ currEntry.setDataOffset(currEntry.getDataOffset() + recordSize);
+ } while (entry.isExtended());
+ }
+
+ // sparse headers are all done reading, we need to build
+ // sparse input streams using these sparse headers
+ buildSparseInputStreams();
+ }
+
+ /**
+ * Build the input streams consisting of all-zero input streams and non-zero input streams.
+ * When reading from the non-zero input streams, the data is actually read from the original input stream.
+ * The size of each input stream is derived from the sparse headers.
+ *
+ * @implNote Some all-zero input streams and non-zero input streams have a size of 0. We DO NOT store
+ * the 0 size input streams because they are meaningless.
+ * @throws IOException when building the input streams fails
+ */
+ private void buildSparseInputStreams() throws IOException {
+ List<InputStream> streams = new ArrayList<>();
+
+ final List<TarArchiveStructSparse> sparseHeaders = currEntry.getSparseHeaders();
+ // sort the sparse headers in case they are written in wrong order
+ if (sparseHeaders != null && sparseHeaders.size() > 1) {
+ final Comparator<TarArchiveStructSparse> sparseHeaderComparator = new Comparator<TarArchiveStructSparse>() {
+ @Override
+ public int compare(final TarArchiveStructSparse p, final TarArchiveStructSparse q) {
+ return Long.compare(p.getOffset(), q.getOffset());
+ }
+ };
+ Collections.sort(sparseHeaders, sparseHeaderComparator);
+ }
+
+ if (sparseHeaders != null) {
+ // Stream doesn't need to be closed at all as it doesn't use any resources
+ final InputStream zeroInputStream = new TarArchiveSparseZeroInputStream(); //NOSONAR
+ long offset = 0;
+ long numberOfZeroBytesInSparseEntry = 0;
+ for (TarArchiveStructSparse sparseHeader : sparseHeaders) {
+ if (sparseHeader.getOffset() == 0 && sparseHeader.getNumbytes() == 0) {
+ break;
+ }
+
+ if ((sparseHeader.getOffset() - offset) < 0) {
+ throw new IOException("Corrupted struct sparse detected");
+ }
+
+ // only store the input streams with non-zero size
+ if ((sparseHeader.getOffset() - offset) > 0) {
+ long sizeOfZeroByteStream = sparseHeader.getOffset() - offset;
+ streams.add(new BoundedInputStream(zeroInputStream, sizeOfZeroByteStream));
+ numberOfZeroBytesInSparseEntry += sizeOfZeroByteStream;
+ }
+
+ // only store the input streams with non-zero size
+ if (sparseHeader.getNumbytes() > 0) {
+ long start = currEntry.getDataOffset() + sparseHeader.getOffset() - numberOfZeroBytesInSparseEntry;
+ streams.add(new BoundedSeekableByteChannelInputStream(start, sparseHeader.getNumbytes(), archive));
+ }
+
+ offset = sparseHeader.getOffset() + sparseHeader.getNumbytes();
+ }
+ }
+
+ sparseInputStreams.put(currEntry.getName(), streams);
+ }
+
+ /**
+ * Update the current entry with the read pax headers
+ * @param headers Headers read from the pax header
+ * @param sparseHeaders Sparse headers read from pax header
+ */
+ private void applyPaxHeadersToCurrentEntry(final Map<String, String> headers, final List<TarArchiveStructSparse> sparseHeaders) {
+ currEntry.updateEntryFromPaxHeaders(headers);
+ currEntry.setSparseHeaders(sparseHeaders);
+ }
+
+ /**
+ * <p>
+ * For PAX Format 0.0, the sparse headers (GNU.sparse.offset and GNU.sparse.numbytes)
+ * may appear multiple times, and they look like:
+ * <pre>
+ * GNU.sparse.size=size
+ * GNU.sparse.numblocks=numblocks
+ * repeat numblocks times
+ * GNU.sparse.offset=offset
+ * GNU.sparse.numbytes=numbytes
+ * end repeat
+ * </pre>
+ *
+ * <p>
+ * For PAX Format 0.1, the sparse headers are stored in a single variable: GNU.sparse.map
+ * <pre>
+ * GNU.sparse.map
+ * Map of non-null data chunks. It is a string consisting of comma-separated values "offset,size[,offset-1,size-1...]"
+ * </pre>
+ *
+ * <p>
+ * For PAX Format 1.X:
+ * <br>
+ * The sparse map itself is stored in the file data block, preceding the actual file data.
+ * It consists of a series of decimal numbers delimited by newlines. The map is padded with nulls to the nearest block boundary.
+ * The first number gives the number of entries in the map. Following are map entries, each one consisting of two numbers
+ * giving the offset and size of the data block it describes.
+ *
+ * @throws IOException when reading the PAX headers fails
+ */
+ private void paxHeaders() throws IOException {
+ List<TarArchiveStructSparse> sparseHeaders = new ArrayList<>();
+ final Map<String, String> headers;
+ try (final InputStream input = getInputStream(currEntry)) {
+ headers = TarUtils.parsePaxHeaders(input, sparseHeaders, globalPaxHeaders);
+ }
+
+ // for 0.1 PAX Headers
+ if (headers.containsKey("GNU.sparse.map")) {
+ sparseHeaders = TarUtils.parsePAX01SparseHeaders(headers.get("GNU.sparse.map"));
+ }
+ getNextTarEntry(); // Get the actual file entry
+ if (currEntry == null) {
+ throw new IOException("premature end of tar archive. Didn't find
any entry after PAX header.");
+ }
+ applyPaxHeadersToCurrentEntry(headers, sparseHeaders);
+
+ // for 1.0 PAX Format, the sparse map is stored in the file data block
+ if (currEntry.isPaxGNU1XSparse()) {
+ try (final InputStream input = getInputStream(currEntry)) {
+ sparseHeaders = TarUtils.parsePAX1XSparseHeaders(input, recordSize);
+ }
+ currEntry.setSparseHeaders(sparseHeaders);
+ // data of the entry is after the pax gnu entry. So we need to update the data position once again
+ currEntry.setDataOffset(currEntry.getDataOffset() + recordSize);
+ }
+
+ // sparse headers are all done reading, we need to build
+ // sparse input streams using these sparse headers
+ buildSparseInputStreams();
+ }
+
+ private void readGlobalPaxHeaders() throws IOException {
+ try (InputStream input = getInputStream(currEntry)) {
+ globalPaxHeaders = TarUtils.parsePaxHeaders(input, globalSparseHeaders, globalPaxHeaders);
+ }
+ getNextTarEntry(); // Get the actual file entry
+
+ if (currEntry == null) {
+ throw new IOException("Error detected parsing the pax header");
+ }
+ }
+
+ /**
+ * Get the next entry in this tar archive as longname data.
+ *
+ * @return The next entry in the archive as longname data, or null.
+ * @throws IOException on error
+ */
+ private byte[] getLongNameData() throws IOException {
+ final ByteArrayOutputStream longName = new ByteArrayOutputStream();
+ int length;
+ try (final InputStream in = getInputStream(currEntry)) {
+ while ((length = in.read(smallBuf)) >= 0) {
+ longName.write(smallBuf, 0, length);
+ }
+ }
+ getNextTarEntry();
+ if (currEntry == null) {
+ // Bugzilla: 40334
+ // Malformed tar file - long entry name not followed by entry
+ return null;
+ }
+ byte[] longNameData = longName.toByteArray();
+ // remove trailing null terminator(s)
+ length = longNameData.length;
+ while (length > 0 && longNameData[length - 1] == 0) {
+ --length;
+ }
+ if (length != longNameData.length) {
+ final byte[] l = new byte[length];
+ System.arraycopy(longNameData, 0, l, 0, length);
+ longNameData = l;
+ }
+ return longNameData;
+ }
+
+ /**
+ * The last record block should be written at the full size, so skip any
+ * additional space used to fill a record after an entry
+ *
+ * @throws IOException when skipping the padding of the record fails
+ */
+ private void skipRecordPadding() throws IOException {
+ if (!isDirectory() && currEntry.getSize() > 0 && currEntry.getSize() % recordSize != 0) {
+ final long numRecords = (currEntry.getSize() / recordSize) + 1;
+ final long padding = (numRecords * recordSize) - currEntry.getSize();
+ archive.position(archive.position() + padding);
+ throwExceptionIfPositionIsNotInArchive();
+ }
+ }
+
+ /**
+ * Checks if the current position of the SeekableByteChannel is in the archive.
+ * @throws IOException If the position is not in the archive
+ */
+ private void throwExceptionIfPositionIsNotInArchive() throws IOException {
+ if (archive.size() < archive.position()) {
+ throw new IOException("Truncated TAR archive");
+ }
+ }
+
+ /**
+ * Get the next record in this tar archive. This will skip
+ * over any remaining data in the current entry, if there
+ * is one, and position the channel at the header of the
+ * next entry.
+ *
+ * <p>If there are no more entries in the archive, null will be
+ * returned to indicate that the end of the archive has been
+ * reached. At the same time the {@code hasHitEOF} marker will be
+ * set to true.</p>
+ *
+ * @return The next header record in the archive as a ByteBuffer, or null if there is no next record.
+ * @throws IOException when reading the next record fails
+ */
+ private ByteBuffer getRecord() throws IOException {
+ ByteBuffer headerBuf = readRecord();
+ setAtEOF(isEOFRecord(headerBuf));
+ if (isAtEOF() && headerBuf != null) {
+ // Consume rest
+ tryToConsumeSecondEOFRecord();
+ // TODO: This is present in the TarArchiveInputStream but I don't know if we need this in the random access implementation. All tests are passing...
+ // consumeRemainderOfLastBlock();
Review comment:
I added the implementation. The test which checks this for the
`TarArchiveInputStream` was a bit obscure, so I added some documentation to
make it easier to find for the next person.
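
For reference, a minimal sketch of what consuming the remainder of the last
block could look like, assuming the `archive` and `blockSize` fields from the
diff above (the actual code added to the PR may differ):

    // Hedged sketch, not necessarily the PR's implementation: advance the
    // channel to the next multiple of blockSize so that the padding after
    // the EOF records is consumed as well.
    private void consumeRemainderOfLastBlock() throws IOException {
        final long bytesReadOfLastBlock = archive.position() % blockSize;
        if (bytesReadOfLastBlock > 0) {
            archive.position(archive.position() + (blockSize - bytesReadOfLastBlock));
        }
    }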
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 528527)
Time Spent: 7h (was: 6h 50m)
> Random access on Tar archive
> ----------------------------
>
> Key: COMPRESS-540
> URL: https://issues.apache.org/jira/browse/COMPRESS-540
> Project: Commons Compress
> Issue Type: Improvement
> Reporter: Robin Schimpf
> Priority: Major
> Time Spent: 7h
> Remaining Estimate: 0h
>
> The TarArchiveInputStream only provides sequential access. If only a small
> number of files from the archive is needed, a large amount of data in the
> input stream has to be skipped.
> Therefore I have been working on an implementation that provides random
> access to tar files, similar to the ZipFile API (see the usage sketch after
> this list). The basic idea behind the implementation is the following:
> * Random access is backed by a SeekableByteChannel
> * Read all headers of the tar file and save the position of the data for
> every header
> * The user can request an input stream for any entry in the archive multiple
> times
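
A hedged usage sketch of such an API (the TarFile constructor, getEntries()
and getInputStream() are assumed from the pull request and may differ in the
released version):

    import java.io.InputStream;
    import java.nio.file.Paths;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarFile;

    public class TarFileExample {
        public static void main(final String[] args) throws Exception {
            // All headers are parsed up front, so any entry's data can be
            // accessed directly without skipping through earlier entries.
            try (TarFile tarFile = new TarFile(Paths.get("archive.tar"))) {
                for (TarArchiveEntry entry : tarFile.getEntries()) {
                    if ("wanted/file.txt".equals(entry.getName())) {
                        // An input stream for an entry can be requested multiple times.
                        try (InputStream in = tarFile.getInputStream(entry)) {
                            // read only this entry's data
                        }
                    }
                }
            }
        }
    }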
--
This message was sent by Atlassian Jira
(v8.3.4#803005)