Dave Marion created ACCUMULO-2353:
-------------------------------------

             Summary: Test improvements to java.io.InputStream.skip() for possible Hadoop patch
                 Key: ACCUMULO-2353
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2353
             Project: Accumulo
          Issue Type: Task
         Environment: Java 6 update 45 or later
Hadoop 2.2.0
            Reporter: Dave Marion
            Priority: Minor


At some point (early Java 7 I think, then backported to around Java 6 Update
45), the java.io.InputStream.skip() method was changed from skipping through a
fixed byte[512] buffer to a buffer of up to byte[2048]. The difference can be
seen in java.util.zip.DeflaterInputStream, which has not been updated:

{noformat}
    public long skip(long n) throws IOException {
        if (n < 0) {
            throw new IllegalArgumentException("negative skip length");
        }
        ensureOpen();

        // Skip bytes by repeatedly decompressing small blocks
        if (rbuf.length < 512)
            rbuf = new byte[512];

        int total = (int)Math.min(n, Integer.MAX_VALUE);
        long cnt = 0;
        while (total > 0) {
            // Read a small block of uncompressed bytes
            int len = read(rbuf, 0, (total <= rbuf.length ? total : rbuf.length));

            if (len < 0) {
                break;
            }
            cnt += len;
            total -= len;
        }
        return cnt;
    }
{noformat}

and java.io.InputStream in Java 6 Update 45:

{noformat}
    // MAX_SKIP_BUFFER_SIZE is used to determine the maximum buffer size to
    // use when skipping.
    private static final int MAX_SKIP_BUFFER_SIZE = 2048;

    public long skip(long n) throws IOException {

        long remaining = n;
        int nr;

        if (n <= 0) {
            return 0;
        }
        
        int size = (int)Math.min(MAX_SKIP_BUFFER_SIZE, remaining);
        byte[] skipBuffer = new byte[size];

        while (remaining > 0) {
            nr = read(skipBuffer, 0, (int)Math.min(size, remaining));
            
            if (nr < 0) {
                break;
            }
            remaining -= nr;
        }
        
        return n - remaining;
    }
{noformat}

In sample tests I saw about a 20% improvement in skip() when seeking toward
the end of a locally cached compressed file. Looking at Hadoop's
DecompressorStream, its skip() method is a near copy of the old InputStream
method:

{noformat}
  private byte[] skipBytes = new byte[512];
  @Override
  public long skip(long n) throws IOException {
    // Sanity checks
    if (n < 0) {
      throw new IllegalArgumentException("negative skip length");
    }
    checkStream();
    
    // Read 'n' bytes
    int skipped = 0;
    while (skipped < n) {
      int len = Math.min(((int)n - skipped), skipBytes.length);
      len = read(skipBytes, 0, len);
      if (len == -1) {
        eof = true;
        break;
      }
      skipped += len;
    }
    return skipped;
  }
{noformat}
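To make the comparison reproducible, something along these lines can be used as a stand-alone micro-benchmark. This is a rough sketch, not the test I ran: the class name, data size, and use of InflaterInputStream (in place of an HDFS DecompressorStream) are all illustrative.

{noformat}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class SkipBufferDemo {

    // Skip up to n bytes from 'in' by decompressing into a scratch buffer of
    // the given size, mirroring the loop shared by InputStream.skip() and
    // DecompressorStream.skip(). Only the buffer size differs between the
    // old (512) and new (2048) variants.
    static long skipWithBuffer(InputStream in, long n, int bufSize)
            throws IOException {
        byte[] buf = new byte[bufSize];
        long remaining = n;
        while (remaining > 0) {
            int len = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (len < 0) {
                break; // EOF before n bytes were skipped
            }
            remaining -= len;
        }
        return n - remaining;
    }

    public static void main(String[] args) throws IOException {
        // 1 MiB of patterned (compressible) test data, made up for illustration.
        byte[] raw = new byte[1 << 20];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) (i % 251);
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
            dos.write(raw);
        }
        byte[] compressed = bos.toByteArray();

        // Skipping must decompress either way; only the scratch buffer differs.
        for (int bufSize : new int[] {512, 2048}) {
            InputStream in =
                new InflaterInputStream(new ByteArrayInputStream(compressed));
            long t0 = System.nanoTime();
            long skipped = skipWithBuffer(in, raw.length - 10, bufSize);
            long t1 = System.nanoTime();
            System.out.println("buf=" + bufSize + " skipped=" + skipped
                + " in " + (t1 - t0) / 1000 + " us");
        }
    }
}
{noformat}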

This task is to evaluate this change to DecompressorStream, with a possible
patch to Hadoop and a possible bug report to Oracle to port the
InputStream.skip() changes to DeflaterInputStream.skip().
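For reference, a hedged sketch of what the change could look like, expressed here as a stand-alone FilterInputStream wrapper so it compiles on its own. The class name and structure are illustrative only; an actual patch would modify DecompressorStream.skip() itself.

{noformat}
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: applies the 6u45 InputStream.skip() pattern (scratch
// buffer sized from min(n, 2048)) instead of the fixed byte[512] field.
public class PatchedSkipStream extends FilterInputStream {

    private static final int MAX_SKIP_BUFFER_SIZE = 2048;

    public PatchedSkipStream(InputStream in) {
        super(in);
    }

    @Override
    public long skip(long n) throws IOException {
        if (n <= 0) {
            return 0;
        }
        long remaining = n;
        // Size the scratch buffer from the request, capped at 2048 bytes.
        int size = (int) Math.min(MAX_SKIP_BUFFER_SIZE, remaining);
        byte[] skipBuffer = new byte[size];
        while (remaining > 0) {
            int nr = read(skipBuffer, 0, (int) Math.min(size, remaining));
            if (nr < 0) {
                break; // hit EOF before skipping n bytes
            }
            remaining -= nr;
        }
        return n - remaining;
    }

    public static void main(String[] args) throws IOException {
        InputStream s =
            new PatchedSkipStream(new ByteArrayInputStream(new byte[5000]));
        System.out.println(s.skip(4000)); // prints 4000
        System.out.println(s.skip(4000)); // prints 1000 (only 1000 bytes left)
    }
}
{noformat}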



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
