I'm seeing a problem using the S3AFileSystem where I get a premature EOF, in 
particular when reading Parquet files using projection. The projected read 
causes a backward seek on the S3A input stream, which triggers a bug where the 
input stream will attempt to use the "random" input policy and eventually 
(depending on the file and the amount of data read) read past the end of its 
readahead buffer without reopening the stream, resulting in an EOF.
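
To make the failure mode concrete, here is a toy model of how I understand the 
random-policy stream to behave; the names and range arithmetic are my guesses 
at the behavior, not the actual S3AInputStream code:

import java.io.EOFException;

// Toy model of the suspected failure (my assumptions, not real S3A code).
class BoundedRangeStreamModel {
    static final long READAHEAD = 1024;  // fs.s3a.readahead.range = 1K
    long pos;                            // current position in the object
    long rangeFinish;                    // end of the currently open GET range

    // Guess: in "random" mode each (re)open requests only a bounded range.
    void reopen(long targetPos, long requestedLen) {
        pos = targetPos;
        rangeFinish = targetPos + Math.max(READAHEAD, requestedLen);
    }

    // Guess: a read at or beyond rangeFinish fails with EOF instead of
    // triggering another reopen, which is what my sample program hits.
    int read() throws EOFException {
        if (pos >= rangeFinish) {
            throw new EOFException("read past rangeFinish without a reopen");
        }
        pos++;
        return 0;  // stand-in for the real byte
    }
}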

I haven't seen this issue reported anywhere. I'm wondering whether this is 
worth a fix (it looks like the stream-reopening behavior just needs to be more 
aggressive) or if it's better to retrieve the whole file sequentially before 
attempting to parse it (I was surprised it works at all).
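
For what it's worth, the sequential-retrieval workaround I mean would look 
roughly like this; the bucket and local destination paths are just 
placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Workaround sketch: fetch the whole object sequentially and parse the local
// copy, so no backward seeks ever hit the S3A stream.
final Configuration conf = new Configuration();
final Path remote = new Path("s3a://some-bucket/data.parquet");  // placeholder
final FileSystem fs = FileSystem.get(remote.toUri(), conf);
fs.copyToLocalFile(remote, new Path("/tmp/data.parquet"));       // placeholder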


I've written a sample program that illustrates the problem given a path in S3 
(no Parquet involved):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final Configuration conf = new Configuration();
conf.set("fs.s3a.readahead.range", "1K");
conf.set("fs.s3a.experimental.input.fadvise", "random");

// "path" is a Path to any S3 object at least 1029 bytes long
final FileSystem fs = FileSystem.get(path.toUri(), conf);

// forward seek reading across the readahead boundary
try (FSDataInputStream in = fs.open(path)) {
    final byte[] temp = new byte[5];
    in.readByte();            // advance the stream position past 0
    in.readFully(1023, temp); // <-- works
}

// forward seek reading from the end of the readahead boundary
try (FSDataInputStream in = fs.open(path)) {
    final byte[] temp = new byte[5];
    in.readByte();            // advance the stream position past 0
    in.readFully(1024, temp); // <-- throws EOFException
}


Regards, Dave Christianson
