[ https://issues.apache.org/jira/browse/HADOOP-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829993#comment-17829993 ]
ASF GitHub Bot commented on HADOOP-19098:
-----------------------------------------
mukund-thakur commented on code in PR #6604:
URL: https://github.com/apache/hadoop/pull/6604#discussion_r1536087673
##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/VectoredReadUtils.java:
##########
@@ -133,26 +172,42 @@ private static void readNonByteBufferPositionedReadable(PositionedReadable strea
/**
* Read bytes from stream into a byte buffer using an
* intermediate byte array.
- * @param length number of bytes to read.
+ * <pre>
+ * (position, buffer, buffer-offset, length): Void
+ * position:= the position within the file to read data.
+ * buffer := a buffer to read fully `length` bytes into.
+ * buffer-offset := the offset within the buffer to write data
+ * length := the number of bytes to read.
+ * </pre>
+ * The passed-in function MUST block until the required length of
+ * data is read, or an exception is thrown.
+ * @param range range to read
* @param buffer buffer to fill.
* @param operation operation to use for reading data.
* @throws IOException any IOE.
*/
- public static void readInDirectBuffer(int length,
- ByteBuffer buffer,
- Function4RaisingIOE<Integer, byte[], Integer,
- Integer, Void> operation)
- throws IOException {
+ public static void readInDirectBuffer(FileRange range,
+ ByteBuffer buffer,
+ Function4RaisingIOE<Long, byte[], Integer, Integer, Void> operation)
+ throws IOException {
+
+ LOG.debug("Reading {} into a direct buffer", range);
+ validateRangeRequest(range);
Review Comment:
We would have already validated this. Why validate again?
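For context, the `(position, buffer, buffer-offset, length)` contract documented above can be met by delegating to `PositionedReadable.readFully()`. A minimal sketch, not from the PR; the class name and import locations are assumptions:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.PositionedReadable;
import org.apache.hadoop.fs.VectoredReadUtils;

class DirectBufferReadSketch {
  // Satisfy the four-argument read contract by delegating to
  // PositionedReadable.readFully(), which blocks until `length`
  // bytes have been read or an exception is thrown.
  static void readRange(PositionedReadable stream) throws IOException {
    FileRange range = FileRange.createFileRange(1024, 4096); // offset, length
    ByteBuffer direct = ByteBuffer.allocateDirect(range.getLength());
    VectoredReadUtils.readInDirectBuffer(range, direct,
        (position, array, offset, length) -> {
          stream.readFully(position, array, offset, length);
          return null;
        });
  }
}
```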
##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java:
##########
@@ -885,19 +904,26 @@ public int maxReadSizeForVectorReads() {
* @throws IOException IOE if any.
*/
@Override
- public void readVectored(List<? extends FileRange> ranges,
+ public synchronized void readVectored(List<? extends FileRange> ranges,
Review Comment:
Moving to `synchronized`? We support multiple `readVectored()` calls on the
same stream, and that should be fine.
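For reference, the usage pattern in question, two vectored reads in flight on one stream, might look like this sketch (illustrative only; `synchronized` would serialize only the submission calls, not the asynchronous reads):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;

class ConcurrentVectoredReadSketch {
  // Issue two vectored reads on the same stream before either completes,
  // then await each buffer from its range's future.
  static void twoReads(FSDataInputStream in) throws Exception {
    List<FileRange> first = Arrays.asList(FileRange.createFileRange(0, 1024));
    List<FileRange> second = Arrays.asList(FileRange.createFileRange(8192, 1024));
    in.readVectored(first, ByteBuffer::allocate);
    in.readVectored(second, ByteBuffer::allocate);
    ByteBuffer a = first.get(0).getData().get();  // blocks for the first range
    ByteBuffer b = second.get(0).getData().get(); // blocks for the second range
  }
}
```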
##########
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md:
##########
@@ -459,51 +459,119 @@ The position returned by `getPos()` after `readVectored()` is undefined.
If a file is changed while the `readVectored()` operation is in progress, the output is
undefined. Some ranges may have old data, some may have new, and some may have both.
-While a `readVectored()` operation is in progress, normal read api calls may block.
-
-Note: Don't use direct buffers for reading from ChecksumFileSystem as that may
-lead to memory fragmentation explained in HADOOP-18296.
+While a `readVectored()` operation is in progress, normal read API calls MAY block;
+the value of `getPos()` is also undefined. Applications SHOULD NOT make such requests
+while waiting for the results of a vectored read.
+Note: Don't use direct buffers for reading from `ChecksumFileSystem` as that may
+lead to memory fragmentation, as explained in
+[HADOOP-18296](https://issues.apache.org/jira/browse/HADOOP-18296)
+_Memory fragmentation in ChecksumFileSystem Vectored IO implementation_.
#### Preconditions
-For each requested range:
+No empty lists.
+
+```python
+if ranges = null raise NullPointerException
+if ranges.len() = 0 raise IllegalArgumentException
+if allocate = null raise NullPointerException
+```
+
+For each requested range `range[i]` in the list of ranges `range[0..n]` sorted
+on `getOffset()` ascending such that
+
+for all `i where i > 0`:
- range.getOffset >= 0 else raise IllegalArgumentException
- range.getLength >= 0 else raise EOFException
+ range[i].getOffset() > range[i-1].getOffset()
+
+For all ranges `0..n` the preconditions are:
+
+```python
+ranges[i] != null else raise IllegalArgumentException
+ranges[i].getOffset() >= 0 else raise EOFException
+ranges[i].getLength() >= 0 else raise IllegalArgumentException
+if i > 0 and ranges[i].getOffset() < (ranges[i-1].getOffset() + ranges[i-1].getLength()) :
+ raise IllegalArgumentException
+```
+If the length of the file is known during the validation phase:
+
+```python
+if ranges[i].getOffset() + ranges[i].getLength() > data.length() raise EOFException
+```
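Taken together, these preconditions amount to a sort-then-scan check. A hedged Java sketch of an equivalent validation (illustrative, not the actual helper in `VectoredReadUtils`):

```java
import java.io.EOFException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.fs.FileRange;

class RangeValidationSketch {
  // Equivalent of the pseudocode above: sort by offset, then verify each
  // range and reject any overlap with its predecessor.
  static void validate(List<? extends FileRange> ranges) throws EOFException {
    if (ranges == null) throw new NullPointerException("null range list");
    if (ranges.isEmpty()) throw new IllegalArgumentException("empty range list");
    List<FileRange> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingLong(FileRange::getOffset));
    FileRange prev = null;
    for (FileRange r : sorted) {
      if (r == null) throw new IllegalArgumentException("null range");
      if (r.getOffset() < 0) throw new EOFException("negative offset");
      if (r.getLength() < 0) throw new IllegalArgumentException("negative length");
      if (prev != null && r.getOffset() < prev.getOffset() + prev.getLength()) {
        throw new IllegalArgumentException("overlapping ranges");
      }
      prev = r;
    }
  }
}
```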
#### Postconditions
-For each requested range:
+For each requested range `range[i]` in the list of ranges `range[0..n]`
+
+```
+ranges[i]'.getData() = CompletableFuture<buffer: ByteBuffer>
+```
- range.getData() returns CompletableFuture<ByteBuffer> which will have data
- from range.getOffset to range.getLength.
+ and when `getData().get()` completes:
+```
+let buffer = getData().get()
+let len = ranges[i].getLength()
+let data = new byte[len]
+(buffer.limit() - buffer.position()) = len
+buffer.get(data, 0, len) = readFully(ranges[i].getOffset(), data, 0, len)
+```
-### `minSeekForVectorReads()`
+That is: the result of every ranged read is the result of the (possibly asynchronous)
+call to `PositionedReadable.readFully()` for the same offset and length.
+
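To make the postcondition concrete, a minimal consumer sketch (illustrative; exception handling elided):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;

class VectoredReadConsumerSketch {
  // Each completed buffer holds exactly the bytes readFully() would
  // return for the same (offset, length) pair.
  static void readAndAwait(FSDataInputStream in) throws Exception {
    List<FileRange> ranges = Arrays.asList(
        FileRange.createFileRange(0, 100),
        FileRange.createFileRange(4096, 512));
    in.readVectored(ranges, ByteBuffer::allocate);
    for (FileRange r : ranges) {
      ByteBuffer buf = r.getData().get(); // blocks until this range is read
      assert buf.remaining() == r.getLength();
    }
  }
}
```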
+#### `minSeekForVectorReads()`
The smallest reasonable seek. Two ranges won't be merged together if the difference between
end of first and start of next range is more than this value.
-### `maxReadSizeForVectorReads()`
+#### `maxReadSizeForVectorReads()`
Maximum number of bytes which can be read in one go after merging the ranges.
-Two ranges won't be merged if the combined data to be read is more than this value.
+Two ranges won't be merged if the combined data to be read It's okay we have a look at what we do right now for readOkayis more than this value.
Review Comment:
Typo: some random lines, I guess.
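For illustration, the two thresholds combine like this (a sketch; `minSeek` and `maxReadSize` stand for the values returned by the two methods above, and this is not the library's merge code):

```java
class MergeDecisionSketch {
  // Two adjacent sorted ranges merge only if the seek gap between them is
  // small enough and the combined read stays within the size limit.
  static boolean wouldMerge(long prevOffset, int prevLength,
      long nextOffset, int nextLength,
      int minSeek, int maxReadSize) {
    long gap = nextOffset - (prevOffset + prevLength);
    long mergedSize = (nextOffset + nextLength) - prevOffset;
    return gap <= minSeek && mergedSize <= maxReadSize;
  }
}
```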
##########
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md:
##########
@@ -441,9 +441,9 @@ The semantics of this are exactly equivalent to
readFully(position, buffer, 0, len(buffer))
That is, the buffer is filled entirely with the contents of the input source
-from position `position`
+from position `position`.
-### `default void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate)`
+### `void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate)`
Review Comment:
Are we removing the `default`?
> Vector IO: consistent specified rejection of overlapping ranges
> ---------------------------------------------------------------
>
> Key: HADOOP-19098
> URL: https://issues.apache.org/jira/browse/HADOOP-19098
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs, fs/s3
> Affects Versions: 3.3.6
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Labels: pull-request-available
>
> Related to the PARQUET-2171 question: "how do you deal with overlapping ranges?"
> I believe s3a rejects this, but the other impls may not.
> Proposed: FS spec to say
> * "overlap triggers IllegalArgumentException".
> * special case: 0-byte ranges may be short-circuited to return an empty buffer even without checking file length etc.
> Contract tests to validate this (+ common helper code to do this).
> I'll copy the validation stuff into the parquet PR for consistency with older releases.
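A contract-test sketch of the proposed rejection (hypothetical; the class and method names are illustrative, not from the PR):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;

class OverlapRejectionTestSketch {
  // Hypothetical contract check for the proposed behaviour: a range that
  // starts inside its predecessor must be rejected at submission time.
  static void expectOverlapRejected(FSDataInputStream in) throws Exception {
    List<FileRange> overlapping = Arrays.asList(
        FileRange.createFileRange(0, 200),
        FileRange.createFileRange(100, 200)); // starts inside the first range
    try {
      in.readVectored(overlapping, ByteBuffer::allocate);
      throw new AssertionError("expected IllegalArgumentException");
    } catch (IllegalArgumentException expected) {
      // proposed behaviour: overlap rejected before any I/O starts
    }
  }
}
```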