xintongsong commented on code in PR #22855:
URL: https://github.com/apache/flink/pull/22855#discussion_r1252639172


##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/hybrid/tiered/file/SegmentPartitionFileReader.java:
##########
@@ -42,8 +82,38 @@ public Buffer readBuffer(
             MemorySegment memorySegment,
             BufferRecycler recycler)
             throws IOException {
-        // TODO, implement the HashPartitionFileReader
-        return null;
+        Map<TieredStorageSubpartitionId, Tuple2<ReadableByteChannel, Integer>> subpartitionInfo =
+                openedChannelAndSegmentIds.computeIfAbsent(partitionId, ignore -> new HashMap<>());
+        Tuple2<ReadableByteChannel, Integer> fileChannelAndSegmentId =
+                subpartitionInfo.getOrDefault(subpartitionId, Tuple2.of(null, -1));
+        ReadableByteChannel channel = fileChannelAndSegmentId.f0;
+        if (channel == null || fileChannelAndSegmentId.f1 != segmentId) {
+            if (channel != null) {
+                channel.close();
+            }
+            channel = openNewChannel(partitionId, subpartitionId, segmentId);
+            if (channel == null) {
+                return null;

Review Comment:
   Nothing indicates that `PartitionFileReader#readBuffer` may return `null`:
neither the JavaDoc nor a `@Nullable` annotation says so.
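
   A minimal sketch of how the contract could be made explicit, assuming a
simplified reader interface. `SketchFileReader`, its `byte[]` return type and
the locally declared `@Nullable` are hypothetical stand-ins; in Flink the
annotation would normally be `javax.annotation.Nullable`.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical stand-in for javax.annotation.Nullable, declared only to keep the sketch self-contained. */
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.METHOD, ElementType.PARAMETER})
@interface Nullable {}

/** Simplified, hypothetical reader interface; not the real PartitionFileReader signature. */
interface SketchFileReader {
    /**
     * Reads the next buffer of the given segment.
     *
     * @param segmentId the id of the segment to read from
     * @return the bytes read, or {@code null} if the segment file does not exist yet
     */
    @Nullable
    byte[] readBuffer(int segmentId);
}

class NullableContractSketch {
    /** With the contract documented, callers know they must handle the null case. */
    static int readLength(SketchFileReader reader, int segmentId) {
        byte[] data = reader.readBuffer(segmentId);
        return data == null ? -1 : data.length;
    }
}
```

   Returning `Optional<Buffer>` would be another way to express the same
contract, at the cost of changing the method signature.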



##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/hybrid/tiered/file/SegmentPartitionFileReader.java:
##########
@@ -42,8 +82,38 @@ public Buffer readBuffer(
             MemorySegment memorySegment,
             BufferRecycler recycler)
             throws IOException {
-        // TODO, implement the HashPartitionFileReader
-        return null;
+        Map<TieredStorageSubpartitionId, Tuple2<ReadableByteChannel, Integer>> subpartitionInfo =
+                openedChannelAndSegmentIds.computeIfAbsent(partitionId, ignore -> new HashMap<>());
+        Tuple2<ReadableByteChannel, Integer> fileChannelAndSegmentId =
+                subpartitionInfo.getOrDefault(subpartitionId, Tuple2.of(null, -1));
+        ReadableByteChannel channel = fileChannelAndSegmentId.f0;
+        if (channel == null || fileChannelAndSegmentId.f1 != segmentId) {
+            if (channel != null) {
+                channel.close();
+            }
+            channel = openNewChannel(partitionId, subpartitionId, segmentId);
+            if (channel == null) {
+                return null;
+            }
+            subpartitionInfo.put(subpartitionId, Tuple2.of(channel, segmentId));
+        }
+
+        reusedHeaderBuffer.clear();
+        int bufferHeaderResult = channel.read(reusedHeaderBuffer);
+        if (bufferHeaderResult == -1) {
+            channel.close();
+            openedChannelAndSegmentIds.get(partitionId).remove(subpartitionId);
+            return new NetworkBuffer(memorySegment, recycler, Buffer.DataType.END_OF_SEGMENT);
+        }
+        reusedHeaderBuffer.rewind();
+        BufferHeader header = parseBufferHeader(reusedHeaderBuffer);
+        int dataBufferResult = channel.read(memorySegment.wrap(0, header.getLength()));
+        if (dataBufferResult == -1) {
+            throw new IOException("Empty data buffer is read.");
+        }
+        Buffer.DataType dataType = header.getDataType();
+        return new NetworkBuffer(
+                memorySegment, recycler, dataType, header.isCompressed(), header.getLength());

Review Comment:
   This is a long piece of code (over 30 lines). It would be nice to group it
into a few major steps with comments.
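
   As an illustration of the kind of grouping meant here, a minimal sketch
that reads one length-prefixed record from a channel in three commented steps.
The 4-byte header layout, `StepwiseReadSketch` and `readFully` are assumptions
made for this example and do not mirror the actual `BufferHeader` format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

class StepwiseReadSketch {
    /** Hypothetical header layout for this sketch: a single 4-byte payload length. */
    static final int HEADER_LENGTH = 4;

    /** Returns the next payload, or null once the channel is exhausted. */
    static byte[] readRecord(ReadableByteChannel channel) throws IOException {
        // Step 1: read the fixed-size header. End-of-stream here means there are no more
        // records (for brevity, a partially read header is treated the same way).
        ByteBuffer header = ByteBuffer.allocate(HEADER_LENGTH);
        if (!readFully(channel, header)) {
            return null;
        }
        header.flip();

        // Step 2: parse the header to learn the payload length.
        int length = header.getInt();

        // Step 3: read the payload. End-of-stream here means the file is truncated.
        ByteBuffer payload = ByteBuffer.allocate(length);
        if (!readFully(channel, payload)) {
            throw new IOException("Unexpected end of file while reading the payload.");
        }
        return payload.array();
    }

    /** Reads until the buffer is full; returns false if end-of-stream is hit first. */
    private static boolean readFully(ReadableByteChannel channel, ByteBuffer buffer)
            throws IOException {
        while (buffer.hasRemaining()) {
            if (channel.read(buffer) == -1) {
                return false;
            }
        }
        return true;
    }
}
```

   The loop in `readFully` also covers the case where a single
`channel.read(...)` call returns fewer bytes than requested.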



##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/hybrid/tiered/tier/remote/RemoteTierFactory.java:
##########
@@ -89,8 +90,12 @@ public TierProducerAgent createProducerAgent(
     @Override
     public TierConsumerAgent createConsumerAgent(
             List<TieredStorageConsumerSpec> tieredStorageConsumerSpecs,
-            TieredStorageNettyService nettyService) {
-        // TODO, implement the remote tier consumer agent
-        return null;
+            TieredStorageNettyService nettyService,
+            RemoteStorageScanner remoteStorageScanner) {

Review Comment:
   I don't think `RemoteStorageScanner` should be an argument for this common 
interface.
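
   One possible shape, sketched under heavy simplification: the remote
factory receives what it needs through its own constructor and builds the
scanner itself, so the shared interface stays tier-agnostic. All types below
(`TierFactorySketch`, `ScannerSketch`, and so on) are hypothetical stand-ins,
not the real Flink classes.

```java
/** Hypothetical, heavily simplified stand-in for a consumer agent. */
interface ConsumerAgent {
    String describe();
}

/** The common factory interface stays free of tier-specific arguments. */
interface TierFactorySketch {
    ConsumerAgent createConsumerAgent(String nettyService);
}

/** Stand-in for a remote-only dependency such as a storage scanner. */
class ScannerSketch {
    final String basePath;

    ScannerSketch(String basePath) {
        this.basePath = basePath;
    }
}

/** Only the remote tier knows about the scanner and creates it itself. */
class RemoteTierFactorySketch implements TierFactorySketch {
    private final String remoteStorageBasePath;

    RemoteTierFactorySketch(String remoteStorageBasePath) {
        this.remoteStorageBasePath = remoteStorageBasePath;
    }

    @Override
    public ConsumerAgent createConsumerAgent(String nettyService) {
        ScannerSketch scanner = new ScannerSketch(remoteStorageBasePath);
        return () -> "remote agent scanning " + scanner.basePath + " via " + nettyService;
    }
}
```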



##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/hybrid/tiered/storage/TieredStorageConsumerClient.java:
##########
@@ -74,46 +76,54 @@ public void start() {
 
     public Optional<Buffer> getNextBuffer(
             TieredStoragePartitionId partitionId, TieredStorageSubpartitionId subpartitionId) {
-        Tuple2<TierConsumerAgent, Integer> currentConsumerAgentAndSegmentId =
-                currentConsumerAgentAndSegmentIds
-                        .computeIfAbsent(partitionId, ignore -> new HashMap<>())
-                        .getOrDefault(subpartitionId, Tuple2.of(null, 0));
+        Map<TieredStorageSubpartitionId, Tuple3<TierConsumerAgent, Integer, Integer>>
+                subpartitionInfo =
+                        currentConsumerInfo.computeIfAbsent(partitionId, ignore -> new HashMap<>());
+        Tuple3<TierConsumerAgent, Integer, Integer> agentInfo =
+                subpartitionInfo.getOrDefault(subpartitionId, Tuple3.of(null, 0, 0));
         Optional<Buffer> buffer = Optional.empty();
-        if (currentConsumerAgentAndSegmentId.f0 == null) {
+        if (agentInfo.f0 == null) {
             for (TierConsumerAgent tierConsumerAgent : tierConsumerAgents) {
-                buffer =
-                        tierConsumerAgent.getNextBuffer(
-                                partitionId, subpartitionId, currentConsumerAgentAndSegmentId.f1);
+                buffer = tierConsumerAgent.getNextBuffer(partitionId, subpartitionId, agentInfo.f1);
                 if (buffer.isPresent()) {
-                    currentConsumerAgentAndSegmentIds
-                            .get(partitionId)
-                            .put(
-                                    subpartitionId,
-                                    Tuple2.of(
-                                            tierConsumerAgent,
-                                            currentConsumerAgentAndSegmentId.f1));
+                    agentInfo.setField(tierConsumerAgent, 0);
                     break;
                 }
             }
         } else {
-            buffer =
-                    currentConsumerAgentAndSegmentId.f0.getNextBuffer(
-                            partitionId, subpartitionId, currentConsumerAgentAndSegmentId.f1);
+            buffer = agentInfo.f0.getNextBuffer(partitionId, subpartitionId, agentInfo.f1);
         }
         if (!buffer.isPresent()) {
             return Optional.empty();
         }
         Buffer bufferData = buffer.get();
+        if (retriever != null) {
+            retriever.retrieveAvailableAndPriority(
+                    partitionId,
+                    subpartitionId,
+                    bufferData.getDataType().hasPriority(),
+                    agentInfo.f2);
+        }
+        agentInfo.setField(agentInfo.f2 + 1, 2);
+        subpartitionInfo.put(subpartitionId, agentInfo);
         if (bufferData.getDataType() == Buffer.DataType.END_OF_SEGMENT) {
-            currentConsumerAgentAndSegmentIds
-                    .get(partitionId)
-                    .put(subpartitionId, Tuple2.of(null, currentConsumerAgentAndSegmentId.f1 + 1));
+            agentInfo.setField(null, 0);
+            agentInfo.setField(agentInfo.f1 + 1, 1);
+            subpartitionInfo.put(subpartitionId, agentInfo);
             bufferData.recycleBuffer();
             return getNextBuffer(partitionId, subpartitionId);
         }
         return Optional.of(bufferData);
     }
 
+    public void registerAvailabilityAndPriorityRetriever(
+            AvailabilityAndPriorityRetriever retriever) {
+        this.retriever = retriever;
+        if (remoteStorageScanner != null) {
+            remoteStorageScanner.registerAvailabilityAndPriorityRetriever(retriever);
+        }
+    }

Review Comment:
   1. The name is confusing. I'd suggest using `listener` rather than
`retriever`.
   2. The argument type is an inner class of `SingleInputGate`.
   3. I think this entire change (i.e., introducing the availability and
priority mechanism as part of the tiered storage client protocol, and migrating
the netty-related implementations) should be a separate commit prior to
implementing the remote tier consumer agent.
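
   For points 1 and 2, a rough sketch of a top-level callback with listener
naming. `AvailabilityListenerSketch`, `notifyAvailable` and
`ConsumerClientSketch` are hypothetical names chosen for illustration, and the
`int` subpartition id is a simplification.

```java
/** Hypothetical top-level callback; name and signature are assumptions, not the PR's API. */
interface AvailabilityListenerSketch {
    void notifyAvailable(int subpartitionId, boolean hasPriority);
}

/** Simplified client that depends only on the top-level listener type. */
class ConsumerClientSketch {
    private AvailabilityListenerSketch listener;

    void registerAvailabilityListener(AvailabilityListenerSketch listener) {
        this.listener = listener;
    }

    /** Called whenever a buffer becomes available for the given subpartition. */
    void onBuffer(int subpartitionId, boolean hasPriority) {
        if (listener != null) {
            listener.notifyAvailable(subpartitionId, hasPriority);
        }
    }
}
```

   With this shape, the input gate would implement or pass in the listener,
and the tiered storage classes would not need to import anything from
`SingleInputGate`.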



##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/hybrid/tiered/tier/remote/RemoteStorageScanner.java:
##########
@@ -0,0 +1,234 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.io.network.partition.hybrid.tiered.tier.remote;
+
+import org.apache.flink.api.java.tuple.Tuple3;
+import org.apache.flink.core.fs.FileStatus;
+import org.apache.flink.core.fs.FileSystem;
+import org.apache.flink.core.fs.Path;
+import org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.AvailabilityAndPriorityRetriever;
+import org.apache.flink.runtime.io.network.partition.hybrid.tiered.common.TieredStoragePartitionId;
+import org.apache.flink.runtime.io.network.partition.hybrid.tiered.common.TieredStorageSubpartitionId;
+import org.apache.flink.util.ExceptionUtils;
+
+import org.apache.flink.shaded.guava31.com.google.common.util.concurrent.ThreadFactoryBuilder;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Queue;
+import java.util.concurrent.Executors;
+import java.util.concurrent.LinkedBlockingDeque;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+
+import static org.apache.flink.runtime.io.network.partition.hybrid.tiered.file.SegmentPartitionFile.getSegmentFinishDirPath;
+import static org.apache.flink.util.Preconditions.checkArgument;
+import static org.apache.flink.util.Preconditions.checkNotNull;
+import static org.apache.flink.util.Preconditions.checkState;
+
+/**
+ * The {@link RemoteStorageScanner} is introduced to notify asynchronously for file reading on
+ * remote storage. Asynchronous notifications will prevent {@link RemoteTierConsumerAgent} from
+ * repeatedly attempting to read remote files and reduce CPU consumption.
+ *
+ * <p>It will be invoked by {@link RemoteTierConsumerAgent} to register the required segment id and
+ * monitor the existence status of related segment files. If the segment file is found, it will
+ * trigger the reading process of the file.
+ */
+public class RemoteStorageScanner implements Runnable {
+
+    /** The initial scan interval is 500ms. */
+    private static final int INITIAL_SCAN_INTERVAL = 500;
+
+    /** The max scan interval is 10000ms. */
+    private static final int MAX_SCAN_INTERVAL = 10_000;
+
+    /** Executor to scan the existence status of segment files on remote storage. */
+    private final ScheduledExecutorService scannerExecutor =
+            Executors.newSingleThreadScheduledExecutor(
+                    new ThreadFactoryBuilder()
+                            .setNameFormat("remote storage file scanner")
+                            .build());
+
+    /**
+     * Tuple3 containing partition id and subpartition id and segment id. The tuple3 is stored in
+     * queue and indicates the required segment file.
+     */
+    private final Queue<Tuple3<TieredStoragePartitionId, TieredStorageSubpartitionId, Integer>>
+            requiredSegmentIds;
+
+    /**
+     * The key is partition id and subpartition id, the value is max id of written segment files in
+     * the subpartition.
+     */
+    private final Map<TieredStoragePartitionId, Map<TieredStorageSubpartitionId, Integer>>
+            scannedMaxSegmentIds;
+
+    private final String baseRemoteStoragePath;
+
+    private final ScanStrategy scanStrategy;
+
+    private FileSystem remoteFileSystem;
+
+    private AvailabilityAndPriorityRetriever retriever;
+
+    private int attemptNumber = 0;
+
+    public RemoteStorageScanner(String baseRemoteStoragePath) {
+        this.baseRemoteStoragePath = baseRemoteStoragePath;
+        this.requiredSegmentIds = new LinkedBlockingDeque<>();
+        this.scannedMaxSegmentIds = new HashMap<>();
+        this.scanStrategy = new ScanStrategy(INITIAL_SCAN_INTERVAL, MAX_SCAN_INTERVAL);
+        try {
+            this.remoteFileSystem = new Path(baseRemoteStoragePath).getFileSystem();
+        } catch (IOException e) {
+            ExceptionUtils.rethrow(
+                    e, "Failed to initialize file system on the path: " + baseRemoteStoragePath);
+        }
+    }
+
+    /** Start the executor. */
+    public void start() {
+        scannerExecutor.schedule(
+                this, scanStrategy.getInterval(attemptNumber), TimeUnit.MILLISECONDS);
+    }
+
+    /**
+     * Register a segment id to the {@link RemoteStorageScanner}. If the scanner discovers the
+     * segment file exists, it will trigger the reading process of the segment file.
+     *
+     * @param partitionId is the id of partition.
+     * @param subpartitionId is the id of subpartition.
+     * @param segmentId is the id of segment.
+     */
+    public void registerSegmentId(
+            TieredStoragePartitionId partitionId,
+            TieredStorageSubpartitionId subpartitionId,
+            int segmentId) {
+        requiredSegmentIds.add(Tuple3.of(partitionId, subpartitionId, segmentId));
+    }
+
+    /** Close the executor. */
+    public void close() {
+        scannerExecutor.shutdownNow();
+    }
+
+    /** Iterate the registered segment ids and check related file status. */
+    @Override
+    public void run() {
+        int segmentFileNumber = requiredSegmentIds.size();
+        boolean scanned = false;
+        while (segmentFileNumber-- > 0) {
+            Tuple3<TieredStoragePartitionId, TieredStorageSubpartitionId, Integer> ids =
+                    checkNotNull(requiredSegmentIds.poll());
+            TieredStoragePartitionId partitionId = ids.f0;
+            TieredStorageSubpartitionId subpartitionId = ids.f1;
+            int requiredSegmentId = ids.f2;
+            int maxSegmentId =
+                    scannedMaxSegmentIds
+                            .computeIfAbsent(partitionId, ignore -> new HashMap<>())
+                            .getOrDefault(subpartitionId, -1);
+            if (maxSegmentId >= requiredSegmentId) {
+                scanned = true;
+                retriever.retrieveAvailableAndPriority(partitionId, subpartitionId, false, null);
+            } else {
+                scanMaxSegmentId(partitionId, subpartitionId);
+                requiredSegmentIds.add(ids);

Review Comment:
   Again, what if the segment is no longer needed?
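
   One way this could be handled, sketched under assumptions: keep the watched
segments in a map keyed by subpartition instead of re-enqueueing them
unconditionally, so that a request can be replaced or cancelled. The class
name, the string key and `unwatch` are hypothetical and not part of the PR.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntPredicate;

class SegmentWatcherSketch {
    /** Key is a hypothetical "partition/subpartition" string; value is the segment id wanted. */
    private final Map<String, Integer> watchedSegments = new ConcurrentHashMap<>();

    void watchSegment(String subpartitionKey, int segmentId) {
        // A newer request for the same subpartition replaces the stale one.
        watchedSegments.put(subpartitionKey, segmentId);
    }

    /** Cancels the watch, e.g. when the consumer is released or reads from another tier. */
    void unwatch(String subpartitionKey) {
        watchedSegments.remove(subpartitionKey);
    }

    /** One scan round; returns how many watched segments were found and dropped from the map. */
    int scanOnce(IntPredicate segmentExists) {
        int notified = 0;
        for (Map.Entry<String, Integer> entry : watchedSegments.entrySet()) {
            // remove(key, value) only succeeds if the entry was not replaced or cancelled meanwhile.
            if (segmentExists.test(entry.getValue())
                    && watchedSegments.remove(entry.getKey(), entry.getValue())) {
                notified++;
            }
        }
        return notified;
    }
}
```

   Entries removed through `unwatch` are never visited again, so a segment
that is no longer needed does not keep the scanner polling forever.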



##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/hybrid/tiered/tier/remote/RemoteStorageScanner.java:
##########
@@ -0,0 +1,234 @@
+public class RemoteStorageScanner implements Runnable {
+
+    /** The initial scan interval is 500ms. */
+    private static final int INITIAL_SCAN_INTERVAL = 500;
+
+    /** The max scan interval is 10000ms. */
+    private static final int MAX_SCAN_INTERVAL = 10_000;

Review Comment:
   A good practice is to always add the unit as part of the name for 
unit-sensitive variables such as time and memory.
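
   For example, either the name or the type can carry the unit. The constants
and the doubling backoff below are a hypothetical sketch; the PR's actual
`ScanStrategy` is not shown here and may behave differently.

```java
import java.time.Duration;

class ScanIntervalSketch {
    /** Option 1: carry the unit in the name. */
    static final int INITIAL_SCAN_INTERVAL_MS = 500;

    static final int MAX_SCAN_INTERVAL_MS = 10_000;

    /** Option 2: carry the unit in the type. */
    static final Duration MAX_SCAN_INTERVAL = Duration.ofSeconds(10);

    /** Doubles the interval per attempt, capped at the maximum (hypothetical backoff policy). */
    static long intervalMs(int attempt) {
        // Cap the shift so that the multiplication cannot overflow.
        long interval = (long) INITIAL_SCAN_INTERVAL_MS << Math.min(attempt, 20);
        return Math.min(interval, MAX_SCAN_INTERVAL_MS);
    }
}
```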



##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/hybrid/tiered/tier/remote/RemoteStorageScanner.java:
##########
@@ -0,0 +1,234 @@
+    /**
+     * Register a segment id to the {@link RemoteStorageScanner}. If the scanner discovers the
+     * segment file exists, it will trigger the reading process of the segment file.
+     *
+     * @param partitionId is the id of partition.
+     * @param subpartitionId is the id of subpartition.
+     * @param segmentId is the id of segment.
+     */
+    public void registerSegmentId(

Review Comment:
   I'd suggest the name `watchSegment`.


