Hi all,

I drafted a PIP on configurable data source priority for offloaded messages. The newest version is at https://gist.github.com/Renkai/e5be927404fbfd8289e7703c55812b1c and the current version is posted below. I hope someone can help review it and make it an official PIP.

Motivation

Currently, once data in Pulsar has been offloaded to the second storage layer, it can still exist in BookKeeper for a period of time, but the broker will read it directly from the second layer. This may lead to several problems:

- Reads from the second layer have different performance characteristics, so users who do not know which layer they are reading from may make wrong estimates.
- The second layer may be managed by a team other than the Pulsar management team (for example, an independent HDFS management team), which may apply its own quota or authorization policies to users.
- Second-layer storage is unbounded in theory, so if a user accidentally sets a cursor to the wrong time, the resulting reads can waste a lot of resources.

So it is better to make the data source configurable when data exists in both layers. The following options are probably enough:

- BOOKKEEPER_ONLY
- BOOKKEEPER_FIRST
- OFFLOADED_ONLY
- OFFLOADED_FIRST

Background

The layer the broker reads from is currently decided by org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl#getLedgerHandle(long ledgerId) <https://github.com/apache/pulsar/blob/a3584309017f1894a05b05c695c42e7aa8b7c3a7/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L1521>, which takes a single parameter, ledgerId, and chooses the offloaded ledger handle as soon as the ledger has been offloaded. If the chosen handle fails, the whole getLedgerHandle call fails.

Implementation

The tiered read priority should be settable per namespace or per topic. The command line tool should look like this:

    pulsar-admin namespaces --set-tiered-read-priority tenant/namespace priority-policy
    pulsar-admin topics --set-tiered-read-priority tenant/namespace/topic priority-policy

If nothing is configured, OFFLOADED_FIRST should be used by default, which gives the same behavior as the current version.

Then the corresponding ManagedLedger should be aware of which priority option the client is using, and the signature of the getLedgerHandle method should be changed to:

    CompletableFuture<ReadHandle> getLedgerHandle(long ledgerId, TieredReadPriority priority)

For BOOKKEEPER_ONLY and OFFLOADED_ONLY, the ManagedLedger will use the corresponding ReadHandle directly. For BOOKKEEPER_FIRST and OFFLOADED_FIRST, the ManagedLedger will try the preferred layer first and fall back to the other layer if that fails, whether because the ledger does not exist in the preferred layer or because of a network, disk, or authorization problem with it.
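
To make the four options concrete, here is a minimal sketch of the selection and fallback logic. It is only an illustration under assumptions, not the actual ManagedLedgerImpl change: the class TieredReadSketch and the methods open and withFallback are hypothetical names, the type parameter H stands in for ReadHandle, and the two suppliers stand in for whatever code opens the ledger from BookKeeper and from the offloader.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// The four options proposed in this PIP.
enum TieredReadPriority {
    BOOKKEEPER_ONLY, BOOKKEEPER_FIRST, OFFLOADED_ONLY, OFFLOADED_FIRST
}

// Hypothetical helper, not part of Pulsar. H stands in for ReadHandle.
final class TieredReadSketch {

    // Picks the layer(s) to read from according to the configured priority.
    static <H> CompletableFuture<H> open(TieredReadPriority priority,
                                         Supplier<CompletableFuture<H>> bookkeeper,
                                         Supplier<CompletableFuture<H>> offloaded) {
        switch (priority) {
            case BOOKKEEPER_ONLY:
                return bookkeeper.get();
            case OFFLOADED_ONLY:
                return offloaded.get();
            case BOOKKEEPER_FIRST:
                return withFallback(bookkeeper, offloaded);
            case OFFLOADED_FIRST:
            default:
                // OFFLOADED_FIRST is the proposed default.
                return withFallback(offloaded, bookkeeper);
        }
    }

    // Tries the preferred layer; on any failure, tries the other layer.
    // The other layer is only opened if the preferred one fails.
    static <H> CompletableFuture<H> withFallback(Supplier<CompletableFuture<H>> preferred,
                                                 Supplier<CompletableFuture<H>> other) {
        CompletableFuture<H> result = new CompletableFuture<>();
        preferred.get().whenComplete((handle, error) -> {
            if (error == null) {
                result.complete(handle);
                return;
            }
            other.get().whenComplete((fallbackHandle, fallbackError) -> {
                if (fallbackError == null) {
                    result.complete(fallbackHandle);
                } else {
                    result.completeExceptionally(fallbackError);
                }
            });
        });
        return result;
    }
}
```

In this sketch the fallback does not distinguish a missing ledger from a network, disk, or authorization error. Any exceptional completion of the preferred layer triggers a read from the other layer, and the caller only sees a failure if both layers fail.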