sijie closed pull request #1762: [WIP] Tiered storage documentation
URL: https://github.com/apache/incubator-pulsar/pull/1762
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/site/_data/cli/pulsar-admin.yaml b/site/_data/cli/pulsar-admin.yaml
index 21806199f9..147a84de6c 100644
--- a/site/_data/cli/pulsar-admin.yaml
+++ b/site/_data/cli/pulsar-admin.yaml
@@ -490,6 +490,19 @@ commands:
- flags: -w, --wait-complete
description: Wait for compaction to complete
default: 'false'
+ - name: offload
+ description: Trigger offload of data from a topic to long-term storage
(e.g. Amazon S3)
+ argument: "persistent://tenant/namespace/topic"
+ options:
+ - flags: -s, --size-threshold
+ description: The maximum amount of data to keep in BookKeeper for the
specific topic
+ - name: offload-status
+ description: Check the status of data offloading from a topic to long-term
storage
+ argument: "persistent://tenant/namespace/topic"
+ options:
+ - flags: -w, --wait-complete
+ description: Wait for offloading to complete
+ default: 'false'
- name: create-partitioned-topic
description: Create a partitioned topic. A partitioned topic must be
created before producers can publish to it.
argument: "{persistent|non-persistent}://tenant/namespace/topic"
diff --git a/site/_data/config/broker.yaml b/site/_data/config/broker.yaml
index 3b0118083a..dad70866e1 100644
--- a/site/_data/config/broker.yaml
+++ b/site/_data/config/broker.yaml
@@ -314,3 +314,21 @@ configs:
- name: loadManagerClassName
default: org.apache.pulsar.broker.loadbalance.impl.SimpleLoadManagerImpl
description: Name of load manager to use
+- name: managedLedgerOffloadDriver
+ description: |
+ The name of the driver used for tiered storage offload. Current options:
`S3`.
+- name: managedLedgerOffloadMaxThreads
+ description: The number of threads used for tiered storage offloading
+ default: 2
+- name: s3ManagedLedgerOffloadRegion
+ description: The AWS region used for tiered storage ledger offload (if
`managedLedgerOffloadDriver` is set to `S3`)
+- name: s3ManagedLedgerOffloadBucket
+ description: The AWS bucket used for tiered storage ledger offload (if
`managedLedgerOffloadDriver` is set to `S3`)
+- name: s3ManagedLedgerOffloadServiceEndpoint
+ description: The alternative service endpoint used for Amazon S3 tiered
storage offload, which can be useful for testing (if
`managedLedgerOffloadDriver` is set to `S3`)
+- name: s3ManagedLedgerOffloadMaxBlockSizeInBytes
+ description: The maximum block size for Amazon S3 ledger offloading (in
bytes)
+ default: 67108864 (64 MB)
+- name: s3ManagedLedgerOffloadReadBufferSizeInBytes
+ description: The read buffer size for Amazon S3 ledger offloading (in bytes)
+ default: 1048576 (1 MB)
\ No newline at end of file
diff --git a/site/_data/sidebar.yaml b/site/_data/sidebar.yaml
index 950d2dad0b..1eb862e097 100644
--- a/site/_data/sidebar.yaml
+++ b/site/_data/sidebar.yaml
@@ -138,6 +138,8 @@ groups:
endpoint: message-deduplication
- title: Non-persistent messaging
endpoint: non-persistent-messaging
+ - title: Tiered storage
+ endpoint: tiered-storage
- title: Partitioned topics
endpoint: PartitionedTopics
- title: Retention and expiry
diff --git a/site/docs/latest/cookbooks/tiered-storage.md
b/site/docs/latest/cookbooks/tiered-storage.md
new file mode 100644
index 0000000000..10098d9d18
--- /dev/null
+++ b/site/docs/latest/cookbooks/tiered-storage.md
@@ -0,0 +1,50 @@
+---
+title: Tiered storage cookbook
+tags: [tiered storage, s3, bookkeeper]
+---
+
+Pulsar's **tiered storage** feature enables you to offload message data from
[BookKeeper](https://bookkeeper.apache.org) to another system. This cookbook
walks you through using tiered storage in your Pulsar cluster.
+
+{% include admonition.html type="info" content="For a higher-level, more
theoretical perspective on tiered storage, see the [Concepts and
Architecture](../../getting-started/ConceptsAndArchitecture#tiered-storage)
documentation. For a guide to creating a custom tiered storage driver, see the
[Custom tiered storage](../../project/tiered-storage) documentation." %}
+
+## Supported tiered storage targets {#targets}
+
+Pulsar currently supports the following tiered storage targets:
+
+* [Amazon Simple Storage Service (S3)](#s3)
+
+## Configuration {#config}
+
+To use tiered storage in Pulsar, you need to adjust the
{% popover broker %}-level configuration on each broker in your cluster. Broker
configuration is set in the
[`broker.conf`](../../reference/Configuration#broker) file in the `conf`
directory of your Pulsar installation. First, specify a ledger offload driver
by setting the `managedLedgerOffloadDriver` parameter to the driver's
configuration name.
+
+The following drivers are available:
+
+Driver | Configuration name
+:------|:------------------
+[Amazon S3](#s3) | `S3`
+
+In addition to specifying a driver, you can also specify the number of threads
used by ledger-offloading-related processes using the
`managedLedgerOffloadMaxThreads` parameter. The default is 2.
+
+Here's an example configuration:
+
+```conf
+# Other configs
+managedLedgerOffloadDriver=S3
+managedLedgerOffloadMaxThreads=5
+```
+
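As a rough illustration of how the two settings above could be read and checked, here is a minimal Java sketch. `OffloadConfigSketch` is a hypothetical helper and not part of Pulsar; the rule that the thread count must be at least 1 is an assumption. Only the parameter names and the default of 2 come from the text above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper (not a Pulsar class): parses broker.conf-style text
// and extracts the two general tiered storage settings.
class OffloadConfigSketch {
    static final int DEFAULT_MAX_THREADS = 2; // default stated in the docs

    final String driver;   // null when no offload driver is configured
    final int maxThreads;

    OffloadConfigSketch(String driver, int maxThreads) {
        this.driver = driver;
        this.maxThreads = maxThreads;
    }

    static OffloadConfigSketch parse(String confText) {
        Properties props = new Properties();
        try {
            // Properties.load skips '#' comment lines and reads key=value pairs
            props.load(new StringReader(confText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String driver = props.getProperty("managedLedgerOffloadDriver");
        String threads = props.getProperty("managedLedgerOffloadMaxThreads");
        int maxThreads = (threads == null)
                ? DEFAULT_MAX_THREADS
                : Integer.parseInt(threads.trim());
        if (maxThreads < 1) {
            // Assumption: a non-positive thread count is never meaningful
            throw new IllegalArgumentException(
                    "managedLedgerOffloadMaxThreads must be at least 1");
        }
        return new OffloadConfigSketch(
                driver == null ? null : driver.trim(), maxThreads);
    }

    boolean tieredStorageEnabled() {
        return driver != null && !driver.isEmpty();
    }
}
```

Calling `OffloadConfigSketch.parse(...)` on the example configuration yields a driver of `S3` and five threads; leaving out `managedLedgerOffloadMaxThreads` falls back to the default.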
+## Amazon Simple Storage Service (S3) {#s3}
+
+In order to use the Amazon S3 ledger offloader for tiered storage, you need to
set the `managedLedgerOffloadDriver` parameter to `S3` in your [broker
configuration](#config). The following S3-specific parameters are also
available.
+
+Parameter | Description | Default
+:---------|:------------|:-------
+`s3ManagedLedgerOffloadRegion` | The AWS region used for tiered storage ledger
offload |
+`s3ManagedLedgerOffloadBucket` | The AWS bucket used for tiered storage ledger
offload |
+`s3ManagedLedgerOffloadServiceEndpoint` | The alternative service endpoint
used for Amazon S3 tiered storage offload, which can be useful for testing |
+`s3ManagedLedgerOffloadMaxBlockSizeInBytes` | The maximum block size for
Amazon S3 ledger offloading (in bytes). The minimum allowed value is 5 MB
(5242880 bytes). | 67108864 (64 MB)
+`s3ManagedLedgerOffloadReadBufferSizeInBytes` | The read buffer size for
Amazon S3 ledger offloading (in bytes) | 1048576 (1 MB)
+
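The block size limits in the table can be expressed as a small check. The Java sketch below is illustrative only: `S3BlockSizeSketch` is a hypothetical helper, and the block-count estimate ignores any per-block header or index overhead, which is an assumption rather than a description of how the S3 offloader lays out data.

```java
// Hypothetical helper (not a Pulsar class) for reasoning about
// s3ManagedLedgerOffloadMaxBlockSizeInBytes.
class S3BlockSizeSketch {
    static final long MIN_BLOCK_SIZE = 5L * 1024 * 1024;      // 5242880 bytes
    static final long DEFAULT_BLOCK_SIZE = 64L * 1024 * 1024; // 67108864 bytes

    // Rejects values below the documented 5 MB minimum.
    static long validateBlockSize(long configured) {
        if (configured < MIN_BLOCK_SIZE) {
            throw new IllegalArgumentException(
                    "block size must be at least " + MIN_BLOCK_SIZE + " bytes");
        }
        return configured;
    }

    // Rough estimate of how many blocks a ledger of the given size needs,
    // assuming blocks carry payload only (ceiling division).
    static long estimateBlocks(long ledgerBytes, long maxBlockSize) {
        validateBlockSize(maxBlockSize);
        if (ledgerBytes < 0) {
            throw new IllegalArgumentException("ledger size must not be negative");
        }
        return (ledgerBytes + maxBlockSize - 1) / maxBlockSize;
    }
}
```

Under these assumptions, a larger block size means fewer, bigger uploads per ledger, while a smaller one means more uploads of smaller objects.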
+## Creating your own driver
+
+The [Amazon S3 driver](#s3) for tiered storage is the only driver that's
currently available. You can also, however, create and run your own driver by
following the instructions in the [Custom tiered
storage](../../project/tiered-storage) documentation.
\ No newline at end of file
diff --git a/site/docs/latest/getting-started/ConceptsAndArchitecture.md
b/site/docs/latest/getting-started/ConceptsAndArchitecture.md
index 2965f072d1..a19047f7ca 100644
--- a/site/docs/latest/getting-started/ConceptsAndArchitecture.md
+++ b/site/docs/latest/getting-started/ConceptsAndArchitecture.md
@@ -346,6 +346,42 @@ In BookKeeper, *journal* files contain BookKeeper
transaction logs. Before makin
A future version of BookKeeper will support *non-persistent messaging* and
thus multiple durability modes at the topic level. This will enable you to set
the durability mode at the topic level, replacing the `persistent` in topic
names with a `non-persistent` indicator.
+## Tiered storage
+
+By default, Pulsar uses [Apache BookKeeper](https://bookkeeper.apache.org) for
all [persistent message storage](#persistent-storage). BookKeeper is an ideal
system to be used for this purpose for a [variety of
reasons](https://streaml.io/blog/messaging-storage-or-both/), but BookKeeper
storage can get expensive over time. Fortunately, Pulsar also offers a **tiered
storage** capability that enables you to utilize multiple storage systems for
Pulsar message data:
+
+* BookKeeper for more recent data
+* Another system for older data
+
+With tiered storage, you can determine what counts as "older" and "more
recent" via configuration. Tiered storage in Pulsar works via a process called
**ledger offloading**. With ledger offloading, Pulsar {% popover brokers %}
copy sealed ledger segments from BookKeeper to the external storage system and
then delete them from BookKeeper.
+
+The following tiered storage offload targets are currently supported:
+
+* [Amazon's Simple Storage Service (S3)](https://aws.amazon.com/s3) (usage
docs [here](../../cookbooks/tiered-storage))
+
+### Why tiered storage?
+
+BookKeeper storage can get expensive over time. BookKeeper ledgers are
accessed in three main ways:
+
+* Writes (low latency)
+* Tailing reads (low latency)
+* Catchup reads (latency is unimportant; throughput is important for some use
cases)
+
+By default, BookKeeper serves all three access patterns. Tiered storage
enables you to use a non-BookKeeper system for catchup reads.
+
+* The entries in a sealed ledger are immutable. Once a ledger has been sealed,
it no longer needs to be stored on SSDs and can be transferred to an object
storage system like Amazon S3 or Google Cloud Storage.
+* Each topic is stored on a single [managed ledger](#managed-ledgers), which
is a list of log segments in a fixed order, oldest first. All segments except
the most recent are sealed; the most recent still accepts writes.
+
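The managed ledger structure described above, combined with the `--size-threshold` option of the `offload` command, suggests a simple selection rule: walk the segments oldest first and offload sealed ones until the data left in BookKeeper fits under the threshold. The Java sketch below illustrates that idea. The class, its names, and the exact policy are assumptions for illustration, not Pulsar's actual algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a managed ledger's segment list (not Pulsar code).
class OffloadSelectionSketch {
    static final class Segment {
        final long ledgerId;
        final long sizeBytes;
        final boolean sealed; // only the most recent segment is unsealed

        Segment(long ledgerId, long sizeBytes, boolean sealed) {
            this.ledgerId = ledgerId;
            this.sizeBytes = sizeBytes;
            this.sealed = sealed;
        }
    }

    // Segments must be ordered oldest first. Returns the ledger IDs to
    // offload so that, where possible, at most thresholdBytes stays in
    // BookKeeper. Unsealed segments are never selected.
    static List<Long> selectForOffload(List<Segment> segments, long thresholdBytes) {
        long remaining = 0;
        for (Segment s : segments) {
            remaining += s.sizeBytes;
        }
        List<Long> toOffload = new ArrayList<>();
        for (Segment s : segments) {
            if (remaining <= thresholdBytes) {
                break; // enough data has been moved out
            }
            if (!s.sealed) {
                break; // still accepting writes, so it cannot be moved
            }
            toOffload.add(s.ledgerId);
            remaining -= s.sizeBytes;
        }
        return toOffload;
    }
}
```

Because the oldest segments go first, the data that stays in BookKeeper is always the most recent, which matches the split between "more recent" and "older" data described earlier.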
+### How tiered storage works
+
+Implementation:
+
+* Pulsar copies ledger segments as a whole from BookKeeper to object storage
+* Once copying is complete, the segment gets **tagged** in the managed ledger
(ML) segment list; the tag identifies the segment
+* Once the tag is added, the segment is deleted from BookKeeper
+* `ReadHandle` implementation reads from object storage
+* Interface for offloading; change ML to use offloading; triggering mechanism;
implementation for S3
+
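The ordering of the steps listed above (copy, then tag, then delete) can be sketched with an in-memory simulation. Everything in the Java block below is hypothetical: the maps stand in for BookKeeper, the object store, and the managed ledger's segment list, and none of the names come from Pulsar.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// In-memory simulation of the offload sequence (not Pulsar code).
class OffloadLifecycleSketch {
    private final Map<Long, byte[]> bookKeeper = new HashMap<>();    // ledgerId -> data
    private final Map<String, byte[]> objectStore = new HashMap<>(); // "ledgerId-tag" -> data
    private final Map<Long, UUID> offloadTags = new HashMap<>();     // tags in the segment list

    void addSealedSegment(long ledgerId, byte[] data) {
        bookKeeper.put(ledgerId, data.clone());
    }

    void offload(long ledgerId) {
        byte[] data = bookKeeper.get(ledgerId);
        if (data == null) {
            throw new IllegalStateException("ledger not in BookKeeper: " + ledgerId);
        }
        UUID tag = UUID.randomUUID();
        objectStore.put(ledgerId + "-" + tag, data.clone()); // 1. copy the whole segment
        offloadTags.put(ledgerId, tag);                      // 2. tag it in the segment list
        bookKeeper.remove(ledgerId);                         // 3. delete only after tagging
    }

    // Reads go to object storage when a tag exists, otherwise to BookKeeper.
    byte[] read(long ledgerId) {
        UUID tag = offloadTags.get(ledgerId);
        if (tag != null) {
            return objectStore.get(ledgerId + "-" + tag);
        }
        return bookKeeper.get(ledgerId);
    }

    boolean inBookKeeper(long ledgerId) {
        return bookKeeper.containsKey(ledgerId);
    }

    boolean isOffloaded(long ledgerId) {
        return offloadTags.containsKey(ledgerId);
    }
}
```

Deleting from BookKeeper only after the tag is recorded means a reader can always find the segment in at least one of the two places in this model.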
## Message retention and expiry
By default, Pulsar message {% popover brokers %}:
diff --git a/site/docs/latest/project/tiered-storage.md
b/site/docs/latest/project/tiered-storage.md
new file mode 100644
index 0000000000..44c0643e7c
--- /dev/null
+++ b/site/docs/latest/project/tiered-storage.md
@@ -0,0 +1,43 @@
+---
+title: Custom tiered storage
+tags: [tiered storage, s3, bookkeeper, storage, stream storage]
+---
+
+To create a custom tiered storage driver, implement the [`LedgerOffloader`](https://github.com/apache/incubator-pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/LedgerOffloader.java) interface:
+
+```java
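+import java.util.Map;
+import java.util.UUID;
+import java.util.concurrent.CompletableFuture;
+
+import org.apache.bookkeeper.client.api.ReadHandle;
+
+interface LedgerOffloader {
+    // The "write" action that offloads the BookKeeper ledger to the external system
+    CompletableFuture<Void> offload(
+        // A read handle for the ledger to offload
+        ReadHandle ledger,
+        // A unique ID for the offload attempt
+        UUID uid,
+        // Any metadata you'd like to add
+        Map<String, String> extraMetadata);
+
+    // The "read" action that retrieves the ledger from the external system
+    CompletableFuture<ReadHandle> readOffloaded(
+        // The ID of the offloaded ledger
+        long ledgerId,
+        // The unique ID used when the ledger was offloaded
+        UUID uid);
+
+    // The "delete" action that removes the offloaded ledger from the external system
+    CompletableFuture<Void> deleteOffloaded(long ledgerId, UUID uid);
+}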
+```
+
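To show the shape of an implementation without pulling in BookKeeper, the Java sketch below uses a simplified stand-in interface. `SimpleLedgerOffloader` and `InMemoryOffloader` are hypothetical: they pass ledger contents as `byte[]` where the real `LedgerOffloader` uses `ReadHandle`, and they store data in a map instead of an external system.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for LedgerOffloader (assumption: byte[] replaces ReadHandle).
interface SimpleLedgerOffloader {
    CompletableFuture<Void> offload(long ledgerId, byte[] data, UUID uid,
                                    Map<String, String> extraMetadata);
    CompletableFuture<byte[]> readOffloaded(long ledgerId, UUID uid);
    CompletableFuture<Void> deleteOffloaded(long ledgerId, UUID uid);
}

// Keeps offloaded ledgers in memory, keyed by ledger ID and offload attempt ID.
class InMemoryOffloader implements SimpleLedgerOffloader {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    private static String key(long ledgerId, UUID uid) {
        return ledgerId + "-" + uid;
    }

    @Override
    public CompletableFuture<Void> offload(long ledgerId, byte[] data, UUID uid,
                                           Map<String, String> extraMetadata) {
        // extraMetadata is ignored in this sketch
        store.put(key(ledgerId, uid), data.clone());
        return CompletableFuture.completedFuture(null);
    }

    @Override
    public CompletableFuture<byte[]> readOffloaded(long ledgerId, UUID uid) {
        byte[] data = store.get(key(ledgerId, uid));
        if (data == null) {
            CompletableFuture<byte[]> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                    new IllegalStateException("not offloaded: " + key(ledgerId, uid)));
            return failed;
        }
        return CompletableFuture.completedFuture(data.clone());
    }

    @Override
    public CompletableFuture<Void> deleteOffloaded(long ledgerId, UUID uid) {
        store.remove(key(ledgerId, uid));
        return CompletableFuture.completedFuture(null);
    }
}
```

Keying the store on both the ledger ID and the UUID keeps separate offload attempts for the same ledger from overwriting each other, which is one plausible reason the interface takes both values.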
+## Example implementation
+
+The following implementations are currently available for your perusal:
+
+*
[`S3ManagedLedgerOffloader`](https://github.com/apache/incubator-pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/s3offload/S3ManagedLedgerOffloader.java)
(for Amazon S3)
+
+## Deployment
+
+Once you've created a `LedgerOffloader` implementation, you need to:
+
+1. Package the implementation in a
[JAR](https://docs.oracle.com/javase/tutorial/deployment/jar/basicsindex.html)
file.
+1. Add that JAR file to the `lib` folder in your Pulsar [binary or source
distribution](../../getting-started/LocalCluster#installing-pulsar).
+1. Change the `managedLedgerOffloadDriver` configuration in
[`broker.conf`](../../reference/Configuration#broker) to the fully qualified
name of your custom class (e.g. `org.example.MyCustomLedgerOffloader`).
+1. Start up Pulsar.
\ No newline at end of file