sijie closed pull request #1762: [WIP] Tiered storage documentation
URL: https://github.com/apache/incubator-pulsar/pull/1762
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/site/_data/cli/pulsar-admin.yaml b/site/_data/cli/pulsar-admin.yaml
index 21806199f9..147a84de6c 100644
--- a/site/_data/cli/pulsar-admin.yaml
+++ b/site/_data/cli/pulsar-admin.yaml
@@ -490,6 +490,19 @@ commands:
     - flags: -w, --wait-complete
       description: Wait for compaction to complete
       default: 'false'
+  - name: offload
+    description: Trigger offload of data from a topic to long-term storage 
(e.g. Amazon S3)
+    argument: "persistent://tenant/namespace/topic"
+    options:
+    - flags: -s, --size-threshold
+      description: The maximum amount of data to keep in BookKeeper for the 
specific topic
+  - name: offload-status
+    description: Check the status of data offloading from a topic to long-term 
storage
+    argument: "persistent://tenant/namespace/topic"
+    options:
+    - flags: -w, --wait-complete
+      description: Wait for offloading to complete
+      default: 'false'
   - name: create-partitioned-topic
     description: Create a partitioned topic. A partitioned topic must be 
created before producers can publish to it.
     argument: "{persistent|non-persistent}://tenant/namespace/topic"
diff --git a/site/_data/config/broker.yaml b/site/_data/config/broker.yaml
index 3b0118083a..dad70866e1 100644
--- a/site/_data/config/broker.yaml
+++ b/site/_data/config/broker.yaml
@@ -314,3 +314,21 @@ configs:
 - name: loadManagerClassName
   default: org.apache.pulsar.broker.loadbalance.impl.SimpleLoadManagerImpl
   description: Name of load manager to use
+- name: managedLedgerOffloadDriver
+  description: |
+    The name of the driver used for tiered storage offload. Current options: 
`S3`.
+- name: managedLedgerOffloadMaxThreads
+  description: The number of threads used for tiered storage offloading
+  default: 2
+- name: s3ManagedLedgerOffloadRegion
+  description: The AWS region used for tiered storage ledger offload (if 
`managedLedgerOffloadDriver` is set to `S3`)
+- name: s3ManagedLedgerOffloadBucket
+  description: The AWS bucket used for tiered storage ledger offload (if 
`managedLedgerOffloadDriver` is set to `S3`)
+- name: s3ManagedLedgerOffloadServiceEndpoint
+  description: The alternative service endpoint used for Amazon S3 tiered 
storage offload, which can be useful for testing (if 
`managedLedgerOffloadDriver` is set to `S3`)
+- name: s3ManagedLedgerOffloadMaxBlockSizeInBytes
+  description: The maximum block size for Amazon S3 ledger offloading (in 
bytes)
+  default: 67108864 (64 MB)
+- name: s3ManagedLedgerOffloadReadBufferSizeInBytes
+  description: The read buffer size for Amazon S3 ledger offloading (in bytes)
+  default: 1048576 (1 MB)
\ No newline at end of file
diff --git a/site/_data/sidebar.yaml b/site/_data/sidebar.yaml
index 950d2dad0b..1eb862e097 100644
--- a/site/_data/sidebar.yaml
+++ b/site/_data/sidebar.yaml
@@ -138,6 +138,8 @@ groups:
     endpoint: message-deduplication
   - title: Non-persistent messaging
     endpoint: non-persistent-messaging
+  - title: Tiered storage
+    endpoint: tiered-storage
   - title: Partitioned topics
     endpoint: PartitionedTopics
   - title: Retention and expiry
diff --git a/site/docs/latest/cookbooks/tiered-storage.md 
b/site/docs/latest/cookbooks/tiered-storage.md
new file mode 100644
index 0000000000..10098d9d18
--- /dev/null
+++ b/site/docs/latest/cookbooks/tiered-storage.md
@@ -0,0 +1,50 @@
+---
+title: Tiered storage cookbook
+tags: [tiered storage, s3, bookkeeper]
+---
+
+Pulsar's **tiered storage** feature enables you to offload message data from 
[BookKeeper](https://bookkeeper.apache.org) to another system. This cookbook 
walks you through using tiered storage in your Pulsar cluster.
+
+{% include admonition.html type="info" content="For a more high-level, 
theoretical perspective on tiered storage, see the [Concepts and 
Architecture](../../getting-started/ConceptsAndArchitecture#tiered-storage) 
documentation. For a guide to creating a custom tiered storage driver, see the 
[Custom tiered storage](../../project/tiered-storage) documentation." %}
+
+## Supported tiered storage targets {#targets}
+
+Pulsar currently supports the following tiered storage targets:
+
+* [Amazon Simple Storage Service (S3)](#s3)
+
+## Configuration {#config}
+
+To use tiered storage in Pulsar, you'll need to adjust the {% popover 
broker %}-level configuration on each broker in your cluster. Broker 
configuration is set in the 
[`broker.conf`](../../reference/Configuration#broker) file in the `conf` 
directory of your Pulsar installation. First, specify a ledger offload 
driver by setting the `managedLedgerOffloadDriver` parameter to the 
driver's configuration name.
+
+The following drivers are available:
+
+Driver | Configuration name
+:------|:------------------
+[Amazon S3](#s3) | `S3`
+
+In addition to specifying a driver, you can also specify the number of threads 
used by ledger-offloading-related processes using the 
`managedLedgerOffloadMaxThreads` parameter. The default is 2.
+
+Here's an example configuration:
+
+```conf
+# Other configs
+managedLedgerOffloadDriver=S3
+managedLedgerOffloadMaxThreads=5
+```
+
+## Amazon Simple Storage Service (S3) {#s3}
+
+In order to use the Amazon S3 ledger offloader for tiered storage, you need to 
set the `managedLedgerOffloadDriver` parameter to `S3` in your [broker 
configuration](#config). The following S3-specific parameters are also 
available.
+
+Parameter | Description | Default
+:---------|:------------|:-------
+`s3ManagedLedgerOffloadRegion` | The AWS region used for tiered storage ledger 
offload |
+`s3ManagedLedgerOffloadBucket` | The AWS bucket used for tiered storage ledger 
offload |
+`s3ManagedLedgerOffloadServiceEndpoint` | The alternative service endpoint 
used for Amazon S3 tiered storage offload, which can be useful for testing |
+`s3ManagedLedgerOffloadMaxBlockSizeInBytes` | The maximum block size for 
Amazon S3 ledger offloading (in bytes). The minimum allowed value is 5 MB 
(5242880 bytes). | 67108864 (64 MB)
+`s3ManagedLedgerOffloadReadBufferSizeInBytes` | The read buffer size for 
Amazon S3 ledger offloading (in bytes) | 1048576 (1 MB)
+
+## Creating your own driver
+
+The [Amazon S3 driver](#s3) for tiered storage is the only driver that's 
currently available. You can also, however, create and run your own driver by 
following the instructions in the [Custom tiered 
storage](../../project/tiered-storage) documentation.
\ No newline at end of file
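
The block size rule in the S3 parameter table above (64 MB default, 5 MB minimum) can be checked before a broker is started. The following is a minimal sketch and not part of this patch: the class `S3OffloadConfigCheck` and its methods are hypothetical, and only the parameter name and the two sizes come from the table.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Pulsar: resolves and checks the S3 block size setting.
final class S3OffloadConfigCheck {
    static final long MIN_BLOCK_SIZE = 5L * 1024 * 1024;      // 5242880 bytes (documented minimum)
    static final long DEFAULT_BLOCK_SIZE = 64L * 1024 * 1024; // 67108864 bytes (documented default)

    // Parses broker.conf-style "key=value" text into Properties.
    static Properties parse(String confText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(confText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the configured block size, or the default when the key is absent or empty.
    static long effectiveMaxBlockSize(Properties conf) {
        String raw = conf.getProperty("s3ManagedLedgerOffloadMaxBlockSizeInBytes");
        long size = (raw == null || raw.trim().isEmpty())
                ? DEFAULT_BLOCK_SIZE
                : Long.parseLong(raw.trim());
        if (size < MIN_BLOCK_SIZE) {
            throw new IllegalArgumentException(
                    "s3ManagedLedgerOffloadMaxBlockSizeInBytes must be at least " + MIN_BLOCK_SIZE);
        }
        return size;
    }

    public static void main(String[] args) {
        Properties conf = parse(
                "managedLedgerOffloadDriver=S3\n"
                + "s3ManagedLedgerOffloadMaxBlockSizeInBytes=8388608\n");
        System.out.println(effectiveMaxBlockSize(conf));
    }
}
```

A value below 5242880 is rejected with an `IllegalArgumentException`. How the broker itself reacts to such a value is not stated in the patch.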
diff --git a/site/docs/latest/getting-started/ConceptsAndArchitecture.md 
b/site/docs/latest/getting-started/ConceptsAndArchitecture.md
index 2965f072d1..a19047f7ca 100644
--- a/site/docs/latest/getting-started/ConceptsAndArchitecture.md
+++ b/site/docs/latest/getting-started/ConceptsAndArchitecture.md
@@ -346,6 +346,42 @@ In BookKeeper, *journal* files contain BookKeeper 
transaction logs. Before makin
 
 A future version of BookKeeper will support *non-persistent messaging* and 
thus multiple durability modes at the topic level. This will enable you to set 
the durability mode at the topic level, replacing the `persistent` in topic 
names with a `non-persistent` indicator.
 
+## Tiered storage
+
+By default, Pulsar uses [Apache BookKeeper](https://bookkeeper.apache.org) for 
all [persistent message storage](#persistent-storage). BookKeeper is an ideal 
system to be used for this purpose for a [variety of 
reasons](https://streaml.io/blog/messaging-storage-or-both/), but BookKeeper 
storage can get expensive over time. Fortunately, Pulsar also offers a **tiered 
storage** capability that enables you to utilize multiple storage systems for 
Pulsar message data:
+
+* BookKeeper for more recent data
+* Another system for older data
+
+With tiered storage, you can determine what counts as "older" and "more 
recent" via configuration. Tiered storage in Pulsar works via a process called 
**ledger offloading**. With ledger offloading, Pulsar {% popover 
brokers %} copy sealed ledger segments from BookKeeper to another storage system.
+
+The following tiered storage offload targets are currently supported:
+
+* [Amazon's Simple Storage Service (S3)](https://aws.amazon.com/s3) (usage 
docs [here](../../cookbooks/tiered-storage))
+
+### Why tiered storage?
+
+BookKeeper storage can get expensive over time. BookKeeper ledgers are 
accessed in three ways:
+
+* Writes (low latency)
+* Tailing reads (low latency)
+* Catchup reads (latency is unimportant; throughput is important for some use 
cases)
+
+By default, BookKeeper serves all three access patterns. Tiered storage 
enables you to use a non-BookKeeper system for catchup reads.
+
+* Once a ledger is sealed, its entries are immutable. A sealed ledger no 
longer needs to be stored on SSDs and can be transferred to an object storage 
system like Amazon S3 or Google Cloud Storage.
+* Each topic is stored on a single [managed ledger](#managed-ledgers), which 
is a list of log segments in a fixed order, oldest first. All segments except 
the most recent are sealed; the most recent still accepts writes.
+
+### How tiered storage works
+
+Ledger offloading is implemented as follows:
+
+* Pulsar copies ledger segments as a whole from BookKeeper to object storage.
+* Once copying is complete, the segment gets **tagged** in the managed 
ledger's segment list; the tag identifies the segment.
+* Once the tag is added, the segment is deleted from BookKeeper.
+* A `ReadHandle` implementation reads the segment from object storage.
+* The feature consists of an interface for offloading, changes to the managed 
ledger to use offloading, a triggering mechanism, and an implementation for S3.
+
 ## Message retention and expiry
 
 By default, Pulsar message {% popover brokers %}:
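
The copy, tag, delete sequence listed under "How tiered storage works" in the diff above can be illustrated with a small in-memory model. This is a sketch only and not part of the patch: `OffloadLifecycleSketch`, its maps standing in for BookKeeper and the object store, and the key format are all hypothetical, and Pulsar's real managed ledger code is structured differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical in-memory model of ledger offloading; not Pulsar code.
final class OffloadLifecycleSketch {
    static final class Segment {
        final long ledgerId;
        final boolean sealed;
        UUID offloadTag; // null until the segment has been offloaded

        Segment(long ledgerId, boolean sealed) {
            this.ledgerId = ledgerId;
            this.sealed = sealed;
        }
    }

    final List<Segment> segments = new ArrayList<>();        // oldest first
    final Map<Long, byte[]> bookKeeper = new HashMap<>();    // stand-in for BookKeeper
    final Map<String, byte[]> objectStore = new HashMap<>(); // stand-in for S3

    void addSegment(long ledgerId, boolean sealed, byte[] data) {
        segments.add(new Segment(ledgerId, sealed));
        bookKeeper.put(ledgerId, data);
    }

    private static String key(long ledgerId, UUID tag) {
        return ledgerId + "-" + tag;
    }

    // Offloads every sealed segment that has not been offloaded yet; returns how many.
    int offloadSealedSegments() {
        int offloaded = 0;
        for (Segment s : segments) {
            if (!s.sealed || s.offloadTag != null) {
                continue; // the open segment still accepts writes, so it stays put
            }
            UUID tag = UUID.randomUUID();
            objectStore.put(key(s.ledgerId, tag), bookKeeper.get(s.ledgerId)); // 1. copy as a whole
            s.offloadTag = tag;                                                // 2. tag in the segment list
            bookKeeper.remove(s.ledgerId);                                     // 3. delete from BookKeeper
            offloaded++;
        }
        return offloaded;
    }

    // Reads a segment from wherever it currently lives.
    byte[] read(long ledgerId) {
        for (Segment s : segments) {
            if (s.ledgerId == ledgerId) {
                return s.offloadTag == null
                        ? bookKeeper.get(ledgerId)
                        : objectStore.get(key(ledgerId, s.offloadTag));
            }
        }
        throw new IllegalArgumentException("Unknown ledger " + ledgerId);
    }
}
```

The ordering matters in this model: the BookKeeper copy is removed only after the tag is recorded, so a reader can always locate the data in one of the two stores.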
diff --git a/site/docs/latest/project/tiered-storage.md 
b/site/docs/latest/project/tiered-storage.md
new file mode 100644
index 0000000000..44c0643e7c
--- /dev/null
+++ b/site/docs/latest/project/tiered-storage.md
@@ -0,0 +1,43 @@
+---
+title: Custom tiered storage
+tags: [tiered storage, s3, bookkeeper, storage, stream storage]
+---
+
+[`LedgerOffloader`](https://github.com/apache/incubator-pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/LedgerOffloader.java)
+
+```java
+interface LedgerOffloader {
+    // The "write" action: offloads a BookKeeper ledger to the external system
+    CompletableFuture<Void> offload(
+            // A read handle for the ledger to offload
+            ReadHandle ledger,
+            // A unique ID for the offload attempt
+            UUID uid,
+            // Any metadata you'd like to add
+            Map<String, String> extraMetadata);
+
+    // The "read" action that retrieves the ledger from the external system
+    CompletableFuture<ReadHandle> readOffloaded(
+            // The ledger ID and the unique ID of the offload attempt
+            long ledgerId,
+            UUID uid);
+
+    // The "delete" action that removes the ledger from the external system
+    CompletableFuture<Void> deleteOffloaded(long ledgerId, UUID uid);
+}
+```
+
+## Example implementation
+
+The following implementations are currently available for your perusal:
+
+* 
[`S3ManagedLedgerOffloader`](https://github.com/apache/incubator-pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/s3offload/S3ManagedLedgerOffloader.java)
 (for Amazon S3)
+
+## Deployment
+
+Once you've created a `LedgerOffloader` implementation, you need to:
+
+1. Package the implementation in a 
[JAR](https://docs.oracle.com/javase/tutorial/deployment/jar/basicsindex.html) 
file.
+1. Add that JAR to the `lib` folder in your Pulsar [binary or source 
distribution](../../getting-started/LocalCluster#installing-pulsar).
+1. Change the `managedLedgerOffloadDriver` configuration in 
[`broker.conf`](../../reference/Configuration#broker) to your custom class 
(e.g. `org.example.MyCustomLedgerOffloader`).
+1. Start up Pulsar.
\ No newline at end of file
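
To make the three `LedgerOffloader` actions in the diff above concrete, here is a minimal in-memory sketch that is not part of the patch. It assumes a simplified stand-in interface, `SimpleLedgerOffloader`, in which a `byte[]` replaces BookKeeper's `ReadHandle` so that the example compiles without Pulsar or BookKeeper on the classpath. Both type names are hypothetical, and a real driver would implement `LedgerOffloader` itself.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for LedgerOffloader (assumption: byte[] replaces ReadHandle).
interface SimpleLedgerOffloader {
    CompletableFuture<Void> offload(long ledgerId, byte[] ledgerData, UUID uid,
                                    Map<String, String> extraMetadata);

    CompletableFuture<byte[]> readOffloaded(long ledgerId, UUID uid);

    CompletableFuture<Void> deleteOffloaded(long ledgerId, UUID uid);
}

// Hypothetical driver that keeps offloaded ledgers in a map instead of an external system.
final class InMemoryLedgerOffloader implements SimpleLedgerOffloader {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    // One stored object per (ledger, offload attempt) pair.
    private static String key(long ledgerId, UUID uid) {
        return ledgerId + "-" + uid;
    }

    @Override
    public CompletableFuture<Void> offload(long ledgerId, byte[] ledgerData, UUID uid,
                                           Map<String, String> extraMetadata) {
        // extraMetadata is ignored here; a real driver would persist it with the data.
        store.put(key(ledgerId, uid), ledgerData.clone());
        return CompletableFuture.completedFuture(null);
    }

    @Override
    public CompletableFuture<byte[]> readOffloaded(long ledgerId, UUID uid) {
        byte[] data = store.get(key(ledgerId, uid));
        if (data == null) {
            CompletableFuture<byte[]> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                    new IllegalStateException("No offloaded data for ledger " + ledgerId));
            return failed;
        }
        return CompletableFuture.completedFuture(data.clone());
    }

    @Override
    public CompletableFuture<Void> deleteOffloaded(long ledgerId, UUID uid) {
        store.remove(key(ledgerId, uid));
        return CompletableFuture.completedFuture(null);
    }
}
```

Keying the stored data by both ledger ID and UUID keeps separate offload attempts for the same ledger from overwriting each other, which is one plausible reason the interface passes a UUID alongside the ledger ID.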


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
