lucperkins commented on a change in pull request #1466: Topic compaction
documentation
URL: https://github.com/apache/incubator-pulsar/pull/1466#discussion_r191533828
##########
File path: site/docs/latest/getting-started/ConceptsAndArchitecture.md
##########
@@ -522,18 +541,55 @@ while (true) {
To create a reader that will read from the latest available message:
```java
-MessageId id = MessageId.latest;
-Reader reader = pulsarClient.createReader(topic, id, new ReaderConfiguration());
+Reader<byte[]> reader = pulsarClient.newReader()
+ .topic(topic)
+ .startMessageId(MessageId.latest)
+ .create();
```
To create a reader that will read from some message between earliest and
latest:
```java
byte[] msgIdBytes = // Some byte array
MessageId id = MessageId.fromByteArray(msgIdBytes);
-Reader reader = pulsarClient.createReader(topic, id, new ReaderConfiguration());
+Reader<byte[]> reader = pulsarClient.newReader()
+ .topic(topic)
+ .startMessageId(id)
+ .create();
```
+## Topic compaction {#compaction}
+
+Pulsar was built with highly scalable [persistent
storage](#persistent-storage) of message data as a primary objective. Pulsar {%
popover topics %} enable you to persistently store as many unacknowledged
messages as you need while preserving message ordering. By default, Pulsar
stores *all* unacknowledged/unprocessed messages produced on a topic.
Accumulating many unacknowledged messages on a topic is necessary for many
Pulsar use cases, but it can also be very time-intensive for Pulsar {% popover
consumers %} to "rewind" through the entire log of messages.
+
+{% include admonition.html type="success" content="For a more practical guide
to topic compaction, see the [Topic compaction
cookbook](../../cookbooks/compaction)." %}
+
+For some use cases, however, consumers don't need a complete "image" of the
topic log. They may only need a few values to construct a more "shallow" image
of the log, perhaps even just the most recent value. For these kinds of use
cases, Pulsar offers **topic compaction**. When you run compaction on a topic,
Pulsar goes through a topic's backlog and removes messages that are *obscured*
by later messages, i.e. it goes through the topic on a per-key basis and leaves
only the most recent message associated with that key.
+
+Pulsar's topic compaction feature:
+
+* Can help preserve disk space and allow for much more efficient "rewind" of
topic logs
+* Applies only to [persistent topics](#persistent-storage)
+* Is triggered manually via the command line. See the [Topic compaction
cookbook](../../cookbooks/compaction).
+* Is conceptually and operationally distinct from [retention and
expiry](#message-retention-and-expiry)
+
+{% include admonition.html type="info" title="Topic compaction example: the
stock ticker"
+ content="An example use case for a compacted Pulsar topic would be a stock
ticker topic. On a stock ticker topic, each message carries a timestamped dollar
value for a stock (with the message key holding the stock symbol,
e.g. `AAPL` or `GOOG`). With a stock ticker you may care only about the most
recent value(s) of the stock and have no interest in historical data (i.e. you
don't need to construct a complete image of the topic's sequence of messages
per key). Compaction would be highly beneficial in this case because it would
keep consumers from needing to rewind through obscured messages." %}
+
+### How topic compaction works
+
+When topic compaction is triggered [via the CLI](../../cookbooks/compaction),
Pulsar will iterate over the entire topic from beginning to end. For each key
that it encounters, the {% popover broker %} responsible will keep a record of
the latest occurrence of that key. When this iterative process is finished, the
broker will create a [BookKeeper ledger](#ledgers) to store the compacted topic.
+
+After that, the broker will make a second iteration through each message on
the topic. For each message, if the message is the latest occurrence of its
key, then its data payload, message ID, and metadata will be written to
the newly created BookKeeper ledger. If the message is not the latest occurrence
of its key, it will be skipped and left alone. If any given message has an empty
payload, it will be skipped and considered deleted (akin to the concept of
[tombstones](http://docs.basho.com/riak/kv/2.2.3/using/reference/object-deletion/#tombstones)
in key-value databases). At the end of this second iteration through the
topic, the newly created BookKeeper ledger is closed and two things are written
to the topic's metadata: the ID of the BookKeeper ledger and the message ID of
the last compacted message (this is known as the **compaction horizon** of the
topic). Once this metadata is written, compaction is complete.
Review comment:
It's mentioned in the previous paragraph, but I'll see if I can clarify that
a bit.
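
For reference, the two passes described under "How topic compaction works" can be sketched in memory as follows. This is only an illustration under stated assumptions, not Pulsar's implementation: `CompactionSketch`, `Msg`, `Compacted`, and `compact` are hypothetical names, a plain list stands in for both the topic and the BookKeeper ledger, and the compaction horizon is assumed to be the ID of the last message the second pass covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of two-pass compaction; none of these types are Pulsar APIs.
public class CompactionSketch {

    /** Simplified message: position in the topic, key, and payload. */
    static final class Msg {
        final long id;
        final String key;
        final byte[] payload;

        Msg(long id, String key, byte[] payload) {
            this.id = id;
            this.key = key;
            this.payload = payload;
        }
    }

    /** Stand-in for the compacted ledger plus the metadata written at the end. */
    static final class Compacted {
        final List<Msg> ledger;
        final long horizon; // assumption: ID of the last message covered, -1 for an empty topic

        Compacted(List<Msg> ledger, long horizon) {
            this.ledger = ledger;
            this.horizon = horizon;
        }
    }

    static Compacted compact(List<Msg> topic) {
        // Pass 1: remember the ID of the latest occurrence of every key.
        Map<String, Long> latestIdByKey = new HashMap<>();
        for (Msg m : topic) {
            latestIdByKey.put(m.key, m.id);
        }

        // Pass 2: copy only the latest message per key, skipping empty payloads (tombstones).
        List<Msg> ledger = new ArrayList<>();
        long horizon = -1L;
        for (Msg m : topic) {
            horizon = m.id;
            boolean isLatest = latestIdByKey.get(m.key).longValue() == m.id;
            boolean isTombstone = m.payload.length == 0;
            if (isLatest && !isTombstone) {
                ledger.add(m);
            }
        }
        return new Compacted(ledger, horizon);
    }

    public static void main(String[] args) {
        List<Msg> topic = Arrays.asList(
                new Msg(1, "AAPL", new byte[] {10}),
                new Msg(2, "GOOG", new byte[] {20}),
                new Msg(3, "AAPL", new byte[] {11}),
                new Msg(4, "GOOG", new byte[0])); // empty payload: GOOG is treated as deleted

        Compacted result = compact(topic);
        for (Msg m : result.ledger) {
            System.out.println("kept id=" + m.id + " key=" + m.key);
        }
        System.out.println("horizon=" + result.horizon);
    }
}
```

In the sample topic, the older `AAPL` and `GOOG` messages are obscured by later ones, and the final `GOOG` message has an empty payload, so only the most recent `AAPL` message ends up in the compacted list. The original topic list is never modified, which mirrors the "skipped and left alone" wording in the doc.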