[RFC] Guide to writing output filters

Joe Orton Fri, 16 Mar 2007 14:55:51 -0800

http://people.apache.org/~jorton/output-filters.html


How does this look?  Anything missed out, anything that doesn't make 
sense?  I think this covers most of the major problems in output filters 
which keep coming up.

I'd also like to add a simple buffering filter which "does things right" 
and can be used as a reference; all other in-tree filters are either too 
complicated (filters/*, http/* etc) or too awful (experimental/*).  Any 
objections?

Regards,

joe

Index: docs/manual/developer/output-filters.xml
===================================================================
--- docs/manual/developer/output-filters.xml    (revision 0)
+++ docs/manual/developer/output-filters.xml    (revision 0)
@@ -0,0 +1,457 @@
+<?xml version="1.0" encoding="UTF-8" ?>
+<!DOCTYPE manualpage SYSTEM "../style/manualpage.dtd">
+<?xml-stylesheet type="text/xsl" href="../style/manual.en.xsl"?>
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+<manualpage metafile="output-filters.xml.meta">
+  <parentdocument href="./">Developer Documentation</parentdocument>
+
+  <title>Guide to writing output filters</title>
+  
+  <summary>
+    <p>There are a number of common pitfalls encountered when writing
+    output filters; this page aims to document best practice for
+    authors of new or existing filters.</p>
+
+    <p>This document is applicable to both version 2.0 and version 2.2
+    of the Apache HTTP Server; it specifically targets
+    <code>RESOURCE</code>-level or <code>CONTENT_SET</code>-level
+    filters though some advice is generic to all types of filter.</p>
+  </summary>
+
+  <section id="basics">
+    <title>Filters and bucket brigades</title>
+
+    <p>Each time a filter is invoked, it is passed a <em>bucket
+    brigade</em>, containing a sequence of <em>buckets</em> which
+    represent both data content and metadata.  Every bucket has a
+    <em>bucket type</em>; a number of bucket types are defined and
+    used by the <code>httpd</code> core modules (and the
+    <code>apr-util</code> library which provides the bucket brigade
+    interface), but modules are free to define their own types.</p>
+
+    <note type="hint">Output filters must be prepared to process
+    buckets of non-standard types; with a few exceptions, a filter
+    need not care about the types of buckets being filtered.</note>
+
+    <p>A filter can tell whether a bucket represents either data or
+    metadata using the <code>APR_BUCKET_IS_METADATA</code> macro.
+    Generally, all metadata buckets should be passed up the filter
+    chain by an output filter.  Filters may transform, delete, and
+    insert data buckets as appropriate.</p>
+
+    <p>There are two metadata bucket types which all filters must pay
+    attention to: the <code>EOS</code> bucket type, and the
+    <code>FLUSH</code> bucket type.  An <code>EOS</code> bucket
+    indicates that the end of the response has been reached and no
+    further buckets need be processed.  A <code>FLUSH</code> bucket
+    indicates that the filter should flush any buffered buckets (if
+    applicable) up the filter chain immediately.</p>
+
+    <note type="hint"><code>FLUSH</code> buckets are sent when the
+    content generator (or a downstream filter) knows that there may be
+    a delay before more content can be sent.  By passing
+    <code>FLUSH</code> buckets up the filter chain immediately,
+    filters ensure that the client is not kept waiting for pending
+    data longer than necessary.</note>
+
+    <p>Filters can create <code>FLUSH</code> buckets and pass these up
+    the filter chain if desired.  Generating <code>FLUSH</code>
+    buckets unnecessarily, or too frequently, can harm network
+    utilisation since it may force large numbers of small packets to
+    be sent, rather than a small number of larger packets.  The
+    section on <a href="#nonblock">Non-blocking bucket reads</a>
+    covers a case where filters are encouraged to generate
+    <code>FLUSH</code> buckets.</p>
+
+    <example><title>Example bucket brigade</title>
+    <pre>HEAP FLUSH FILE EOS</pre></example>
+
+    <p>This shows a bucket brigade which may be passed to a filter; it
+    contains two metadata buckets (<code>FLUSH</code> and
+    <code>EOS</code>), and two data buckets (<code>HEAP</code> and
+    <code>FILE</code>).</p>
+
+  </section>
+
+  <section id="invocation">
+    <title>Filter invocation</title>
+    
+    <p>For any given request, an output filter might be invoked only
+    once and given a single brigade representing the entire response.
+    It is also possible that the number of times a filter is invoked
+    is proportional to the size of the content being filtered, with
+    the filter being passed a brigade containing a single bucket each
+    time.  Filters must operate correctly in either case.</p>
+
+    <note type="warning">An output filter which allocates long-lived
+    memory every time it is invoked may consume memory proportional to
+    response size.  Output filters which need to allocate memory
+    should do so once per response; see <a href="#state">Maintaining
+    state</a> below.</note>
+
+    <p>An output filter can determine the final invocation for a given
+    response by the presence of an <code>EOS</code> bucket in the
+    brigade.  Any buckets in the brigade after an EOS should be
+    ignored.</p>
+
+    <p>An output filter should never pass an empty brigade up the
+    filter chain.  But, for good defensive programming, filters should
+    be prepared to accept an empty brigade, and do nothing.</p>
+
+    <example><title>How to handle an empty brigade</title>
+    
+    <pre>apr_status_t dummy_filter(ap_filter_t *f, apr_bucket_brigade *bb)
+{
+    if (APR_BRIGADE_EMPTY(bb)) {
+        return APR_SUCCESS;
+    }
+    ....</pre></example>
+
+  </section>
+
+  <section id="brigade">
+    <title>Brigade structure</title>
+
+    <p>A bucket brigade is a doubly-linked list of buckets.  The list
+    is terminated (at both ends) by a <em>sentinel</em> which can be
+    distinguished from a normal bucket by comparing it with the
+    pointer returned by <code>APR_BRIGADE_SENTINEL</code>.  The list
+    sentinel is in fact not a valid bucket structure; any attempt to
+    call normal bucket functions (such as
+    <code>apr_bucket_read</code>) on the sentinel will have undefined
+    behaviour (i.e. will crash the process).</p>
+
+    <p>There are a variety of functions and macros for traversing and
+    manipulating bucket brigades; see the <a
+    href="http://apr.apache.org/docs/apr-util/trunk/group___a_p_r___util___bucket___brigades.html">apr_buckets.h</a>
+    header for complete coverage.  Commonly used macros include:
+
+    <dl>
+      <dt><code>APR_BRIGADE_FIRST(bb)</code></dt>
+      <dd>returns the first bucket in brigade bb</dd>
+
+      <dt><code>APR_BRIGADE_LAST(bb)</code></dt>
+      <dd>returns the last bucket in brigade bb</dd>
+
+      <dt><code>APR_BUCKET_NEXT(e)</code></dt>
+      <dd>gives the next bucket after bucket e</dd>
+
+      <dt><code>APR_BUCKET_PREV(e)</code></dt>
+      <dd>gives the bucket before bucket e</dd>
+
+    </dl></p>
+
+    <p>The <code>apr_bucket_brigade</code> structure itself is
+    allocated out of a pool, so if a filter creates a new brigade, it
+    must ensure that memory use is correctly bounded.  A filter which
+    allocates a new brigade out of the request pool
+    (<code>r->pool</code>) on every invocation, for example, will fall
+    foul of the <a href="#invocation">warning above</a> concerning
+    memory use.  Such a filter should instead create a brigade on the
+    first invocation per request, and store that brigade in its <a
+    href="#state">state structure</a>.</p>
+
+    <note type="warning">It is almost never advisable to use
+    <code>apr_brigade_destroy</code> to "destroy" a brigade.  The
+    memory used by the brigade structure will not be released by
+    calling this function (since it comes from a pool), but the
+    associated pool cleanup is unregistered.  Using
+    <code>apr_brigade_destroy</code> can in fact cause memory leaks;
+    if a "destroyed" brigade still contains buckets when its
+    containing pool is destroyed, those buckets will <em>not</em> be
+    immediately destroyed.</note>
+
+  </section>
+
+  <section id="buckets">
+
+    <title>Processing buckets</title>
+
+    <p>When dealing with non-metadata buckets, it is important to
+    understand that the "<code>apr_bucket *</code>" object is an
+    abstract <em>representation</em> of data:
+
+    <ol>
+      <li>The amount of data represented by the bucket may or may not
+      have a determinate length; for a bucket which represents data of
+      indeterminate length, the <code>->length</code> field is set to
+      the value <code>(apr_size_t)-1</code>.  The <code>PIPE</code>
+      bucket type is an example of a bucket type which has an
+      indeterminate length; it represents the output from a pipe.</li>
+
+      <li>The data represented by a bucket may or may not be mapped
+      into memory.  The <code>FILE</code> bucket type, for example,
+      represents data stored in a file on disk.</li>
+    </ol>
+
+    Filters read the data from a bucket using the
+    <code>apr_bucket_read</code> function.  When this function is
+    invoked, the bucket may <em>morph</em> into a different bucket
+    type, and may also insert a new bucket into the bucket brigade.
+    This must happen for buckets which represent data not mapped into
+    memory.</p>
+
+    <p>To give an example; consider a bucket brigade containing a
+    single <code>FILE</code> bucket representing an entire file, 24
+    kilobytes in size:</p>
+
+    <example><pre>FILE(0K-24K)</pre></example>
+
+    <p>When this bucket is read, it will read a block of data from the
+    file, morph into a <code>HEAP</code> bucket to represent that
+    data, and return the data to the caller.  It also inserts a new
+    <code>FILE</code> bucket representing the remainder of the file;
+    after the <code>apr_bucket_read</code> call, the brigade looks
+    like:</p>
+
+    <example><pre>HEAP(8K) FILE(8K-24K)</pre></example>
+
+  </section>
+
+  <section id="filtering">
+    <title>Filtering brigades</title>
+
+    <p>The basic function of any output filter will be to iterate
+    through the passed-in brigade and transform (or simply examine)
+    the content in some manner.  The implementation of the iteration
+    loop is critical to producing a well-behaved output filter.</p>
+
+    <p>Consider an example which loops through the entire brigade as
+    follows:
+
+    <example><title>Bad output filter -- do not imitate!</title>
+    <pre>apr_bucket *e = APR_BRIGADE_FIRST(bb);
+const char *data;
+apr_size_t length;
+
+while (e != APR_BRIGADE_SENTINEL(bb)) {
+   apr_bucket_read(e, &amp;data, &amp;length, APR_BLOCK_READ);
+   e = APR_BUCKET_NEXT(e);
+}
+
+return ap_pass_brigade(f->next, bb);</pre></example>
+
+    The above implementation would consume memory proportional to
+    content size.  If passed a <code>FILE</code> bucket, for example,
+    the entire file contents would be read into memory as each
+    <code>apr_bucket_read</code> call morphed a <code>FILE</code>
+    bucket into a <code>HEAP</code> bucket.</p>
+
+    <p>In contrast, the implementation below will consume a fixed
+    amount of memory to filter any brigade; a temporary brigade is
+    needed and must be allocated only once per response, see the <a
+    href="#state">Maintaining state</a> section.</p>
+
+    <example><title>Better output filter</title>
+
+    <pre>apr_bucket *e;
+const char *data;
+apr_size_t length;
+apr_status_t rv;
+
+while ((e = APR_BRIGADE_FIRST(bb)) != APR_BRIGADE_SENTINEL(bb)) {
+   rv = apr_bucket_read(e, &amp;data, &amp;length, APR_BLOCK_READ);
+   if (rv) ...;
+   /* Move bucket e from bb into the temporary brigade. */
+   APR_BUCKET_REMOVE(e);
+   APR_BRIGADE_INSERT_HEAD(tmpbb, e);
+   /* Pass brigade upstream. */
+   rv = ap_pass_brigade(f->next, tmpbb);
+   if (rv) ...;
+   apr_brigade_cleanup(tmpbb);
+}</pre></example>
+
+  </section>
+
+  <section id="state">
+
+    <title>Maintaining state</title>
+    
+    <p>A filter which needs to maintain state over multiple
+    invocations per response can use the <code>->ctx</code> field of
+    its <code>ap_filter_t</code> structure.  It is typical to store a
+    temporary brigade in such a structure, to avoid having to allocate
+    a new brigade per invocation as described in the <a
+    href="#brigade">Brigade structure</a> section.</p>
+    
+  <example><title>Example code to maintain filter state</title>
+
+  <pre>struct dummy_state {
+   apr_bucket_brigade *tmpbb;
+   int filter_state;
+   ....
+};
+
+apr_status_t dummy_filter(ap_filter_t *f, apr_bucket_brigade *bb)
+{
+    struct dummy_state *state;
+
+    state = f->ctx;
+    if (state == NULL) {
+       /* First invocation for this response: initialise state structure. */
+       f->ctx = state = apr_palloc(f->r->pool, sizeof *state);
+       
+       state->tmpbb = apr_brigade_create(f->r->pool, f->c->bucket_alloc);
+       state->filter_state = ...;
+    }
+    ...</pre></example>
+    
+  </section>
+  
+  <section id="buffer">
+    <title>Buffering buckets</title>
+
+    <p>If a filter decides to store buckets beyond the duration of a
+    single filter function invocation (for example storing them in its
+    <code>->ctx</code> state structure), those buckets must be <em>set
+    aside</em>.  This is necessary because some bucket types provide
+    buckets which represent temporary resources (such as stack memory)
+    which will fall out of scope as soon as the filter chain completes
+    processing the brigade.</p>
+
+    <p>To setaside a bucket, the <code>apr_bucket_setaside</code>
+    function can be called.  Not all bucket types can be setaside, but
+    if successful, the bucket will have morphed to ensure it has a
+    lifetime at least as long as the pool given as an argument to the
+    <code>apr_bucket_setaside</code> function.</p>
+
+    <p>Alternatively, the <code>ap_save_brigade</code> function can be
+    used, which will create a new brigade containing buckets with a
+    lifetime as long as the given pool argument.  This function must
+    be used with great care, however: on return it guarantees that all
+    the buckets in the returned brigade will represent data mapped
+    into memory.  If given an input brigade containing, for example, a
+    PIPE bucket, <code>ap_save_brigade</code> will consume an
+    arbitrary amount of memory to store the entire output of the
+    pipe.</p>
+
+    <note type="warning">Filters must ensure that any buffered data is
+    processed and passed up the filter chain during the last
+    invocation for a given response (a brigade containing an EOS
+    bucket).  Otherwise such data will be lost.</note>
+
+  </section>
+
+  <section id="nonblock">
+    <title>Non-blocking bucket reads</title>
+
+    <p>The <code>apr_bucket_read</code> function takes an
+    <code>apr_read_type_e</code> argument which determines whether a
+    <em>blocking</em> or <em>non-blocking</em> read will be performed
+    from the data source.  A good filter will first attempt to read
+    from every data bucket using a non-blocking read; if that fails
+    with <code>APR_EAGAIN</code>, then send a <code>FLUSH</code>
+    bucket up the filter chain, and retry using a blocking read.</p>
+    
+    <p>This mode of operation ensures that any filters further up the
+    filter chain will flush any buffered buckets if a slow content
+    source is being used.</p>
+
+    <p>A CGI script is an example of a slow content source which is
+    implemented as a bucket type. <module>mod_cgi</module> will send
+    <code>PIPE</code> buckets which represent the output from a CGI
+    script; reading from such a bucket will block when waiting for the
+    CGI script to produce more output.</p>
+
+    <example>
+      <title>Example code using non-blocking bucket reads</title>
+
+<pre>apr_bucket *e;
+apr_read_type_e mode = APR_NONBLOCK_READ;
+
+while ((e = APR_BRIGADE_FIRST(bb)) != APR_BRIGADE_SENTINEL(bb)) {
+    const char *data;
+    apr_size_t length;
+    apr_status_t rv = apr_bucket_read(e, &amp;data, &amp;length, mode);
+    if (APR_STATUS_IS_EAGAIN(rv) &amp;&amp; mode == APR_NONBLOCK_READ) {
+        /* Pass up a brigade containing a flush bucket: */
+        APR_BRIGADE_INSERT_TAIL(tmpbb, apr_bucket_flush_create(...));
+        rv = ap_pass_brigade(f->next, tmpbb);
+        apr_brigade_cleanup(tmpbb);
+        if (rv != APR_SUCCESS) return rv;
+
+        /* Retry, using a blocking read. */
+        mode = APR_BLOCK_READ;
+        continue;
+    } else if (rv != APR_SUCCESS) {
+        /* handle errors */
+    }
+
+    /* Next time, try a non-blocking read first. */
+    mode = APR_NONBLOCK_READ;
+    ...
+}</pre></example>
+
+  </section>
+
+  <section id="rules">
+    <title>Ten rules for output filters</title>
+
+    <p>In summary, here is a set of rules for all output filters to
+    follow:</p>
+
+    <ol>
+      <li>Output filters should not pass empty brigades up the filter
+      chain, but should be tolerant of being passed empty
+      brigades.</li>
+
+      <li>Output filters must pass all metadata buckets up the filter
+      chain; <code>FLUSH</code> buckets should be respected by passing
+      any pending or buffered buckets up the filter chain.</li>
+
+      <li>Output filters should ignore any buckets following an
+      <code>EOS</code> bucket.</li>
+
+      <li>Output filters which read all the buckets in a brigade must
+      process a fixed number of buckets (or amount of data) at a time,
+      to ensure that memory consumption is not proportional to the
+      size of the content being filtered.</li>
+      
+      <li>Output filters should be agnostic with respect to bucket
+      types, and must be able to process buckets of unfamiliar
+      type.</li>
+
+      <li>After calling <code>ap_pass_brigade</code> to pass a brigade
+      up the filter chain, output filters should call
+      <code>apr_brigade_cleanup</code> to ensure the brigade is empty
+      before reusing that brigade structure; output filters should
+      never use <code>apr_brigade_destroy</code> to "destroy"
+      brigades.</li>
+      
+      <li>Output filters must <em>setaside</em> any buckets which are
+      preserved beyond the duration of the filter function.</li>
+
+      <li>Output filters must not ignore the return value of
+      <code>ap_pass_brigade</code>, and must return appropriate errors
+      back down the filter chain.</li>
+
+      <li>Output filters must only create a fixed number of bucket
+      brigades for each response, rather than one per invocation.</li>
+
+      <li>Output filters should first attempt non-blocking reads from
+      each data bucket, and send a <code>FLUSH</code> bucket up the
+      filter chain if the read blocks, before retrying with a blocking
+      read.</li>
+
+    </ol>
+
+  </section>
+
+</manualpage>
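
As a rough cross-check of the memory argument in the "Filtering brigades" section above, here is a toy model in plain C. It is not part of the patch and does not use APR: the function names are hypothetical, and the 8K read block is an assumption taken from the HEAP(8K) example in the guide.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not APR APIs. */
#define TOY_READ_BLOCK_KB ((size_t)8)  /* assumed, from the HEAP(8K) example */

/* Size of the HEAP bucket produced by one read of a FILE bucket. */
static size_t toy_next_chunk(size_t remaining_kb)
{
    return remaining_kb < TOY_READ_BLOCK_KB ? remaining_kb : TOY_READ_BLOCK_KB;
}

/* "Bad" loop: read every bucket, then pass the whole brigade on.
 * Returns the peak number of KB held in HEAP buckets. */
static size_t toy_peak_read_all_then_pass(size_t file_kb)
{
    size_t held = 0;

    while (file_kb > 0) {
        size_t chunk = toy_next_chunk(file_kb);
        held += chunk;        /* morphed HEAP bucket stays in the brigade */
        file_kb -= chunk;
    }
    return held;              /* nothing was released before the pass */
}

/* "Better" loop: pass each bucket on and clean up before the next read.
 * Returns the peak number of KB held in HEAP buckets. */
static size_t toy_peak_pass_each_bucket(size_t file_kb)
{
    size_t peak = 0;

    while (file_kb > 0) {
        size_t held = toy_next_chunk(file_kb);  /* one HEAP bucket in tmpbb */
        assert(held <= TOY_READ_BLOCK_KB);
        if (held > peak) {
            peak = held;
        }
        file_kb -= held;      /* the pass and cleanup release the bucket */
    }
    return peak;
}
```

For the 24K file in the guide's example, the first function gives 24 and the second gives 8, which is the fixed bound the second loop is meant to provide.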

Property changes on: docs/manual/developer/output-filters.xml
___________________________________________________________________
Name: svn:eol-style
   + native
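
The control flow of the non-blocking read loop in the guide can also be sketched without APR. This is only an illustration under stated assumptions: the names are hypothetical, a "slow" bucket is assumed to fail exactly one non-blocking read with EAGAIN, and a blocking read is assumed always to succeed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not APR APIs. */
struct toy_counts {
    size_t read_calls;   /* total read attempts, in either mode */
    size_t flushes;      /* FLUSH brigades passed on before blocking */
};

/* slow[i] != 0 means a non-blocking read of bucket i would return EAGAIN. */
static struct toy_counts toy_filter_reads(const int *slow, size_t n)
{
    struct toy_counts c = { 0, 0 };
    int blocking = 0;    /* start with a non-blocking attempt */
    size_t i = 0;

    while (i < n) {
        c.read_calls++;
        if (slow[i] && !blocking) {
            /* Would block: send a FLUSH first, then retry the same bucket. */
            c.flushes++;
            blocking = 1;
            continue;
        }
        /* Read succeeded: the next bucket gets a non-blocking attempt again. */
        blocking = 0;
        i++;
    }
    /* Every flush costs exactly one extra read of the same bucket. */
    assert(c.read_calls == n + c.flushes);
    return c;
}
```

With four buckets of which the middle two are slow, the model sends two flushes and makes six read calls; fast buckets never trigger a flush.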

Index: modules/experimental/config.m4
===================================================================
--- modules/experimental/config.m4      (revision 519147)
+++ modules/experimental/config.m4      (working copy)
@@ -4,5 +4,6 @@
 APACHE_MODULE(example, example and demo module, , , no)
 APACHE_MODULE(case_filter, example uppercase conversion filter, , , no)
 APACHE_MODULE(case_filter_in, example uppercase conversion input filter, , , no)
+APACHE_MODULE(buffer_filter, example output filter which buffers buckets, , , no)
 
 APACHE_MODPATH_FINISH
Index: modules/experimental/mod_buffer_filter.c
===================================================================
--- modules/experimental/mod_buffer_filter.c    (revision 0)
+++ modules/experimental/mod_buffer_filter.c    (revision 0)
@@ -0,0 +1,124 @@
+/* Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "httpd.h"
+#include "http_config.h"
+#include "apr_buckets.h"
+#include "apr_general.h"
+#include "apr_lib.h"
+#include "util_filter.h"
+#include "http_request.h"
+#include "http_log.h"
+
+struct buffer_filter_state {
+    apr_bucket_brigade *tmpbb;
+    apr_size_t tmplen;
+};
+
+#define MAX_BUFFER_BYTES (8000)
+
+static apr_status_t buffer_filter(ap_filter_t *f, apr_bucket_brigade *bb)
+{
+    struct buffer_filter_state *state;
+    apr_read_type_e mode = APR_NONBLOCK_READ;
+    apr_bucket *e;
+
+    state = f->ctx;
+    if (state == NULL) {
+        /* First invocation for this response: initialise state structure. */
+        f->ctx = state = apr_palloc(f->r->pool, sizeof *state);
+       
+        state->tmpbb = apr_brigade_create(f->r->pool, f->c->bucket_alloc);
+        state->tmplen = 0;
+    }
+
+    /* Process passed-in brigade. */
+    while ((e = APR_BRIGADE_FIRST(bb)) != APR_BRIGADE_SENTINEL(bb)) {
+        apr_size_t length;
+        const char *data;
+        apr_status_t rv;
+
+        if (!APR_BUCKET_IS_METADATA(e)) {
+            rv = apr_bucket_read(e, &data, &length, mode);
+            if (APR_STATUS_IS_EAGAIN(rv) && mode == APR_NONBLOCK_READ) {
+                /* Pass up a brigade containing a flush bucket: */
+                APR_BRIGADE_INSERT_TAIL(state->tmpbb,
+                                        apr_bucket_flush_create(f->c->bucket_alloc));
+
+                rv = ap_pass_brigade(f->next, state->tmpbb);
+                apr_brigade_cleanup(state->tmpbb);
+                state->tmplen = 0;
+                if (rv != APR_SUCCESS) {
+                    return rv;
+                }
+                
+                /* Retry, using a blocking read. */
+                mode = APR_BLOCK_READ;
+                continue;
+            } 
+            else if (rv != APR_SUCCESS) { 
+                ap_log_rerror(APLOG_MARK, APLOG_ERR, rv, f->r,
+                              "could not read from bucket");
+                return APR_EGENERAL;
+            }
+            
+            /* Next time, try a non-blocking read first. */
+            mode = APR_NONBLOCK_READ;
+
+            state->tmplen += length;
+        }
+
+        APR_BUCKET_REMOVE(e);
+        APR_BRIGADE_INSERT_TAIL(state->tmpbb, e);
+        
+        if (APR_BUCKET_IS_FLUSH(e) || APR_BUCKET_IS_EOS(e) 
+            || state->tmplen >= MAX_BUFFER_BYTES) {
+            rv = ap_pass_brigade(f->next, state->tmpbb);
+            apr_brigade_cleanup(state->tmpbb);
+            state->tmplen = 0;
+            
+            if (rv) {
+                return rv;
+            }
+        }
+        else {
+            rv = apr_bucket_setaside(e, f->r->pool);
+            if (rv) {
+                ap_log_rerror(APLOG_MARK, APLOG_ERR, rv, f->r,
+                              "could not setaside bucket");
+                return APR_EGENERAL;
+            }
+        }
+    }
+    
+    return APR_SUCCESS;
+}
+
+static void register_hooks(apr_pool_t *p)
+{
+    ap_register_output_filter("BUFFER", buffer_filter, NULL, AP_FTYPE_RESOURCE);
+}
+
+module AP_MODULE_DECLARE_DATA buffer_filter_module =
+{
+    STANDARD20_MODULE_STUFF,
+    NULL,                         /* dir config creator */
+    NULL,                         /* dir merger --- default is to override */
+    NULL,                         /* server config */
+    NULL,                         /* merge server config */
+    NULL,                         /* command apr_table_t */
+    register_hooks                /* register hooks */
+};

Property changes on: modules/experimental/mod_buffer_filter.c
___________________________________________________________________
Name: svn:eol-style
   + native
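
To make the buffering rule in mod_buffer_filter.c easier to reason about, here is a toy model of just the "when do we pass the brigade on" decision. It is a sketch and not part of the patch: the toy_ names are hypothetical, only three bucket kinds are modelled (other metadata buckets and the EAGAIN flush path are left out), and MAX_BUFFER_BYTES mirrors the value in the patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BUFFER_BYTES 8000   /* same threshold as in the patch */

/* Toy model only: hypothetical names, not APR APIs. */
enum toy_kind { TOY_DATA, TOY_FLUSH, TOY_EOS };

struct toy_bucket {
    enum toy_kind kind;
    size_t length;              /* only meaningful for TOY_DATA */
};

/* Returns how many times the buffered brigade would be passed on;
 * *left_buffered receives the bytes still held when the call returns. */
static size_t toy_count_passes(const struct toy_bucket *b, size_t n,
                               size_t *left_buffered)
{
    size_t tmplen = 0, passes = 0, i;

    assert(left_buffered != NULL);
    for (i = 0; i < n; i++) {
        if (b[i].kind == TOY_DATA) {
            tmplen += b[i].length;
        }
        /* Pass on at FLUSH, at EOS, or once enough data is buffered. */
        if (b[i].kind != TOY_DATA || tmplen >= MAX_BUFFER_BYTES) {
            passes++;
            tmplen = 0;
        }
    }
    *left_buffered = tmplen;
    return passes;
}
```

Feeding it DATA(3000) three times, then DATA(100), then EOS gives two passes with nothing left buffered: one when the third bucket takes the total to 9000 bytes, and one at EOS.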
