Regex based publisher proposal

Sébastien Arnaud Wed, 06 Sep 2006 21:59:30 -0700

Hi,

I have been following mod_python development with great interest for quite a while now. In light of a few emails over the past few months discussing web frameworks in mod_python, I decided to try contributing to the project, with the aim of moving towards a fast, flexible MVC web framework built only on mod_python.

I have written two or three such frameworks over the past couple of years, but nothing worth sharing by any means. They have, however, helped me define what the "dream" web framework for mod_python would be and, more importantly, identify the plumbing improvements mod_python needs.

One of the first needed improvements, in my opinion, is the ability to route web requests more flexibly than the current publisher module allows. So I would like to propose the following module (pubre.py). It is basically a copy of the mod_python.publisher module, except that much of the core handler code has been modified to use a regex to route a web request to the appropriate module/function.

I have been developing against mod_python/trunk, and I have attached the file for whoever wants to review it and give it a try. Keep in mind that it is probably still rough around the edges and has not been through any solid testing yet. I only did some trivial benchmarking/stress testing to make sure that, performance-wise, it is on par with the current mod_python.publisher.

The default behavior is supposed to be 100% compatible with the way mod_python.publisher behaves. Eventually, though, you would be able to pass the grammar of your web application's URLs as a PythonOption, by simply declaring something like:


<Directory "/mypath/mydir/">
        AddHandler mod_python .py .html
        PythonHandler mod_python.pubre
        PythonOption "pubregex" "(?P<controller>[\w]+)?(\.(?P<extension>[\w]+))?(/(?P<action>[^/]+))?(\?$)?"
</Directory>
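
As a minimal illustration of what this grammar captures, the sketch below applies the same pattern with the standard re module, outside of Apache. The helper name split_url and the sample paths are made up for the example; the pattern itself is the default used in the attached module.

```python
import re

# Default grammar from the attached pubre.py, written as a raw string.
PUBREGEX = r"(?P<controller>[\w]+)?(\.(?P<extension>[\w]+))?(/(?P<action>[^/]+))?(\?$)?"
_pattern = re.compile(PUBREGEX)

def split_url(url):
    """Hypothetical helper: return (controller, extension, action)."""
    m = _pattern.match(url)
    if m is None:
        return (None, None, None)
    return (m.group("controller"), m.group("extension"), m.group("action"))

# Sample paths are assumptions chosen for illustration only.
for sample in ("blog.py/show", "blog/show", "index.html", ""):
    print(sample, split_url(sample))
```

Parts that are absent from the URL come back as None, which is where the handler in the attached module substitutes its defaults ('index', 'html' and 'index').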

I know that not all grammars will work with the attached version, because some code still depends on the conservative URL structure /path/file.ext. Eventually, though, I hope to get this solved and allow any regex grammar to work.
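
One constraint that can be checked up front: the Mapper class in the attached module reads the groups named controller, extension and action from every match, so a custom grammar that omits one of them would fail with an IndexError at request time. Below is a hedged sketch of such a check, using the groupindex attribute of compiled patterns. The function name missing_groups is hypothetical and not part of pubre.py.

```python
import re

# Group names that the attached Mapper.__call__ reads from each match.
REQUIRED_GROUPS = ("controller", "extension", "action")

def missing_groups(grammar):
    """Return the required group names that the grammar does not define."""
    defined = re.compile(grammar).groupindex  # maps group name -> group number
    return [name for name in REQUIRED_GROUPS if name not in defined]

# An example grammar (an assumption) for /controller/action style URLs,
# which has no extension part.
print(missing_groups(r"(?P<controller>\w+)(/(?P<action>[^/]+))?"))
```

A grammar that does not need one of the parts could still declare it as an optional named group, so that the lookup returns None instead of failing.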

Anyway, please share your comments and feedback so I can make sure I am headed in the right direction. Keep in mind that my first goal is to be able to publish a callable class within a module using a defined regex URL grammar. I believe that once this first step is accomplished, the real design of the web framework can begin.


Cheers!

Sébastien

 #
 # Copyright 2004 Apache Software Foundation 
 # 
 # Licensed under the Apache License, Version 2.0 (the "License"); you
 # may not use this file except in compliance with the License.  You
 # may obtain a copy of the License at
 #
 #      http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
 # implied.  See the License for the specific language governing
 # permissions and limitations under the License.
 #
 # Originally developed by Gregory Trubetskoy.
 #
 # $Id: publisher.py 384754 2006-03-10 10:20:06Z grahamd $

"""
  This handler is conceptually similar to Zope's ZPublisher, except
  that it:

  1. Is written specifically for mod_python and is therefore much faster
  2. Does not require objects to have a documentation string
  3. Passes all arguments as simple strings
  4. Does not try to match Python errors to HTTP errors
  5. Does not give special meaning to '.' and '..'.
"""

import apache
import util

import sys
import os
from os.path import exists, isabs, normpath, split, isfile, join, dirname
import imp
import re
import base64

import new
import types
from types import *

imp_suffixes = " ".join([x[0][1:] for x in imp.get_suffixes()])

####################### The published page cache ##############################

from cache import ModuleCache, NOT_INITIALIZED

class PageCache(ModuleCache):
    """ This is the cache for page objects. Handles the automatic reloading of pages. """
    
    def key(self, req):
        """ Extracts the normalized filename from the request """
        return req.filename
    
    def check(self, key, req, entry):
        config = req.get_config()
        autoreload=int(config.get("PythonAutoReload", 1))
        if autoreload==0 and entry._value is not NOT_INITIALIZED:
            # if we don't want to reload and we have a value,
            # then we consider it fresh
            return None
        else:
            return ModuleCache.check(self, key, req, entry)

    def build(self, key, req, opened, entry):
        config = req.get_config()
        log=int(config.get("PythonDebug", 0))
        if log:
            if entry._value is NOT_INITIALIZED:
                req.log_error('Publisher loading page %s'%req.filename, apache.APLOG_NOTICE)
            else:
                req.log_error('Publisher reloading page %s'%req.filename, apache.APLOG_NOTICE)
        return ModuleCache.build(self, key, req, opened, entry)

page_cache = PageCache()


###################### The publisher regex mapper ##############################

class Mapper:
    """ This is the object to cache the regex engine """
    regex = r"(?P<controller>[\w]+)?(\.(?P<extension>[\w]+))?(/(?P<action>[^/]+))?(\?$)?"
    regex_compared = 0

    def __init__(self):
        self.reobj = re.compile(self.regex)

    def __call__(self, uri, cre):
        # Switch to the custom regex the first time one is supplied that
        # differs from the default; after that the compiled pattern is
        # reused for every request.
        if cre is not None and not self.regex_compared and cre != self.regex:
            self.regex = cre
            self.reobj = re.compile(self.regex)
            self.regex_compared = 1
        m = self.reobj.match(uri)
        if m:
            return (m.group('controller'), m.group('extension'), m.group('action'))
        else:
            return (None, None, None)

mapper_cache = Mapper()




####################### The publisher handler itself ###########################

def handler(req):

    req.allow_methods(["GET", "POST", "HEAD"])
    if req.method not in ["GET", "POST", "HEAD"]:
        raise apache.SERVER_RETURN, apache.HTTP_METHOD_NOT_ALLOWED

    path,module_name =  os.path.split(req.filename)

    # Trimming the front part of req.uri
    if module_name=='':
        req_url = ''
    else:
        req_url = req.uri[req.uri.index(module_name):]
    
    # Retrieve custom regex mapping if any
    # Warning, depending on the custom regex passed along
    # some of the code in handler might need tweaking
    # to make sure all is functional (missing . comes to mind)
    try:
        custom_regex = req.get_options()["pubregex"]
    except KeyError:
        custom_regex = None

    # Use the mapper_cache obj to determine
    # the controller, extension and action requested
    controller, extension, action = mapper_cache(req_url, custom_regex)

    # Set defaults if None values returned
    if controller is None:
        controller = 'index'
    if extension is None:
        extension = 'html'
    if action is None:
        action = 'index'

    # Now determine the actual Python module code file
    # to load. This looks for the file
    # '/path/<controller>.py'.
    req.filename = path + '/' + controller + '.py'
    if not exists(req.filename):
        raise apache.SERVER_RETURN, apache.HTTP_NOT_FOUND

    # Normalise req.filename to avoid Win32 issues.
    req.filename = normpath(req.filename)

    # We use the page cache to load the module
    module = page_cache[req]

    # does it have an __auth__?
    realm, user, passwd = process_auth(req, module)

    # resolve the object ('traverse')
    object = resolve_object(req, module, action, realm, user, passwd)

    # publish the object
    published = publish_object(req, object)
    
    # we log a message if nothing was published, it helps with debugging
    if (not published) and (req.bytes_sent==0) and (req.next is None):
        log=int(req.get_config().get("PythonDebug", 0))
        if log:
            req.log_error("mod_python.publisher: nothing to publish.")

    return apache.OK

def process_auth(req, object, realm="unknown", user=None, passwd=None):

    found_auth, found_access = 0, 0

    if hasattr(object, "__auth_realm__"):
        realm = object.__auth_realm__

    func_object = None

    if type(object) is FunctionType:
        func_object = object
    elif type(object) == types.MethodType:
        func_object = object.im_func

    if func_object:
        # functions are a bit tricky

        func_code = func_object.func_code
        func_globals = func_object.func_globals

        if "__auth__" in func_code.co_names:
            i = list(func_code.co_names).index("__auth__")
            __auth__ = func_code.co_consts[i+1]
            if hasattr(__auth__, "co_name"):
                __auth__ = new.function(__auth__, func_globals)
            found_auth = 1

        if "__access__" in func_code.co_names:
            # first check the constant names
            i = list(func_code.co_names).index("__access__")
            __access__ = func_code.co_consts[i+1]
            if hasattr(__access__, "co_name"):
                __access__ = new.function(__access__, func_globals)
            found_access = 1

        if "__auth_realm__" in func_code.co_names:
            i = list(func_code.co_names).index("__auth_realm__")
            realm = func_code.co_consts[i+1]

    else:
        if hasattr(object, "__auth__"):
            __auth__ = object.__auth__
            found_auth = 1
        if hasattr(object, "__access__"):
            __access__ = object.__access__
            found_access = 1

    if found_auth or found_access:
        # because ap_get_basic insists on making sure that AuthName and
        # AuthType directives are specified and refuses to do anything
        # otherwise (which is technically speaking a good thing), we
        # have to do base64 decoding ourselves.
        #
        # to avoid needless header parsing, user and password are parsed
        # once and then received as arguments
        if not user and req.headers_in.has_key("Authorization"):
            try:
                s = req.headers_in["Authorization"][6:]
                s = base64.decodestring(s)
                user, passwd = s.split(":", 1)
            except:
                raise apache.SERVER_RETURN, apache.HTTP_BAD_REQUEST

    if found_auth:

        if not user:
            # note that Opera supposedly doesn't like spaces around "=" below
            s = 'Basic realm="%s"' % realm
            req.err_headers_out["WWW-Authenticate"] = s
            raise apache.SERVER_RETURN, apache.HTTP_UNAUTHORIZED

        if callable(__auth__):
            rc = __auth__(req, user, passwd)
        else:
            if type(__auth__) is DictionaryType:
                rc = __auth__.has_key(user) and __auth__[user] == passwd
            else:
                rc = __auth__
            
        if not rc:
            s = 'Basic realm="%s"' % realm
            req.err_headers_out["WWW-Authenticate"] = s
            raise apache.SERVER_RETURN, apache.HTTP_UNAUTHORIZED

    if found_access:

        if callable(__access__):
            rc = __access__(req, user)
        else:
            if type(__access__) in (ListType, TupleType):
                rc = user in __access__
            else:
                rc = __access__

        if not rc:
            raise apache.SERVER_RETURN, apache.HTTP_FORBIDDEN

    return realm, user, passwd

### Those are the traversal and publishing rules ###

# tp_rules is a dictionary, indexed by type, with tuple values.
# The first item in the tuple is a boolean telling if the object can be traversed (default is True)
# The second item in the tuple is a boolean telling if the object can be published (default is True)
tp_rules = {}

# by default, built-in types cannot be traversed, but can be published
default_builtins_tp_rule = (False, True)
for t in types.__dict__.values():
    if isinstance(t, type):
        tp_rules[t]=default_builtins_tp_rule

# those are the exceptions to the previous rules
tp_rules.update({
    # Those are not traversable nor publishable
    ModuleType          : (False, False),
    BuiltinFunctionType : (False, False),
    
    # This may change in the near future to (False, True)
    ClassType           : (False, False),
    TypeType            : (False, False),
    
    # Publishing a generator may not seem to make sense, because
    # it can only be done once. However, we could get a brand new generator
    # each time a new-style class property is accessed.
    GeneratorType       : (False, True),
    
    # Old-style instances are traversable
    InstanceType        : (True, True),
})

# types which are not referenced in the tp_rules dictionary will be traversable
# AND publishable 
default_tp_rule = (True, True)

def resolve_object(req, obj, object_str, realm=None, user=None, passwd=None):
    """
    This function traverses the objects separated by .
    (period) to find the last one we're looking for.
    """
    parts = object_str.split('.')

    first_object = True        
    for obj_str in parts:
        # path components starting with an underscore are forbidden
        if obj_str[0]=='_':
            req.log_error('Cannot traverse %s in %s because '
                          'it starts with an underscore'
                          % (obj_str, req.unparsed_uri), apache.APLOG_WARNING)
            raise apache.SERVER_RETURN, apache.HTTP_FORBIDDEN

        if first_object:
            first_object = False
        else:
            # if we're not in the first object (which is the module)
            # we're going to check whether we can traverse this type or not
            rule = tp_rules.get(type(obj), default_tp_rule)
            if not rule[0]:
                req.log_error('Cannot traverse %s in %s because '
                              '%s is not a traversable object'
                              % (obj_str, req.unparsed_uri, obj), apache.APLOG_WARNING)
                raise apache.SERVER_RETURN, apache.HTTP_FORBIDDEN
        
        # we know it's OK to call getattr
        # note that getattr can really call some code because
        # of property objects (or attribute with __get__ special methods)...
        try:
            obj = getattr(obj, obj_str)
        except AttributeError:
            raise apache.SERVER_RETURN, apache.HTTP_NOT_FOUND

        # we process the authentication for the object
        realm, user, passwd = process_auth(req, obj, realm, user, passwd)
    
    # we're going to check if the final object is publishable
    rule = tp_rules.get(type(obj), default_tp_rule)
    if not rule[1]:

         req.log_error('Cannot publish %s in %s because '
                       '%s is not publishable'
                       % (obj_str, req.unparsed_uri, obj), apache.APLOG_WARNING)
         raise apache.SERVER_RETURN, apache.HTTP_FORBIDDEN

    return obj

# This regular expression is used to test for the presence of a closing
# HTML tag, written in upper or lower case, at the end of the output.
re_html = re.compile(r"</HTML\s*>\s*$",re.I)
re_charset = re.compile(r"charset\s*=\s*([^\s;]+)",re.I)

def publish_object(req, object):
    if callable(object):
        
        # To publish callables, we call them and recursively publish the result
        # of the call (as done by util.apply_fs_data)
        
        req.form = util.FieldStorage(req, keep_blank_values=1)
        return publish_object(req,util.apply_fs_data(object, req.form, req=req))

# TODO : we removed this as of mod_python 3.2, let's see if we can put it back
# in mod_python 3.3    
#     elif hasattr(object,'__iter__'):
#     
#         # To publish iterables, we recursively publish each item
#         # This way, generators can be published
#         result = False
#         for item in object:
#             result |= publish_object(req,item)
#         return result
#         
    else:
        if object is None:
            
            # Nothing to publish
            return False
            
        elif isinstance(object,UnicodeType):
            
            # We've got a Unicode string to publish, so we have to encode
            # it to bytes. We try to detect the character encoding
            # from the Content-Type header
            if req._content_type_set:

                charset = re_charset.search(req.content_type)
                if charset:
                    charset = charset.group(1)
                else:
                    # If no character encoding was set, we use UTF8
                    charset = 'UTF8'
                    req.content_type += '; charset=UTF8'

            else:
                # If no character encoding was set, we use UTF8
                charset = 'UTF8'
                
            result = object.encode(charset)
        else:
            charset = None
            result = str(object)
            
        if not req._content_type_set:
            # make an attempt to guess content-type
            # we look for a </HTML in the last 100 characters.
            # re.search works OK with a negative start index (it starts from 0
            # in that case)
            if re_html.search(result,len(result)-100):
                req.content_type = 'text/html'
            else:
                req.content_type = 'text/plain'
            if charset is not None:
                req.content_type += '; charset=%s'%charset
        
        # Write result even if req.method is 'HEAD' as Apache
        # will discard the final output anyway. Truncating
        # output here if req.method is 'HEAD' is likely the
        # wrong thing to do as it may cause problems for any
        # output filters. See MODPYTHON-105 for details. We
        # also do not flush output as that will prohibit use
        # of 'CONTENT_LENGTH' filter to add 'Content-Length'
        # header automatically. See MODPYTHON-107 for details.
        req.write(result, 0)

        return True
