On Friday 19 March 2004 21:06, you wrote:
> DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
> RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
> <http://issues.apache.org/bugzilla/show_bug.cgi?id=27802>.
> ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
> INSERTED IN THE BUG DATABASE.
>
> http://issues.apache.org/bugzilla/show_bug.cgi?id=27802
>
> EncodeURLTransformer encodes off site links
>
>            Summary: EncodeURLTransformer encodes off site links
>            Product: Cocoon 2
>            Version: Current CVS 2.1
>           Platform: Other
>         OS/Version: Other
>             Status: NEW
>           Severity: Normal
>           Priority: Other
>          Component: sitemap components
>         AssignedTo: [EMAIL PROTECTED]
>         ReportedBy: [EMAIL PROTECTED]
>
>
> EncodeURLTransformer URL encodes all types of off-site links
> like news://foo.bla.com, http://other.host.net/foo/bla mailto:[EMAIL PROTECTED]
> ftp://foo.bla.com.
>
> I'm not sure if this is a bug in EncodeURLTransformer or in the
> encodeURL() method of the ServletContainer.
> Jetty and Tomcat behave the same in this regard.

Hi,

I've added regexp matching not just on element/attribute pairs, but also on
the attribute value (the URL) itself.
This lets the EncodeURLTransformer leave references to external sources
(mailto:, telnet:, ftp:, etc.) untouched.
With the default settings the EncodeURLTransformer will in most cases only
transform relative links and will leave absolute ones as they are.
Document fragments referenced by <a href="#some_reference" ..> are not
touched by default either.
The URL regexp patterns can be configured in the sitemap if the default is
not suitable:
<map:transformer logger="sitemap.transformer.encodeURL" name="encodeURL" 
src="org.apache.cocoon.transformation.EncodeURLTransformer">
  <exclude-urls>http:.*|#.*|myprotocol.*</exclude-urls>
</map:transformer>
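
To make the matching order easier to review, here is a minimal sketch of the
decision the transformer now takes for each attribute. It is not the patch
code: java.util.regex stands in for org.apache.regexp.RE, the class name
UrlExcludeSketch is made up, and the exclude-urls list is shortened.

```java
import java.util.regex.Pattern;

// Illustrative stand-in only: java.util.regex replaces org.apache.regexp.RE,
// and the class name and the shortened URL list are assumptions.
class UrlExcludeSketch {
    static final Pattern INCLUDE_NAME =
        Pattern.compile(".*/@href|.*/@action|frame/@src", Pattern.CASE_INSENSITIVE);
    static final Pattern EXCLUDE_NAME =
        Pattern.compile("img/@src", Pattern.CASE_INSENSITIVE);
    // Shortened version of the default exclude-urls pattern.
    static final Pattern EXCLUDE_URLS =
        Pattern.compile("http:.*|https:.*|ftp:.*|#.*|mailto:.*", Pattern.CASE_INSENSITIVE);

    // True if the attribute value should be passed on to URL encoding.
    static boolean shouldEncode(String element, String attr, String url) {
        // Same canonical form as the matcher: element name + "/@" + attribute name.
        String key = element + "/@" + attr;
        return !EXCLUDE_NAME.matcher(key).find()    // 1. not excluded by name
            && INCLUDE_NAME.matcher(key).find()     // 2. included by name
            && !EXCLUDE_URLS.matcher(url).find();   // 3. value not an excluded URL
    }

    public static void main(String[] args) {
        String[][] samples = {
            {"a", "href", "foo/bar.html"},
            {"a", "href", "http://other.host.net/foo/bla"},
            {"a", "href", "#some_reference"},
            {"img", "src", "logo.png"},
        };
        for (String[] s : samples) {
            System.out.println(s[0] + "/@" + s[1] + " = " + s[2]
                + " -> encode: " + shouldEncode(s[0], s[1], s[2]));
        }
    }
}
```

Note that find() matches anywhere in the string. Whether RE.match behaves
the same way is an assumption worth checking during review, since it decides
whether a relative link such as page.html#top counts as excluded.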

Further, I've refactored the inner class ElementAttributeMatching into an
outer class called
org.apache.cocoon.transformation.helpers.ElementAttributeURLMatcher,
because I think it will be of use in
org.apache.cocoon.blocks.portal.transformation.HTMLEventLinkTransformer, which
has some of the same problems EncodeURLTransformer has had.

I hope this patch doesn't introduce any new bugs.
Please review.

-- 
lg, Chris

--- cocoon-2.1/src/java/org/apache/cocoon/transformation/EncodeURLTransformer.java	2004-03-16 14:24:57.000000000 +0100
+++ org/apache/cocoon/transformation/EncodeURLTransformer.java	2004-03-25 19:50:19.000000000 +0100
@@ -29,9 +29,9 @@
 import org.apache.cocoon.environment.Session;
 import org.apache.cocoon.environment.SourceResolver;
 import org.apache.cocoon.transformation.AbstractTransformer;
+import org.apache.cocoon.transformation.helpers.ElementAttributeURLMatcher;
 import org.apache.excalibur.source.SourceValidity;
 import org.apache.excalibur.source.impl.validity.NOPValidity;
-import org.apache.regexp.RE;
 import org.apache.regexp.RESyntaxException;
 import org.xml.sax.Attributes;
 import org.xml.sax.SAXException;
@@ -52,6 +52,10 @@
  * <p>
  *   You can specify which attributes hold URL values in order to restrict
  *   URL rewriting to specific attributes only.
+ *   Additionally you can specify URL regular expressions to limit URL rewriting.
+ *   By default all URLs using a scheme listed at
+ *   http://www.iana.org/assignments/uri-schemes are excluded. That means only
+ *   relative URLs will be rewritten! Relative document fragment URLs referenced by "#" are also excluded by default.
  * </p>
  * <p>
  * Usage in a sitemap:
@@ -63,8 +67,9 @@
  *     ...
  *       &lt;map:transformer type=&quot;encodeURL&quot;
  *         src=&quot;org.apache.cocoon.optional.transformation.EncodeURLTransformer&quot;&gt;
- *         &lt;exclude-name&gt;img/@src&lt;/exclude-name&gt;
- *         &lt;include-name&gt;.&amp;asterik;/@href|.&amp;asterik;/@src|.&amp;asterik;/@action&lt;/include-name&gt;
+ *         &lt;exclude-name&gt;img/@src&lt;/exclude-name&gt; * 
+ *         &lt;include-name&gt;.&asterik;/@href|.&asterik;/@src|.&asterik;/@action&lt;/include-name&gt;
+ *         &lt;exclude-urls&gt;http:.&asterik;|https:.&asterik;|#.&asterik;|ftp:.&asterik;&lt;/exclude-urls&gt;
  *       &lt;/map:transformer&gt;
  *   ...
  *   &lt;map:pipelines&gt;
@@ -94,21 +99,43 @@
     public final static String INCLUDE_NAME = "include-name";
 
     /**
+     * Configuration name for specifying excluding url patterns,
+     * ie exclude-urls.
+     */
+    public final static String EXCLUDE_URLS = "exclude-urls";    
+    
+    /**
      * Configuration default exclude pattern,
      * ie img/@src
      */
     public final static String EXCLUDE_NAME_DEFAULT = "img/@src";
 
     /**
-     * Configuration default exclude pattern,
+     * Configuration default include pattern,
      * ie .*\/@href|.*\/@action|frame/@src
      */
     public final static String INCLUDE_NAME_DEFAULT = ".*/@href|.*/@action|frame/@src";
 
+    /**
+     * Default URL patterns that will not be encoded. The scheme list is from
+     * http://www.iana.org/assignments/uri-schemes
+     */
+    public final static String EXCLUDE_URLS_DEFAULT =
+        "http:.*|https:.*|ftp:.*|#.*|mailto:.*|news:.*|" + 
+        "nntp:.*|telnet:.*|prospero:.*|z39.50s:.*|z39.50r:.*|" +
+        "cid:.*|mid:.*|vemmi:.*|service:.*|imap:.*|nfs:.*|" +
+        "acap:.*|rtsp:.*|tip:.*|pop:.*|data:.*|dav:.*|gopher:.*|" +
+        "opaquelocktoken:.*|sip:.*|sips:.*|tel:.*|fax:.*|" +
+        "modem:.*|ldap:.*|soap.beep:.*|soap.beeps:.*|afs:.*|" +
+        "xmlrpc.beep:.*|xmlrpc.beeps:.*|urn:.*|go:.*|h323:.*|" +
+        "ipp:.*|tftp:.*|mupdate:.*|pres:.*|im:.*|wais:.*|" +
+        "file:.*|tn3270:.*|mailserver:.*";
+    
     private String includeNameConfigure = INCLUDE_NAME_DEFAULT;
     private String excludeNameConfigure = EXCLUDE_NAME_DEFAULT;
+    private String excludeURLsConfigure = EXCLUDE_URLS_DEFAULT; 
 
-    private ElementAttributeMatching elementAttributeMatching;
+    private ElementAttributeURLMatcher matcher;
     private Response response;
     private boolean isEncodeURLNeeded;
     private Session session;
@@ -158,8 +185,12 @@
                                                                this.includeNameConfigure);
             final String excludeName = parameters.getParameter(EXCLUDE_NAME,
                                                                this.excludeNameConfigure);
+            final String excludeURLs = parameters.getParameter(EXCLUDE_URLS,
+                                                               this.excludeURLsConfigure);            
+                        
             try {
-                this.elementAttributeMatching = new ElementAttributeMatching(includeName, excludeName);
+                this.matcher = new ElementAttributeURLMatcher(includeName, excludeName,
+                                                              excludeURLs);
             } catch (RESyntaxException reex) {
                 final String message = "Cannot parse include-name: " + includeName + " " +
                     "or exclude-name: " + excludeName + "!";
@@ -184,6 +215,9 @@
         child = configuration.getChild(EXCLUDE_NAME);
         this.excludeNameConfigure = child.getValue(EXCLUDE_NAME_DEFAULT);
 
+        child = configuration.getChild(EXCLUDE_URLS);
+        this.excludeURLsConfigure = child.getValue(EXCLUDE_URLS_DEFAULT);
+
         if (this.includeNameConfigure == null) {
             String message = "Configure " + INCLUDE_NAME + "!";
             throw new ConfigurationException(message);
@@ -192,6 +226,10 @@
             String message = "Configure " + EXCLUDE_NAME + "!";
             throw new ConfigurationException(message);
         }
+        if (this.excludeURLsConfigure == null) {
+            String message = "Configure " + EXCLUDE_URLS + "!";
+            throw new ConfigurationException(message);
+        }        
     }
 
 
@@ -202,7 +240,7 @@
         super.recycle();
         this.response = null;
         this.session = null;
-        this.elementAttributeMatching = null;
+        this.matcher = null;
     }
 
 
@@ -245,7 +283,7 @@
      */
     public void startElement(String uri, String name, String raw, Attributes attributes)
     throws SAXException {
-        if (this.isEncodeURLNeeded && this.elementAttributeMatching != null) {
+        if (this.isEncodeURLNeeded && this.matcher != null) {
             String lname = name;
             if (attributes != null && attributes.getLength() > 0) {
                 AttributesImpl new_attributes = new AttributesImpl(attributes);
@@ -254,7 +292,7 @@
 
                     String value = new_attributes.getValue(i);
 
-                    if (elementAttributeMatching.matchesElementAttribute(lname, attr_lname)) {
+                    if (matcher.matchesElementAttributeURL(lname, attr_lname, value)) {
                         // don't use simply this.response.encodeURL(value)
                         // but be more smart about the url encoding
                         final String new_value = this.encodeURL(value);
@@ -300,105 +338,5 @@
         }
         return encoded_url;
     }
-    
-    /**
-     * A helper class for matching element names, and attribute names.
-     *
-     * <p>
-     *  For given include-name, exclude-name decide if element-attribute pair
-     *  matches. This class defines the precedence and matching algorithm.
-     * </p>
-     *
-     * @author     <a href="mailto:[EMAIL PROTECTED]">Bernhard Huber</a>
-     * @version    CVS $Id: EncodeURLTransformer.java,v 1.7 2004/03/06 04:58:58 joerg Exp $
-     */
-    public class ElementAttributeMatching {
-        /**
-         * Regular expression of including patterns
-         *
-         */
-        protected RE includeNameRE;
-        /**
-         * Regular expression of excluding patterns
-         *
-         */
-        protected RE excludeNameRE;
-
-
-        /**
-         *Constructor for the ElementAttributeMatching object
-         *
-         * @param  includeName            Description of Parameter
-         * @param  excludeName            Description of Parameter
-         * @exception  RESyntaxException  Description of Exception
-         */
-        public ElementAttributeMatching(String includeName, String excludeName) throws RESyntaxException {
-            includeNameRE = new RE(includeName, RE.MATCH_CASEINDEPENDENT);
-            excludeNameRE = new RE(excludeName, RE.MATCH_CASEINDEPENDENT);
-        }
-
-
-        /**
-         * Return true iff element_name attr_name pair is not matched by exclude-name,
-         * but is matched by include-name
-         *
-         * @param  element_name
-         * @param  attr_name
-         * @return               boolean true iff value of attribute_name should get rewritten, else
-         *   false.
-         */
-        public boolean matchesElementAttribute(String element_name, String attr_name) {
-            String element_attr_name = canonicalizeElementAttribute(element_name, attr_name);
-
-            if (excludeNameRE != null && includeNameRE != null) {
-                return !matchesExcludesElementAttribute(element_attr_name) &&
-                        matchesIncludesElementAttribute(element_attr_name);
-            } else {
-                return false;
-            }
-        }
-
-
-        /**
-         * Build from elementname, and attribute name a single string.
-         * <p>
-         *   String concatenated <code>element name + "/@" + attribute name</code>
-         *   is matched against the include and excluding patterns.
-         * </p>
-         *
-         * @param  element_name  Description of Parameter
-         * @param  attr_name     Description of Parameter
-         * @return               Description of the Returned Value
-         */
-        private String canonicalizeElementAttribute(String element_name, String attr_name) {
-            return element_name + "/@" + attr_name;
-        }
-
-
-        /**
-         * Return true iff element_name attr_name pair is matched by exclude-name.
-         *
-         * @param  element_attr_name
-         * @return                    boolean true iff exclude-name matches element_name, attr_name, else
-         *   false.
-         */
-        private boolean matchesExcludesElementAttribute(String element_attr_name) {
-            boolean match = excludeNameRE.match(element_attr_name);
-            return match;
-        }
-
-
-        /**
-         * Return true iff element_name attr_name pair is matched by include-name.
-         *
-         * @param  element_attr_name
-         * @return                    boolean true iff include-name matches element_name, attr_name, else
-         *   false.
-         */
-        private boolean matchesIncludesElementAttribute(String element_attr_name) {
-            boolean match = includeNameRE.match(element_attr_name);
-            return match;
-        }
-    }
 }
 
/*
 
   Copyright 2003 Bernhard Huber <[EMAIL PROTECTED]>

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

*/ 
 
package org.apache.cocoon.transformation.helpers;

import org.apache.regexp.RE;
import org.apache.regexp.RESyntaxException;


/**
 * A helper class for matching element names, attribute names and url
 * regular expressions.
 *
 * <p>
 *  For given include-name, exclude-name and url regexp decide if
 *  element-attribute pair and url matches.
 *  This class defines the precedence and matching algorithm.
 * </p>
 *
 * @author     <a href="mailto:[EMAIL PROTECTED]">Bernhard Huber</a>
 * @version    CVS $Id: EncodeURLTransformer.java,v 1.7 2004/03/06 04:58:58 joerg Exp $
 */
public class ElementAttributeURLMatcher {
    /**
     * Regular expression of including patterns
     *
     */
    protected RE includeNameRE;
    /**
     * Regular expression of excluding patterns
     *
     */
    protected RE excludeNameRE;
    
    /**
     * Regular expression of excluding urls
     *
     */
    protected RE excludeURLsRE;        


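    /**
     * Constructor for the ElementAttributeURLMatcher object
     *
     * @param  includeName            regexp of element/@attribute names to include
     * @param  excludeName            regexp of element/@attribute names to exclude
     * @param  excludeURLs            regexp of attribute values (URLs) to exclude
     * @exception  RESyntaxException  if one of the patterns cannot be compiled
     */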
    public ElementAttributeURLMatcher(String includeName,
        String excludeName, String excludeURLs)
    throws RESyntaxException {
        includeNameRE = new RE(includeName, RE.MATCH_CASEINDEPENDENT);
        excludeNameRE = new RE(excludeName, RE.MATCH_CASEINDEPENDENT);
        excludeURLsRE = new RE(excludeURLs, RE.MATCH_CASEINDEPENDENT);
    }


    /**
     * Return true iff element_name attr_name pair is not matched by exclude-name,
     * but is matched by include-name, and attr_value is not matched by exclude-urls
     *
     * @param  element_name
     * @param  attr_name
     * @param  attr_value    the attribute value, i.e. the URL candidate
     * @return               boolean true iff value of attribute_name should get rewritten, else
     *   false.
     */
    public boolean matchesElementAttributeURL(String element_name,
        String attr_name, String attr_value) {
        String element_attr_name = canonicalizeElementAttribute(element_name, attr_name);

        if (excludeNameRE != null && includeNameRE != null && excludeURLsRE != null) {
            return !matchesExcludesElementAttribute(element_attr_name) &&
                   matchesIncludesElementAttribute(element_attr_name) &&
                   !matchesExcludesURLs(attr_value);
        } else {
            return false;
        }
    }


    /**
     * Build from elementname, and attribute name a single string.
     * <p>
     *   String concatenated <code>element name + "/@" + attribute name</code>
     *   is matched against the include and excluding patterns.
     * </p>
     *
     * @param  element_name  Description of Parameter
     * @param  attr_name     Description of Parameter
     * @return               Description of the Returned Value
     */
    private String canonicalizeElementAttribute(String element_name, String attr_name) {
        return element_name + "/@" + attr_name;
    }


    /**
     * Return true if element_name attr_name pair is matched by exclude-name.
     *
     * @param  element_attr_name
     * @return                    boolean true iff exclude-name matches element_name, attr_name, else
     *   false.
     */
    private boolean matchesExcludesElementAttribute(String element_attr_name) {
        boolean match = excludeNameRE.match(element_attr_name);
        return match;
    }


    /**
     * Return true if element_name attr_name pair is matched by include-name.
     *
     * @param  element_attr_name
     * @return                    boolean true iff include-name matches element_name, attr_name, else
     *   false.
     */
    private boolean matchesIncludesElementAttribute(String element_attr_name) {
        boolean match = includeNameRE.match(element_attr_name);
        return match;
    }            

    
    /**
     * Test if the url is in the URL exclude list.
     *
     * @param  url  the attribute value to test against exclude-urls
     * @return      true if the URL matches.
     */
    protected boolean matchesExcludesURLs(String url) {
        boolean match = excludeURLsRE.match(url);
        return match; 
    } 
}
