Re: Update on HTMLParser

Jordi Salvat i Alabart Wed, 15 Oct 2003 08:23:43 -0700

Hi Peter.

Please find attached a regexp-based solution (still suboptimal, still
untested). Could you give it a quick try with the same test you're using
for your HTMLParser version, to see whether it's worth continuing to work
on it?

Thanks,

Jordi.

peter lin wrote:

I've updated the PDF with additional results. I ran a benchmark using the default Tomcat pages in console mode. Although there isn't a lot of difference in memory and CPU usage, it is consistently less than with Tidy. The big improvement in console mode for 5 clients is that the throughput in JMeter goes from 397 to 1075, roughly 2.7x higher. To me that looks pretty impressive, and it should make it easier for others to load test servers with fewer JMeter clients running.

I should be done with duplicating the existing functionality in the next day or two. I will make a list of the features people have requested for parsing HTML and start working on the ones that seem to add the highest value first.

peter

Jordi Salvat i Alabart <[EMAIL PROTECTED]> wrote:

I was not thinking about using regexps instead of a decent HTML parser, but if they were really faster, it could well be worth having both methods available. It would need to be _really_ faster to be worth the hassle, but from experience I know it could well be (although you also gave reasons to think it won't be).

You're right that HTML is dirty and the regexps will be difficult, but I'm familiar with the issue and already have some that I've used in Perl scripts. For example, this gets all image URIs:

(?si)<IMG[^>]*?\sSRC\s*=\s*"([^">]*)"
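As a rough illustration of how the image pattern above could be tried out with the Java 1.4+ java.util.regex package (rather than ORO), here is a minimal sketch. The class and method names (ImageUriSketch, imageUris) are hypothetical and not part of JMeter; the pattern borrows the (?=\s) lookahead from the REGEXP constant in the attached source, and like it only handles double-quoted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ImageUriSketch {

    // The Perl (?si) modifiers are expressed as compile flags here.
    private static final Pattern IMG_SRC = Pattern.compile(
        "<IMG(?=\\s)[^>]*?\\sSRC\\s*=\\s*\"([^\">]*)\"",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the SRC value of every IMG tag, in document order.
    static List<String> imageUris(String html) {
        List<String> uris = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            uris.add(m.group(1));
        }
        return uris;
    }

    public static void main(String[] args) {
        String html = "<p><img alt=\"x\"\n src = \"/tomcat.gif\">"
            + "<IMG SRC=\"a.png\"><img src=unquoted.gif>";
        // Unquoted attribute values are not matched by this pattern.
        System.out.println(imageUris(html)); // prints [/tomcat.gif, a.png]
    }
}
```

Note that the last tag in the sample is skipped: the pattern insists on double quotes, which is one of the ways dirty HTML can slip past a regexp.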

Others are more difficult -- for example stylesheets:
m{(?si)<LINK(?:[^>]*?\s(?:HREF\s*=\s*"([^">]*)"|REL\s*=\s*"stylesheet")){2,}}g

I'll give it a shot so that we can compare -- it's important, because I've seen that processing responses is one of JMeter's biggest CPU hogs. We will probably be able to use the results for extractors, too.
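The stylesheet pattern is harder to read, so here is a small sketch of it in java.util.regex form, assuming the same shape as the LINK alternative in the attached REGEXP constant. The names StylesheetUriSketch and stylesheetUris are made up for this example. The {2,} repetition requires at least two attribute matches inside one tag, so it is only an approximation of "has both HREF and REL=stylesheet": a tag with two HREF attributes would also match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class StylesheetUriSketch {

    // Each repetition consumes either an HREF="..." (captured in group 1)
    // or a REL="stylesheet" attribute; at least two are required per tag.
    private static final Pattern LINK_STYLESHEET = Pattern.compile(
        "<LINK(?=\\s)(?:[^>]*?\\s(?:HREF\\s*=\\s*\"([^\">]*)\""
            + "|REL\\s*=\\s*\"stylesheet\")){2,}",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the captured HREF of every matching LINK tag.
    static List<String> stylesheetUris(String html) {
        List<String> uris = new ArrayList<>();
        Matcher m = LINK_STYLESHEET.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            // Defensive: skip matches where no HREF was captured.
            if (href != null) {
                uris.add(href);
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" href=\"site.css\">"
            + "<link rel=\"alternate\" href=\"feed.xml\">";
        System.out.println(stylesheetUris(html));
    }
}
```

The second LINK tag in the sample has only one qualifying attribute, so the repetition cannot be satisfied and the tag is not reported.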

/*
 * ====================================================================
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001-2003 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in
 * the documentation and/or other materials provided with the
 * distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 * if any, must include the following acknowledgment:
 * "This product includes software developed by the
 * Apache Software Foundation (http://www.apache.org/)."
 * Alternately, this acknowledgment may appear in the software itself,
 * if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation" and
 * "Apache JMeter" must not be used to endorse or promote products
 * derived from this software without prior written permission. For
 * written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called "Apache",
 * "Apache JMeter", nor may "Apache" appear in their name, without
 * prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation.  For more
 * information on the Apache Software Foundation, please see
 * <http://www.apache.org/>.
 */
package org.apache.jmeter.protocol.http.sampler;


import java.io.ByteArrayInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Set;
import java.util.LinkedHashSet;
import java.util.Iterator;

import junit.framework.TestCase;

import org.apache.jmeter.samplers.Entry;
import org.apache.jmeter.samplers.SampleResult;
import org.apache.jorphan.logging.LoggingManager;
import org.apache.log.Logger;

// TODO: look at using Java 1.4 regexp instead of ORO.
import org.apache.oro.text.regex.MatchResult;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.PatternMatcherInput;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;
import org.apache.oro.text.regex.MalformedPatternException;

/**
 * A sampler that downloads downloadable components such as images, applets,
 * etc.
 * <p>
 * For HTML files, this class will download binary files specified in the
 * following ways (where <b>url</b> represents the binary file to be
 * downloaded):
 * <ul>
 *  <li>&lt;img src=<b>url</b> ... &gt;
 *  <li>&lt;applet code=<b>url</b> ... &gt;
 *  <li>&lt;input type=image src=<b>url</b> ... &gt;
 *  <li>&lt;body background=<b>url</b> ... &gt;
 * </ul>
 *
 * Note that files that are duplicated within the enclosing document will
 * only be downloaded once. Also, the processing does not take account of the
 * following parameters:
 * <ul>
 *  <li>&lt;base href=<b>url</b>&gt; TODO: REVIEW - THIS WORKS NOW
 *  <li>&lt; ... codebase=<b>url</b> ... &gt;
 * </ul>
 *
 * The following parameters are not accounted for either (as the textbooks
 * say, they are left as an exercise for the interested reader):
 * <ul>
 *  <li>&lt;applet ... codebase=<b>url</b> ... &gt;
 *  <li>&lt;area href=<b>url</b> ... &gt;
 *  <li>&lt;embed src=<b>url</b> ... &gt;
 *  <li>&lt;embed codebase=<b>url</b> ... &gt;
 *  <li>&lt;object codebase=<b>url</b> ... &gt;
 *  <li>&lt;table background=<b>url</b> ... &gt;
 *  <li>&lt;td background=<b>url</b> ... &gt;
 *  <li>&lt;tr background=<b>url</b> ... &gt;
 * </ul>
 * TODO: REVIEW THESE COMMENTS - SOME OF THESE ARE IMPLEMENTED NOW
 *
 * Due to the recent refactoring of this class, these might not be as difficult
 * as they once might have been.
 * <p>
 * Finally, this class does not process <b>Style Sheets</b> either.
 *
 * @author  Khor Soon Hin
 * @author  <a href="mailto:[EMAIL PROTECTED]">Martin Ramshaw</a>
 * @version $Id: HTTPSamplerFull.java,v 1.15 2003/09/07 18:58:17 sebb Exp $
 */
public class HTTPSamplerFull
{
    /**
     * Regular expression used against the HTML code to find the URIs of
     * images, etc.:
     */
    private static final String REGEXP=
        "<BASE(?=\\s)[^\\>]*?\\sHREF\\s*=\\s*\"([^\">]*)\""
        +"|<(?:IMG|SCRIPT)(?=\\s)[^\\>]*?\\sSRC\\s*=\\s*\"([^\">]*)\"";/*
        +"|<APPLET(?=\\s)[^\\>]*?\\sCODE(?:BASE)?\\s*=\\s*\"([^\">]*)\""
        +"|<(?:EMBED|OBJECT)(?=\\s)[^\\>]*?\\s(?:SRC|CODEBASE)\\s*=\\s*\"([^\">]*)\""
        +"|<(?:BODY|TABLE|TR|TD)(?=\\s)[^\\>]*?\\sBACKGROUND\\s*=\\s*\"([^\">]*)\""
        +"|<INPUT(?=\\s)(?:[^\\>]*?\\s(?:SRC\\s*=\\s*\"([^\">]*)\"|TYPE\\s*=\\s*\"image\")){2,}"
        +"|<LINK(?=\\s)(?:[^\\>]*?\\s(?:HREF\\s*=\\s*\"([^\">]*)\"|REL\\s*=\\s*\"stylesheet\")){2,}";*/

    /**
     * Compiled regular expression.
     */
    Pattern pattern;

    /**
     * Thread-local matcher:
     */
    private static ThreadLocal localMatcher = new ThreadLocal()
    {
        protected Object initialValue()
        {
            return new Perl5Matcher();
        }
    };

    /**
     * Thread-local input:
     */
    private static ThreadLocal localInput = new ThreadLocal()
    {
        protected Object initialValue()
        {
            return new PatternMatcherInput(new char[0]);
        }
    };

    /** Used to store the Logger (used for debug and error messages). */
    transient private static Logger log = LoggingManager.getLoggerForClass();

    /**
     * This is the only Constructor.
     */
    public HTTPSamplerFull()
    {
        super();

        // Compile the regular expression:
        try {
            Perl5Compiler c= new Perl5Compiler();
            // The masks are static constants, so reference them through
            // the class rather than the instance.
            pattern= c.compile(REGEXP,
                    Perl5Compiler.CASE_INSENSITIVE_MASK
                    |Perl5Compiler.SINGLELINE_MASK
                    |Perl5Compiler.READ_ONLY_MASK);
        }
        catch(MalformedPatternException mpe)
        {
            log.error("Internal error compiling regular expression in HTTPSamplerFull.");
            log.error("MalformedPatternException - " + mpe);
            throw new Error(mpe);
        }
    }

    /**
     * Samples the page using the <code>HTTPSampler</code> passed in and
     * returns the result as a <code>SampleResult</code>. The response (which
     * is assumed to be an HTML file) is scanned with a regular expression
     * for embedded binary files, which are then downloaded as sub-results.
     * <p>
     * Note that files that are duplicated within the enclosing document will
     * only be downloaded once.
     *
     * @param sampler the sampler used to fetch the page and its components
     * @return      results of the sampling
     */
    public SampleResult sample(HTTPSampler sampler)
    {
        // Sample the container page.
        log.debug("Start : HTTPSamplerFull sample");
        SampleResult res = sampler.sample(new Entry());
        if(log.isDebugEnabled())
        {
            log.debug("Main page loading time - " + res.getTime());
        }
        return parseForImages(res, sampler);
    }

    protected SampleResult parseForImages(SampleResult res, HTTPSampler sampler)
    {
        URL baseUrl;

        String displayName = res.getSampleLabel();

        try
        {
            baseUrl = sampler.getUrl();
            if(log.isDebugEnabled())
            {
                log.debug("baseUrl - " + baseUrl.toString());
            }
        }
        catch(MalformedURLException mfue)
        {
            log.error("Error creating URL '" + displayName + "'");
            log.error("MalformedURLException - " + mfue);
            res.setResponseData(mfue.toString().getBytes());
            res.setResponseCode(HTTPSampler.NON_HTTP_RESPONSE_CODE);
            res.setResponseMessage(HTTPSampler.NON_HTTP_RESPONSE_MESSAGE);
            res.setSuccessful(false);
            return res;
        }
        
        // This is used to ignore duplicated binary files.
        // Using a LinkedHashSet to avoid unnecessary overhead in iterating
        // the elements in the set later on. As a side-effect, this will keep
        // them roughly in order, which should be a better model of browser
        // behaviour.
        Set uniqueRLs = new LinkedHashSet();
        
        // Look for unique RLs to be sampled.
        Perl5Matcher matcher = (Perl5Matcher) localMatcher.get();
        PatternMatcherInput input = (PatternMatcherInput) localInput.get();
        // TODO: find a way to avoid the cost of creating a String here --
        // probably a new PatternMatcherInput working on a byte[] would do
        // better.
        input.setInput(new String(res.getResponseData()));
        while (matcher.contains(input, pattern)) {
            MatchResult match= matcher.getMatch();
            String s;
            if (log.isDebugEnabled()) log.debug("match groups "+match.groups());
            // Check for a BASE HREF:
            s= match.group(1);
            if (s!=null) {
                try {
                    // TODO: check the performance of URL-handling on 1.4.x,
                    // since it's probably pretty bad. If it is, either cache or
                    // implement it anew.
                    baseUrl= new URL(baseUrl, s);
                    log.debug("new baseUrl from - "+s+" - " + baseUrl.toString());
                }
                catch(MalformedURLException mfue)
                {
                    log.error("Error creating base URL from BASE HREF '" + displayName + "'");
                    log.error("MalformedURLException - " + mfue);
                    res.setResponseData(mfue.toString().getBytes());
                    res.setResponseCode(HTTPSampler.NON_HTTP_RESPONSE_CODE);
                    res.setResponseMessage(HTTPSampler.NON_HTTP_RESPONSE_MESSAGE);
                    res.setSuccessful(false);
                    return res;
                }
            }
            for (int g= 2; g < match.groups(); g++) {
                s= match.group(g);
                if (log.isDebugEnabled()) log.debug("group "+g+" - "+match.group(g));
                if (s!=null) uniqueRLs.add(s);
            }
        }

        // Iterate through the RLs and download each image:
        Iterator rls= uniqueRLs.iterator();
        while (rls.hasNext()) {
            String binUrlStr= (String)rls.next();
            SampleResult binRes = new SampleResult();
            
            // set the baseUrl and binUrl so that if an error occurs
            // due to a MalformedURLException then at least the values
            // will be visible to the user to aid correction
            binRes.setSampleLabel(baseUrl + "," + binUrlStr);

            URL binUrl;
            try {
                // TODO: check the performance of URL-handling on 1.4.x,
                // since it's probably pretty bad. If it is, either cache or
                // implement it anew.
                binUrl= new URL(baseUrl, binUrlStr);
            }
            catch(MalformedURLException mfue)
            {
                log.error("Error creating URL '" + baseUrl +
                          " , " + binUrlStr + "'");
                log.error("MalformedURLException - " + mfue);
                binRes.setResponseData(mfue.toString().getBytes());
                binRes.setResponseCode(HTTPSampler.NON_HTTP_RESPONSE_CODE);
                binRes.setResponseMessage(
                    HTTPSampler.NON_HTTP_RESPONSE_MESSAGE);
                binRes.setSuccessful(false);
                res.addSubResult(binRes);
                break;
            }
            if(log.isDebugEnabled())
            {
                log.debug("Binary url - " + binUrlStr);
                log.debug("Full Binary url - " + binUrl);
            }
            binRes.setSampleLabel(binUrl.toString());
            // a browser should be smart enough to *not* download
            //   a binary file that it already has in its cache.
            try
            {
                loadBinary(binUrl, binRes, sampler);
            }
            catch(Exception ioe)
            {
                log.error("Error reading from URL - " + ioe);
                binRes.setResponseData(ioe.toString().getBytes());
                binRes.setResponseCode(HTTPSampler.NON_HTTP_RESPONSE_CODE);
                binRes.setResponseMessage(
                    HTTPSampler.NON_HTTP_RESPONSE_MESSAGE);
                binRes.setSuccessful(false);
            }
            log.debug("Adding result");
            res.addSubResult(binRes);
            res.setTime(res.getTime() + binRes.getTime());
        }

        // Okay, we're all done now
        if(log.isDebugEnabled())
        {
            log.debug("Total time - " + res.getTime());
        }
        log.debug("End   : HTTPSamplerFull sample");
        return res;
    }

    /**
     * Download the binary file from the given <code>URL</code>.
     *
     * @param url     <code>URL</code> from where binary is to be downloaded
     * @param res     <code>SampleResult</code> to store sampling results
     * @param sampler sampler used to set up and read the connection
     * @return        binary downloaded
     *
     * @throws Exception indicates a problem connecting to or reading from
     *                   the URL
     */
    protected byte[] loadBinary(URL url, SampleResult res, HTTPSampler sampler)
        throws Exception
    {
        log.debug("Start : loadBinary");
        byte[] ret = new byte[0];
        res.setSamplerData(new HTTPSampler(url).toString());
        HttpURLConnection conn;
        try
        {
            conn = sampler.setupConnection(url, HTTPSampler.GET,res);
            sampler.connect();
        }
        catch(Exception ioe)
        {
            // don't do anything 'cos presumably the connection will return the
            // correct http response codes
            if(log.isDebugEnabled())
            {
                log.debug("loadBinary : error in setupConnection " + ioe);
            }
            throw ioe;
        }

        try
        {
            long time = System.currentTimeMillis();
            if(log.isDebugEnabled())
            {
                log.debug("loadBinary : start time - " + time);
            }
            int errorLevel = getErrorLevel(conn, res);
            if (errorLevel == 2)
            {
                ret = sampler.readResponse(conn);
                res.setSuccessful(true);
                long endTime = System.currentTimeMillis();
                if(log.isDebugEnabled())
                {
                    log.debug("loadBinary : end   time - " + endTime);
                }
                res.setTime(endTime - time);
            }
            else
            {
                res.setSuccessful(false);
                int responseCode =
                        ((HttpURLConnection)conn).getResponseCode();
                String responseMsg =
                        ((HttpURLConnection)conn).getResponseMessage();
                log.error("loadBinary : failed code - " + responseCode);
                log.error("loadBinary : failed message - " + responseMsg);
            }
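            if(log.isDebugEnabled())
            {
                // ret stays empty unless the download succeeded, so guard
                // the index access.
                if (ret.length >= 2)
                {
                    log.debug("loadBinary : binary - " + ret[0]+ret[1]);
                }
                log.debug("loadBinary : loadTime - " + res.getTime());
            }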
            log.debug("End   : loadBinary");
            res.setResponseData(ret);
            res.setDataType(SampleResult.BINARY);
            return ret;
        }
        finally
        {
            try
            {
                // the server can request that the connection be closed,
                // but if we requested that the server close the connection
                // the server should echo back our 'close' request.
                // Otherwise, let the server disconnect the connection
                // when its timeout period is reached.
                sampler.disconnect(conn);
            }
            catch(Exception e)
            {
                // Ignored: nothing useful can be done if disconnecting fails.
            }
        }
    }

    /**
     * Get the response code of the URL connection and divide it by 100 thus
     * returning 2 (for 2xx response codes), 3 (for 3xx response codes), etc.
     *
     * @param conn          <code>HttpURLConnection</code> of URL request
     * @param res           where all results of sampling will be stored
     * @return              HTTP response code divided by 100
     */
    protected int getErrorLevel(HttpURLConnection conn, SampleResult res)
    {
        log.debug("Start : getErrorLevel");
        int errorLevel = 2;
        try
        {
            int responseCode =
                    ((HttpURLConnection) conn).getResponseCode();
            String responseMessage =
                    ((HttpURLConnection) conn).getResponseMessage();
            errorLevel = responseCode/100;
            res.setResponseCode(String.valueOf(responseCode));
            res.setResponseMessage(responseMessage);
            if(log.isDebugEnabled())
            {
                log.debug("getErrorLevel : responseCode - " +
                        responseCode);
                log.debug("getErrorLevel : responseMessage - " +
                        responseMessage);
            }
        }
        catch (Exception e2)
        {
            log.error("getErrorLevel : " + conn.getHeaderField(0));
            log.error("getErrorLevel : " + conn.getHeaderFieldKey(0));
            log.error("getErrorLevel : " +
                    "Error getting response code for HttpUrlConnection - ",e2);
            res.setResponseData(e2.toString().getBytes());
            res.setResponseCode(HTTPSampler.NON_HTTP_RESPONSE_CODE);
            res.setResponseMessage(HTTPSampler.NON_HTTP_RESPONSE_MESSAGE);
            res.setSuccessful(false);
        }
        log.debug("End   : getErrorLevel");
        return errorLevel;
    }

    public static class Test extends TestCase
    {
        private HTTPSampler hsf;

        transient private static Logger log = LoggingManager.getLoggerForClass();

        public Test(String name)
        {
            super(name);
        }

        protected void setUp()
        {
            log.debug("Start : setUp1");
            hsf = new HTTPSampler();
            hsf.setMethod(HTTPSampler.GET);
            hsf.setProtocol("file");
            hsf.setPath("HTTPSamplerFullTestFile.txt");
            hsf.setImageParser(true);
            log.debug("End   : setUp1");
        }

        public void testGetUrlConfig()
        {
            log.debug("Start : testGetUrlConfig");
            assertEquals(HTTPSampler.GET, hsf.getMethod());
            assertEquals("file", hsf.getProtocol());
            assertEquals("HTTPSamplerFullTestFile.txt", hsf.getPath());
            log.debug("End   : testGetUrlConfig");
        }

        // Can't think of a self-contained way to test this 'cos it requires
        // http server.  Tried to use file:// but HTTPSampler's sample
        // specifically requires http.
        public void testSampleMain()
        {
            log.debug("Start : testSampleMain");
            // !ToDo : Have to wait till the day SampleResult is extended to
            // store results of all downloaded stuff e.g. images, applets etc
            String fileInput = "<html>\n\n" +
                    "<title>\n" +
                    "  A simple applet\n" +
                    "</title>\n" +
                    "<body background=\"back.jpg\" vlink=\"#dd0000\" "+
                            "link=\"#0000ff\">\n" +
                    "<center>\n" +
                    "<h2>   A simple applet\n" +
                    "</h2>\n" +
                    "<br>\n" +
                    "<br>\n" +
                    "<table>\n" +
                    "<td width = 20>\n" +
                    "<td width = 500 align = left>\n" +
                    "<img src=\"/tomcat.gif\">\n" +
                    "<img src=\"/tomcat.gif\">\n" +
                    "<a href=\"NervousText.java\"> Read my code <a>\n" +
                    "<p><applet code=NervousText.class width=400 " +
                    "height=200>\n" +
                    "</applet>\n" +
                    "<p><applet code=NervousText.class width=400 " +
                    "height=200>\n" +
                    "</applet>\n" +
                    "</table>\n" +
                    "<form>\n" +
                    "  <input type=\"image\" src=\"/tomcat-power.gif\">\n" +
                    "</form>\n" +
                    "<form>\n" +
                    "  <input type=\"image\" src=\"/tomcat-power.gif\">\n" +
                    "</form>\n" +
                    "</body>\n" +
                    "</html>\n";
            byte[] bytes = fileInput.getBytes();
            try
            {
                FileOutputStream fos =
                        new FileOutputStream("HTTPSamplerFullTestFile.txt");
                fos.write(bytes);
                fos.close();
            }
            catch(IOException ioe)
            {
                fail("Cannot create HTTPSamplerFullTestFile.txt in current " +
                    "directory for testing - " + ioe);
            }
            // !ToDo
            // hsf.sample(entry);
            assertNull("Cannot think of way to test sample", null);
            log.debug("End   : testSampleMain");
        }

        protected void tearDown()
        {
            log.debug("Start : tearDown");
            hsf = null;
            log.debug("End   : tearDown");
        }
    }
}
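
The de-duplication and base-URL resolution steps in parseForImages above can be tried in isolation. The following is a minimal sketch, not part of the attached file: the class and method names (ResolveSketch, resolveUnique) are hypothetical, and it uses java.net.URI.resolve in place of the new URL(baseUrl, s) constructor used in the attachment.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class ResolveSketch {

    // Drops duplicate references (keeping first-seen order, as the
    // LinkedHashSet in parseForImages does) and resolves each one
    // against the base URI.
    static List<String> resolveUnique(String base, List<String> refs) {
        URI baseUri = URI.create(base);
        Set<String> unique = new LinkedHashSet<>(refs);
        List<String> resolved = new ArrayList<>();
        for (String ref : unique) {
            resolved.add(baseUri.resolve(ref).toString());
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> refs = Arrays.asList("/tomcat.gif", "back.jpg", "/tomcat.gif");
        // prints [http://localhost:8080/tomcat.gif, http://localhost:8080/docs/back.jpg]
        System.out.println(
            resolveUnique("http://localhost:8080/docs/index.html", refs));
    }
}
```

As in the attachment, duplicates are detected on the raw attribute strings before resolution, so two different spellings of the same resource would still be fetched twice.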
