Re: [jira] Commented: (DROIDS-45) Fail to resovle outlink correctly

Mingfai Fri, 03 Apr 2009 04:31:35 -0700

well.. to backup my claim that creating instance has no difference in
performance, i did a profiling+benchmark base on the test case that i have
attached to the JIRA issue. There are two scenarios:


   1.  using the original create an instance per URI, and run a set of 20
   resolve operation for 100,000 times, i.e. resolve URI for 2M times.
   2. change the resolve method to a static method and run the same
   20x100,000 resolve operation

Attached is the source of the test. benchmark1() and benchmark2() are the
benchmark code. To run it, change the main() method to call either one of
the benchmark method. The println elapse time is not meaningful. You need a
profiler to get an accurate figure.

I run with JProfiler and attached are the CPU usage trees for the two cases.
My interpretation:

   - surprisingly, the total runtime is identical. both are 136s. (i thought
   the static version should be slightly faster; maybe one is 136.4 and the
   other is 136.0)
   - the fourth level item shows there are around 2M invocations. i.e. there
   are 2M URI be resolved.
   - from the instance.png, you will find the highlighted constructor init
   method costed 3% or 4s of time. This is the point I wish to deliver. 4s for
   2M link is nothing. (and 4s refers to my laptop's CPU)
   - if we want to optimize, we probably want to change the regex replaceAll
   with some manual String replacement. It will have greater impact. From the
   graph, you'll 12% of time are spent on that.
   - remarks: there are some flaws as the number of inv. in two cases do not
   match. But i suppose it doesn't cause meaningful difference.

And I also want to say, micro benchmark like this is not too meaningful. I
think we could easily improve the efficiency by a larger factor by profiling
the whole Droids application. And I guess memory usage is another big
concern. (esp for me as i use DOM parser)

p.s. I am *not* suggesting we should have a Link Resolver that write in my
style. I just follow the a style similar to those HTML Parser to code the
URI resolving workaround and test case quickly. The only advantage I see is
I may potentially do other operations like normalize to change the stored
"link", and constructor with other params, but the adv is not significant
and I'm fine to implement this in whatever way.

regards,
mingfai



On Fri, Apr 3, 2009 at 5:40 PM, Mingfai <[email protected]> wrote:

> hi,
>
> thx. I'd like to see the requirement be included in Droids, and the
> implementation is not an issue to me. My code are not in droids package and
> I didn't change the Link Extractor to use it at all. They are just for
> reference and don't worry if you want to discard them. It would be great if
> the Link Extractor could be refactored, so for the next time, i could submit
> a patch.
>
> re. instance. I believe in modern JVM , there is no difference in
> performance, unless you keep a reference to it and not releasing it for GC.
> And btw, I don't think my implementation that uses indexof instead of regex
> makes meaningful improvement, too. It's more my coding style.  And indeed, I
> don't really care about "micro" performance for a crawler. My primary
> language for the project is Groovy that is way slower than pure Java. :-)
>
> regards,
> mingfai
>
>
>
> On Fri, Apr 3, 2009 at 4:11 PM, Thorsten Scherler <
> [email protected]> wrote:
>
>> On Thu, 2009-04-02 at 06:10 -0700, Mingfai Ma (JIRA) wrote:
>> > [
>> https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694995#action_12694995]
>> >
>> > Mingfai Ma commented on DROIDS-45:
>> > ----------------------------------
>> >
>> > the LinkExtractor doesn't append '/' automatically.
>>
>> Hmm, I just asked Javier to have a look into this since he had been the
>> last that worked with the code. Will try to find some time to debug this
>> weekend since need to finish a project ATM.
>>
>> One small thing I found in your class is the constructor, it is not
>> optimal since we would be forced to create a lot of instances (for each
>> base/link), that needs rethinking to reuse the class.
>>
>> salu2
>>
>>
>> > and I think it shouldn't, as it is possible for a server to handle with
>> and without '/' differently. For root domain URL, it may be ok. but for
>> deeper URL, we can't just assume the last segment of the request path is a
>> directory
>> >
>> > Apache mod_dir should append a trailing slash but unfortunately, not all
>> web server on the internet have this feature enabled :-)
>> > http://httpd.apache.org/docs/2.2/mod/mod_dir.html
>> >
>> > > Fail to resovle outlink correctly
>> > > ---------------------------------
>> > >
>> > >                 Key: DROIDS-45
>> > >                 URL: https://issues.apache.org/jira/browse/DROIDS-45
>> > >             Project: Droids
>> > >          Issue Type: Bug
>> > >          Components: core
>> > >    Affects Versions: 0.01
>> > >            Reporter: Mingfai Ma
>> > >
>> > > I've encountered several cases that outlinks are not extracted
>> correctly. Most are cause by the use of URI.resolve().
>> > > 1. For a base URI of new URI("http://www.domain.com";), <a
>> href="test.html">test.html</a> will be resolved to
>> http://www.domain.comtest.html
>> > > 2. For a base URI of new URI("http://www.domain.com/index.php";), <a
>> href="?test=true">test with param</a> will be resolved to
>> http://www.domain.com/?test=true
>> > > 3. for <a href="http://www.yahoo.com\n";>line break!</a>, URL.resolve
>> will throw exception. And in a browser, it can resolves the URI. (remarks: I
>> didn't check if this scenario affect the default Tika/NekoHTML parsing. )
>> > > I suspect there are many different scenarios, many of them are
>> probably caused by non-standard usage. (but a crawler has to handle
>> non-standard usage in order to function) Obviously, we cannot cater every
>> case, and I suggest to consider a resolve failure as a bug if a link works
>> in a Mozilla browser but not in Droids LinkExtractor.
>> > > this issue is related to the LinkExtractor created in DROIDS-8
>> >
>> --
>> Thorsten Scherler <thorsten.at.apache.org>
>> Open Source Java <consulting, training and solutions>
>>
>> Sociedad Andaluza para el Desarrollo de la Sociedad
>> de la Información, S.A.U. (SADESI)
>>
>>
>>
>>
>>
>

package benchmark;

import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * It tries to resolve relative link and handle any possible value in the href attribute of the <a> HTML tag, including
 * javascript:
 * <p/>
 * The following is a list of special cases:
 * - mailto:[email protected], news:comp.lang.java, urn:isbn:096139210x
 */
public class LinkResolver {
    public static final String[] IGNORED_PREFIX = new String[]{"javascript:", "mailto:"};
    protected URI base;
    protected String link;

    public LinkResolver(URI base, String link) {
        if (base == null) throw new IllegalArgumentException("base cannot be null");
        this.base = base;
        this.link = link;
    }

    public URI resolve() {
        try {
            String target = normalizeURL(link);

            for (String ignored : IGNORED_PREFIX) {
                if (target.indexOf(ignored) != -1) return null;
            }

            int sharpSymbolIndex = (target.length() > 0) ? target.indexOf('#') : -1;
            if (sharpSymbolIndex != -1) {
                target = target.substring(0, sharpSymbolIndex);
            }

            if (target.length() == 0) {
                return base;
            } else if (target.indexOf(":/") != -1) { //has :/
                return new URI(target);
            } else {
                if (target.charAt(0) != '/') target = '/' + target;
                return base.resolve(target);
            }
        } catch (Exception e) {
        }
        return null;
    }


    private static String normalizeURL(String uri) throws UnsupportedEncodingException {
        if (uri == null) return "";
        String target = uri.trim();
        if (target.indexOf(' ') != -1) target = target.replaceAll("\\s", "+");
        return target;
    }

    public String toString() {
        return super.toString() + ": base: " + base + ", link: " + link;
    }

    public static URI resolve(URI base, String link) {
        return new LinkResolver(base, link).resolve();
    }

    public static URI staticResolve(URI base, String link) {
        try {
            String target = normalizeURL(link);

            for (String ignored : IGNORED_PREFIX) {
                if (target.indexOf(ignored) != -1) return null;
            }

            int sharpSymbolIndex = (target.length() > 0) ? target.indexOf('#') : -1;
            if (sharpSymbolIndex != -1) {
                target = target.substring(0, sharpSymbolIndex);
            }

            if (target.length() == 0) {
                return base;
            } else if (target.indexOf(":/") != -1) { //has :/
                return new URI(target);
            } else {
                if (target.charAt(0) != '/') target = '/' + target;
                return base.resolve(target);
            }
        } catch (Exception e) {
        }
        return null;
    }

    public static void main(String[] args) {
        benchmark2();
    }

    static void benchmark1() {
        long startTime = System.currentTimeMillis();
        URI base = null;
        try {
            base = new URI("http://www.apache.org";);
            int max = 100000;
            for (int i = 0; i < max; i++) {
                new LinkResolver(base, null).resolve().toString();
                new LinkResolver(base, "").resolve().toString();
                new LinkResolver(base, "/").resolve().toString();

                new LinkResolver(base, "/index.html").resolve().toString();
                new LinkResolver(base, "index.html").resolve().toString();
                new LinkResolver(base, "?test=true").resolve().toString();
                new LinkResolver(base, "/?test=true").resolve().toString();
                new LinkResolver(base, "/index.jsp?test=true").resolve().toString();
                new LinkResolver(base, "/index.html#header1").resolve().toString();
                new LinkResolver(base, "/index.html?test=true#header1").resolve().toString();
                new LinkResolver(base, "http://httpd.apache.org";).resolve().toString();
                new LinkResolver(base, "mailto:[email protected]").resolve();
                new LinkResolver(base, "javascript:[email protected]").resolve();
                new LinkResolver(base, "javaTYPOscript:[email protected]").resolve().toString();
                new LinkResolver(base, "news:comp.lang.java";).resolve().toString();
                new LinkResolver(base, "news://news.apache.org/comp.lang.java";).resolve().toString();
                new LinkResolver(base, "/index.html\n").resolve().toString();
                new LinkResolver(base, "/index.jsp?message=??").resolve().toString();
                new LinkResolver(base, "http://www.apache.org/index.jsp?message=??";).resolve().toString();
                new LinkResolver(base, "/index.jsp?message=debug info").resolve().toString();
            }
            System.out.println(System.currentTimeMillis() - startTime);
        } catch (URISyntaxException e) {
            e.printStackTrace();  //To change body of catch statement use File | Settings | File Templates.
        }
    }

    static void benchmark2() {
        long startTime = System.currentTimeMillis();
        URI base = null;
        try {
            base = new URI("http://www.apache.org";);
            int max = 100000;
            for (int i = 0; i < max; i++) {
                LinkResolver.staticResolve(base, null).toString();
                LinkResolver.staticResolve(base, "").toString();
                LinkResolver.staticResolve(base, "/").toString();

                LinkResolver.staticResolve(base, "/index.html").toString();
                LinkResolver.staticResolve(base, "index.html").toString();
                LinkResolver.staticResolve(base, "?test=true").toString();
                LinkResolver.staticResolve(base, "/?test=true").toString();
                LinkResolver.staticResolve(base, "/index.jsp?test=true").toString();
                LinkResolver.staticResolve(base, "/index.html#header1").toString();
                LinkResolver.staticResolve(base, "/index.html?test=true#header1").toString();
                LinkResolver.staticResolve(base, "http://httpd.apache.org";).toString();
                LinkResolver.staticResolve(base, "mailto:[email protected]");
                LinkResolver.staticResolve(base, "javascript:[email protected]");
                LinkResolver.staticResolve(base, "javaTYPOscript:[email protected]").toString();
                LinkResolver.staticResolve(base, "news:comp.lang.java";).toString();
                LinkResolver.staticResolve(base, "news://news.apache.org/comp.lang.java";).toString();
                LinkResolver.staticResolve(base, "/index.html\n").toString();
                LinkResolver.staticResolve(base, "/index.jsp?message=??").toString();
                LinkResolver.staticResolve(base, "http://www.apache.org/index.jsp?message=??";).toString();
                LinkResolver.staticResolve(base, "/index.jsp?message=debug info").toString();
            }
            System.out.println(System.currentTimeMillis() - startTime);
        } catch (URISyntaxException e) {
            e.printStackTrace();  //To change body of catch statement use File | Settings | File Templates.
        }
    }
}

Re: [jira] Commented: (DROIDS-45) Fail to resovle outlink correctly

Reply via email to