On Saturday, 10 January 2015 at 17:23:31 UTC, Ola Fosheim Grøstad wrote:
> For the challenge to make sense it would entail parsing all
> legal HTML5 documents, extracting all resource links,
> converting them into absolute form and printing them one per
> line. With no hiccups.
Though, that's still a library thing rather than a language thing.
dom.d and the Url struct in cgi.d should be able to do all that,
in just a few lines even, but that's just because I've done a
*lot* of web scraping with the libs before so I made them work
for that.
In fact... let me do it. I'll use my http2.d instead of
cgi.d, actually; it has a similar Url struct, just more focused
on client requests.
import arsd.dom;
import arsd.http2;

import std.stdio;

void main() {
    auto base = Uri("http://www.stroustrup.com/C++.html");

    // http2 is a newish module of mine that aims to imitate
    // a browser in some ways (without depending on curl btw)
    auto client = new HttpClient();
    auto request = client.navigateTo(base);

    auto document = new Document();

    // and http2 provides an asynchronous api but you can
    // pretend it is sync by just calling waitForCompletion
    auto response = request.waitForCompletion();

    // parseGarbage uses a few tricks to fixup invalid/broken HTML
    // tag soup and auto-detect character encodings, including when
    // it lies about being UTF-8 but is actually Windows-1252
    document.parseGarbage(response.contentText);

    // Uri.basedOn returns a new absolute URI based on something else
    foreach(a; document.querySelectorAll("a[href]"))
        writeln(Uri(a.href).basedOn(base));
}
Snippet of the printouts:
[...]
http://www.computerhistory.org
http://www.softwarepreservation.org/projects/c_plus_plus/
http://www.morganstanley.com/
http://www.cs.columbia.edu/
http://www.cse.tamu.edu
http://www.stroustrup.com/index.html
http://www.stroustrup.com/C++.html
http://www.stroustrup.com/bs_faq.html
http://www.stroustrup.com/bs_faq2.html
http://www.stroustrup.com/C++11FAQ.html
http://www.stroustrup.com/papers.html
[...]
The latter are relative links that were resolved against the base;
the first few were already absolute. Seems to have worked.
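To make the resolution step concrete, here's a minimal sketch of what
Uri.basedOn does in that loop, assuming Uri prints its resolved form
when handed to writeln (as it appears to in the output above):

import arsd.http2;
import std.stdio;

void main() {
    auto base = Uri("http://www.stroustrup.com/C++.html");

    // a relative link from the page resolves against the base...
    writeln(Uri("papers.html").basedOn(base));
    // expected: http://www.stroustrup.com/papers.html

    // ...while an already-absolute link should pass through unchanged
    writeln(Uri("http://www.computerhistory.org").basedOn(base));
    // expected: http://www.computerhistory.org
}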
There are other kinds of links than just a[href], but fetching them
is as simple as adding them to the selector or looping over them
separately too:
foreach(a; document.querySelectorAll("script[src]"))
writeln(Uri(a.src).basedOn(base));
There are none on that page, and no <link>s either, but it is easy
enough to do with the lib.
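For completeness, here's a hedged sketch of how one pass over the usual
resource-bearing tags could look with the same libs. The tag-to-attribute
map and the printResources name are mine, not part of the libraries:

import arsd.dom;
import arsd.http2;
import std.stdio;

// hypothetical helper, not part of arsd: walk the common
// resource-bearing tags and print each URL in absolute form
void printResources(Document document, Uri base) {
    // tag -> attribute that carries the resource URL
    string[string] targets = [
        "a": "href", "link": "href",
        "script": "src", "img": "src"
    ];
    foreach(tag, attr; targets)
        foreach(e; document.querySelectorAll(tag ~ "[" ~ attr ~ "]"))
            writeln(Uri(e.getAttribute(attr)).basedOn(base));
}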
Looking at the source of that page, I find some invalid HTML and
lies about the character set. How did Document.parseGarbage do?
Pretty well: outputting the parsed DOM tree shows it
auto-corrected the problems I can see by eye.
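As a quick illustration of that fixup behavior, here's a small
self-contained sketch; the broken markup is made up, and the exact
serialized output depends on arsd.dom's current fixup rules:

import arsd.dom;
import std.stdio;

void main() {
    auto document = new Document();

    // deliberately broken tag soup: unclosed <b>, stray </i>
    document.parseGarbage("<p>Hello <b>world</p><p>again</i></p>");

    // print the auto-corrected DOM tree as HTML
    writeln(document.toString());
}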