On Saturday, 10 January 2015 at 17:23:31 UTC, Ola Fosheim Grøstad wrote:
For the challenge to make sense it would entail parsing all legal HTML5 documents, extracting all resource links, converting them into absolute form and printing them one per line. With no hiccups.

Though, that's still a library thing rather than a language thing.

dom.d and the Url struct in cgi.d should be able to do all that, in just a few lines even, but only because I've done a *lot* of web scraping with these libs before, so I made them work for that.

In fact.... let me just do it. I'll use my http2.d instead of cgi.d, actually; it has a similar Uri struct, just more focused on client requests.


import arsd.dom;
import arsd.http2;
import std.stdio;

void main() {
        auto base = Uri("http://www.stroustrup.com/C++.html");
        // http2 is a newish module of mine that aims to imitate
        // a browser in some ways (without depending on curl btw)
        auto client = new HttpClient();
        auto request = client.navigateTo(base);
        auto document = new Document();

        // and http2 provides an asynchronous api but you can
        // pretend it is sync by just calling waitForCompletion
        auto response = request.waitForCompletion();

        // parseGarbage uses a few tricks to fixup invalid/broken HTML
        // tag soup and auto-detect character encodings, including when
        // it lies about being UTF-8 but is actually Windows-1252
        document.parseGarbage(response.contentText);

        // Uri.basedOn returns a new absolute URI based on something else
        foreach(a; document.querySelectorAll("a[href]"))
                writeln(Uri(a.href).basedOn(base));
}
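
(If you actually want the async behavior instead of pretending it is sync, something like this should be the shape of it, off the top of my head and untested: send() kicks the request off without blocking, and you only block when you finally ask for the result.)

        auto request = client.navigateTo(base);
        request.send(); // starts the transfer, returns right away
        // ... do other work while the request runs ...
        auto response = request.waitForCompletion(); // block only here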


A snippet of the output from the full program above:

[...]
http://www.computerhistory.org
http://www.softwarepreservation.org/projects/c_plus_plus/
http://www.morganstanley.com/
http://www.cs.columbia.edu/
http://www.cse.tamu.edu
http://www.stroustrup.com/index.html
http://www.stroustrup.com/C++.html
http://www.stroustrup.com/bs_faq.html
http://www.stroustrup.com/bs_faq2.html
http://www.stroustrup.com/C++11FAQ.html
http://www.stroustrup.com/papers.html
[...]

The latter were relative links that it resolved against the base URI; the first few were already absolute. Seems to have worked.
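
To spell out what basedOn is doing there (a quick sketch, with made-up hrefs for illustration):

        auto base = Uri("http://www.stroustrup.com/C++.html");

        // a relative href gets resolved against the base document
        writeln(Uri("papers.html").basedOn(base));
        // prints: http://www.stroustrup.com/papers.html

        // an already-absolute href comes through unchanged
        writeln(Uri("http://www.computerhistory.org").basedOn(base));
        // prints: http://www.computerhistory.org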


There are other kinds of links besides a[href], but fetching them is as simple as adding them to the selector or looping over them separately:

        foreach(a; document.querySelectorAll("script[src]"))
                writeln(Uri(a.src).basedOn(base));

There were none on that page, and no <link>s either, but it is easy enough to do with the lib.
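
And if you would rather grab them all in one pass, a grouped selector should work too, since querySelectorAll takes comma-separated selector groups like CSS (untested sketch):

        foreach(e; document.querySelectorAll("a[href], script[src], link[href], img[src]")) {
                // take whichever link attribute this element actually has
                auto link = e.hasAttribute("href") ? e.getAttribute("href") : e.getAttribute("src");
                writeln(Uri(link).basedOn(base));
        }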



Looking at the source of that page, I found some invalid HTML, and it lies about the character set. How did Document.parseGarbage do? Pretty well: printing out the parsed DOM tree shows it auto-corrected the problems I spotted by eye.
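
You can see the same thing on a tiny made-up fragment:

        auto document = new Document();
        // dangling <b> and unclosed <p>s, on purpose
        document.parseGarbage("<html><body><p>one <b>two<p>three");
        // the parser closes them up for us
        writeln(document.toString());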
