On Saturday, 10 January 2015 at 17:23:31 UTC, Ola Fosheim Grøstad wrote:
> For the challenge to make sense it would entail parsing all
> legal HTML5 documents, extracting all resource links,
> converting them into absolute form and printing them one per
> line. With no hiccups.
Though, that's still a library thing rather than a language thing.
dom.d and the Url struct in cgi.d should be able to do all that,
in just a few lines even, but that's just because I've done a
*lot* of web scraping with the libs before so I made them work
for that.
In fact... let me do it. I'll use my http2.d instead of
cgi.d, actually; it has a similar Url struct, just more focused
on client requests.
import arsd.dom;
import arsd.http2;

import std.stdio;

void main() {
    auto base = Uri("http://www.stroustrup.com/C++.html");

    // http2 is a newish module of mine that aims to imitate
    // a browser in some ways (without depending on curl btw)
    auto client = new HttpClient();
    auto request = client.navigateTo(base);

    auto document = new Document();

    // and http2 provides an asynchronous api but you can
    // pretend it is sync by just calling waitForCompletion
    auto response = request.waitForCompletion();

    // parseGarbage uses a few tricks to fixup invalid/broken HTML
    // tag soup and auto-detect character encodings, including when
    // it lies about being UTF-8 but is actually Windows-1252
    document.parseGarbage(response.contentText);

    // Uri.basedOn returns a new absolute URI based on something else
    foreach(a; document.querySelectorAll("a[href]"))
        writeln(Uri(a.href).basedOn(base));
}
Snippet of the printouts:
[...]
http://www.computerhistory.org
http://www.softwarepreservation.org/projects/c_plus_plus/
http://www.morganstanley.com/
http://www.cs.columbia.edu/
http://www.cse.tamu.edu
http://www.stroustrup.com/index.html
http://www.stroustrup.com/C++.html
http://www.stroustrup.com/bs_faq.html
http://www.stroustrup.com/bs_faq2.html
http://www.stroustrup.com/C++11FAQ.html
http://www.stroustrup.com/papers.html
[...]
The latter are relative links that were resolved against the base;
the first few were already absolute. Seems to have worked.
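To make the resolution step concrete, here's a minimal sketch of what
Uri.basedOn does in that loop, assuming Uri prints its resolved form
when handed to writeln (as it appears to in the output above):

import arsd.http2;
import std.stdio;

void main() {
    auto base = Uri("http://www.stroustrup.com/C++.html");

    // a relative link from the page resolves against the base...
    writeln(Uri("papers.html").basedOn(base));
    // expected: http://www.stroustrup.com/papers.html

    // ...while an already-absolute link should pass through unchanged
    writeln(Uri("http://www.computerhistory.org").basedOn(base));
    // expected: http://www.computerhistory.org
}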
There are other kinds of links than just a[href], but fetching them
is as simple as adding them to the selector or looping over them
separately too:
foreach(a; document.querySelectorAll("script[src]"))
writeln(Uri(a.src).basedOn(base));
There are none on that page, and no <link>s either, but it is easy
enough to do with the lib.
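For completeness, here's a hedged sketch of how one pass over the usual
resource-bearing tags could look with the same libs. The tag-to-attribute
map and the printResources name are mine, not part of the libraries:

import arsd.dom;
import arsd.http2;
import std.stdio;

// hypothetical helper, not part of arsd: walk the common
// resource-bearing tags and print each URL in absolute form
void printResources(Document document, Uri base) {
    // tag -> attribute that carries the resource URL
    string[string] targets = [
        "a": "href", "link": "href",
        "script": "src", "img": "src"
    ];
    foreach(tag, attr; targets)
        foreach(e; document.querySelectorAll(tag ~ "[" ~ attr ~ "]"))
            writeln(Uri(e.getAttribute(attr)).basedOn(base));
}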
Looking at the source of that page, I find some invalid HTML and
lies about the character set. How did Document.parseGarbage do?
Pretty well: outputting the parsed DOM tree shows it
auto-corrected the problems I can see by eye.
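As a quick illustration of that fixup behavior, here's a small
self-contained sketch; the broken markup is made up, and the exact
serialized output depends on arsd.dom's current fixup rules:

import arsd.dom;
import std.stdio;

void main() {
    auto document = new Document();

    // deliberately broken tag soup: unclosed <b>, stray </i>
    document.parseGarbage("<p>Hello <b>world</p><p>again</i></p>");

    // print the auto-corrected DOM tree as HTML
    writeln(document.toString());
}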