Re: parsing HTML for a web robot (crawler) like application

Adam D. Ruppe via Digitalmars-d-learn Wed, 23 Mar 2016 21:06:07 -0700

On Wednesday, 23 March 2016 at 10:49:03 UTC, Nordlöw wrote:

> HTML-docs here:
>
> http://dpldocs.info/experimental-docs/arsd.dom.html

Indeed, though the docs are still a work in progress (the lib is now about 6 years old, but until recently, ddoc blocked me from using examples in the comments so I didn't bother. I've fixed that now though, but haven't finished writing them all up).



Basic idea though for web scraping:

import arsd.dom;

auto document = new Document();
document.parseGarbage(your_html_string);

// supports most of the CSS selector syntax, which you might also know from jQuery
Element[] elements = document.querySelectorAll("css selector");
// or, if you just want the first hit (null if there is none):
Element element = document.querySelector("css selector");


And once you have a reference:

element.innerText
element.innerHTML

to print its contents in some form.

You can do a lot more too (a LOT more), but just these functions should get you started.

The parseGarbage function will also need you to compile in the characterencodings.d file from the same GitHub repository. It handles charset detection and translation as well as tag soup parsing. I use it for a lot of web scraping myself.
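
Putting the pieces above together, a minimal sketch of a link extractor might look like the following. The helper name extractLinks, the sample markup, and the "a[href]" selector are assumptions made for illustration; only Document, parseGarbage, querySelectorAll, querySelector, innerText and innerHTML come from the post, and getAttribute is assumed to be the usual attribute accessor on Element.

```d
import std.stdio;
import arsd.dom;

/// Hypothetical helper: returns the href of every <a href=...> in the HTML.
string[] extractLinks(string html) {
    auto document = new Document();
    document.parseGarbage(html); // tolerant, tag-soup parsing

    string[] hrefs;
    foreach (link; document.querySelectorAll("a[href]"))
        hrefs ~= link.getAttribute("href");
    return hrefs;
}

void main() {
    // Made-up sample markup; a real crawler would fetch this over HTTP.
    string html = `<html><body>
        <p class="intro">Hello <b>world</b></p>
        <ul>
            <li><a href="/a">First</a></li>
            <li><a href="/b">Second</a></li>
        </ul>
        </body></html>`;

    // Print every link target found in the page.
    foreach (href; extractLinks(html))
        writeln(href);

    // querySelector returns null when nothing matches, so check first.
    auto document = new Document();
    document.parseGarbage(html);
    auto intro = document.querySelector("p.intro");
    if (intro !is null) {
        writeln(intro.innerText); // text content only
        writeln(intro.innerHTML); // markup inside the element
    }
}
```

To build it, one option (assuming dom.d and characterencodings.d have been copied next to a file called scraper.d) is to pass all three files to the compiler, e.g. dmd scraper.d dom.d characterencodings.d. Newer versions of the library may need additional files.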
