Re: [Lynx-dev] pse help.

David Woolley Thu, 11 Jun 2009 00:16:18 -0700

karsten harazim wrote:

wonder if it seems to be possible to extract information from existing websites into some Excel document, like extracting all names, addresses, phone numbers, email, url etc. from pages like that: http://www.muenster.de/schulen-alle-1.html

Technically, you need something like XSLT to do this, although you are rather dependent on the author actually writing HTML according to the true spirit of HTML, which is rather rare. You may need to convert the HTML to XML syntax before using XSLT.
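
As a rough illustration of the extraction step, here is a minimal sketch. It is not XSLT: it swaps in Python's standard-library html.parser, which tolerates HTML that is not well-formed XML, and it only collects link targets and mailto: addresses. The class name, function name, and sample markup are all hypothetical, and real pages will almost certainly need rules tailored to their markup. The output is CSV, which a spreadsheet program can open.

```python
import csv
import io
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (kind, value, label) for every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            is_mail = self._href.startswith("mailto:")
            kind = "email" if is_mail else "url"
            value = self._href[len("mailto:"):] if is_mail else self._href
            # Collapse runs of whitespace in the link text.
            label = " ".join("".join(self._text).split())
            self.rows.append((kind, value, label))
            self._href = None


def extract_to_csv(html_text):
    """Return CSV text with one row per link found in html_text."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["kind", "value", "label"])
    writer.writerows(parser.rows)
    return out.getvalue()


# Hypothetical sample markup, standing in for a saved page.
SAMPLE = """
<ul>
  <li><a href="http://school.example/">Example School</a>,
      <a href="mailto:office@school.example">office</a></li>
</ul>
"""

print(extract_to_csv(SAMPLE))
```

Names, postal addresses, and phone numbers are usually plain text rather than links, so pulling those out depends entirely on how consistently the page author marked them up.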

For the actual download, you would be better off using one of the specialist tools, like curl or wget.

However, actually doing so is likely to be illegal. Even if the information is a pure collection of facts, in countries like the UK, it would be covered by a database copyright. At least one reason why Lynx can get blocked from sites is that it is often used to extract information without the surrounding advertising/branding.


--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.


_______________________________________________
Lynx-dev mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/lynx-dev
