Re: [Lynx-dev] seporating main text from whole page

David Woolley Thu, 29 Mar 2007 22:22:54 -0800

> for my final project i need to write a program that enters several
> news-websites and copies only the text from the relevant reports.


That's probably illegal under copyright law and almost certainly
a breach of the terms of use of the web sites concerned.

> i hope u (guys? anyone??) can help me with few questions:
> 1. how do i know if a link is an advertise or a report?

Unfortunately, whilst the W3C HTML people are enthusiastic about
"the semantic web", it is generally to the advantage of commercial
web sites to confuse editorial and advertising as much as possible.
In commercial terms, the sites are there to carry the advertising, and
the editorial is just a tactic to get people to read the advertising.

In fact, going back to terms of use, stripping the editorial of advertising
is one of the worst offences that someone can commit when copying such sites,
to the extent that the sites often try to ban user agents that present
advertising-free views.

This is also part of a more general issue: web designers tend to treat
HTML as a vehicle for producing a display in IE, not as a means of accurately
marking up a document.

> 2. b\c of the diffrences of all source files there is no unification at how
> to recognize a text praragrph (report body in this case) is there a way?

The key term to research here is "microformats", but the whole subject
of semantic markup tends to be limited to more academic HTML coders, whereas
news sites are business ventures.
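
As a rough illustration of what microformat-based extraction could look
like, here is a minimal Python sketch. It assumes the page marks its article
body with the hAtom class name "entry-content", which most commercial news
sites will not do. The names EntryContentExtractor and extract_entry_text
are made up for this example, and the depth counting assumes reasonably
well-formed markup (unclosed <p> tags and similar tag soup will confuse it).

```python
from html.parser import HTMLParser

# Elements with no closing tag; they must not affect the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class EntryContentExtractor(HTMLParser):
    """Collect text inside elements whose class list has 'entry-content'.

    Assumption: the page uses the hAtom microformat class names.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside an entry-content element
        self.chunks = []  # text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "entry-content" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_entry_text(html):
    """Return the entry-content text with runs of whitespace collapsed."""
    parser = EntryContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    sample = (
        '<div class="ad">Buy now</div>'
        '<div class="hentry"><h1 class="entry-title">Title</h1>'
        '<div class="entry-content"><p>First <b>para</b>.</p>\n'
        '<p>Second.</p></div></div>'
    )
    print(extract_entry_text(sample))
```

On a page without that class name the function returns an empty string,
which is the common case and the reason this approach does not solve the
general problem.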



_______________________________________________
Lynx-dev mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/lynx-dev
