> for my final project i need to write a program that enters several
> news-websites and copies only the text from the relevant reports.

That's probably illegal under copyright law, and almost certainly in breach of the terms of use of the web sites concerned.

> i hope u (guys? anyone??) can help me with few questions:
> 1. how do i know if a link is an advertise or a report?

Unfortunately, whilst the W3C HTML people are enthusiastic about "the semantic web", it is generally to the advantage of commercial web sites to blur editorial and advertising as much as possible: in commercial terms, the sites exist to carry the advertising, and the editorial is just a tactic to get people to read it. In fact, going back to terms of use, stripping the editorial of advertising is one of the worst offences that someone can commit when copying such sites, to the extent that they often try to ban user agents that present advertising-free views.

This is also part of a more general issue: web designers tend to treat HTML as a vehicle for producing a display in IE, not as a means of accurately marking up a document.

> 2. b\c of the diffrences of all source files there is no unification at how
> to recognize a text praragrph (report body in this case) is there a way?

The key term to research here is "microformats", but the whole subject of semantic markup tends to be limited to more academic HTML coders, whereas news sites are business ventures.

_______________________________________________
Lynx-dev mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/lynx-dev
