On Wed, 2004-02-11 at 10:34, Hugo Ahlenius wrote:
> Do you have any control over the sites archived? Are they XHTML?
I have control of the pages after they are collected, but not the end
sites. Without going into too much detail: it needs to collect pages
from a bunch of sites (more will be added in the future; that part is
done, btw), then pull out the main content (the largest amount of text,
in most cases), and finally search that text for keywords, save the
text, and run some stats on the information as well.
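
Roughly what the extraction and keyword steps could look like, as an
untested Python sketch (the function names are made up, and "largest
block between blank lines" is a crude stand-in for the real heuristic):

import re

def main_content(html):
    # Drop script/style bodies first, then every remaining tag.
    html = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', '', html)
    text = re.sub(r'(?s)<[^>]+>', ' ', html)
    # Treat blank lines in the page source as block boundaries and keep
    # the wordiest chunk -- crude, but matches the "most text" idea.
    blocks = re.split(r'\n\s*\n', text)
    return max(blocks, key=lambda b: len(b.split()), default='').strip()

def keyword_hits(text, keywords):
    # Case-insensitive counts, for the keyword/stats step.
    return {kw: len(re.findall(re.escape(kw), text, re.I)) for kw in keywords}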

At present I am leaning toward not stripping the HTML, but the problem
is that pages are coming in at about 34K each. This process currently
collects about 10MB a day, and that is going to get out of control
rather fast. The actual information needed is only about 4-5K per page.
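
For what it's worth, the arithmetic on those figures (all numbers taken
from above, nothing measured):

pages_per_day = (10 * 1024) // 34   # ~10MB/day at ~34K/page -> ~301 pages
kept_kb = pages_per_day * 4.5       # keep only the ~4-5K of real text
# -> ~1355K, i.e. roughly 1.3MB/day instead of 10MB/day

so storing only the extracted text would cut the archive by nearly an
order of magnitude.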

> If yes on the first question: enclose the relevant "main body" in div
> tags, with a relevant ID. Then, if they are XHTML, you can parse it as
> XML. If not, it might still be possible to apply some regexps to get the
> content (locate the start and end of the main body).
Yeah, that would be cool if I could control the other side :(
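
That said, the regexp half of the suggestion doesn't need control of
the far end: it could be a per-site table of patterns that bracket each
site's main body. A hedged sketch, with every site name and marker
below invented (and regexps over real-world HTML are fragile):

import re

SITE_PATTERNS = {
    'example.com': re.compile(r'(?s)<!-- begin content -->(.*?)<!-- end content -->'),
    'example.org': re.compile(r'(?s)<td class="body">(.*?)</td>'),
}

def extract_main_body(site, html):
    pat = SITE_PATTERNS.get(site)
    if pat:
        m = pat.search(html)
        if m:
            return m.group(1)
    return None  # no pattern or no match: fall back to the largest-block heuristic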

--
Farewell,
Rob

Immoderate luxury breeds madness.
Preserve a sound way of life!

http://www.rohanclan.com
http://treebeard.sourceforge.net
http://ashpool.sourceforge.net