On 31/3/06 3:08 PM, "Antone Roundy" <[EMAIL PROTECTED]> wrote:

>> The escaped HTML content contained within the content element that
>> David was originally concerned with is more than likely a copy of
>> all or part of the elements and content contained inside the body
>> tag of the external document referenced by an associated link
>> element, and therefore there is no guarantee that the xml:base of the atom
>> feed is going to be anywhere even close to accurate.

I'm doing something similar right now, scraping a website that doesn't
provide feeds for what I want. I check the HTML of the page I scraped, and if
it has a <base> I use that; otherwise I use the URL I used to fetch the page.
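
Roughly, the check looks like the sketch below. This is only an illustration using the Python standard library, not my actual scraper; the names BaseHrefFinder and effective_base are made up for the example, and it assumes a <base href> may itself be relative, so it is resolved against the fetch URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Remembers the href of the first <base> element that has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def effective_base(page_html, fetch_url):
    """Return the page's <base href> if present, else the URL it was fetched from."""
    finder = BaseHrefFinder()
    finder.feed(page_html)
    if finder.base_href is None:
        return fetch_url
    # A <base href> can be relative; resolve it against the fetch URL.
    return urljoin(fetch_url, finder.base_href)
```

The tolerant parser matters here because the pages are tag soup; a strict XML parser would likely choke on them.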

The tag soup I extract for each entry contains relative references. I really
don't want to go fixing that tag soup, so I just stick that base URL into
xml:base on each entry (and not just at the top of the feed, because I'm
scraping paginated results, so the base can differ from entry to entry).
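
For illustration, putting the base on each entry could be sketched like this with Python's xml.etree.ElementTree. The function name make_entry and its parameters are hypothetical, and the entry is deliberately incomplete (a real Atom entry also needs id and updated); the point is only where xml:base goes and that the tag soup is escaped untouched.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize Atom elements without a prefix (default namespace).
ET.register_namespace("", ATOM_NS)


def make_entry(title, link, tag_soup, base_url):
    """Build a partial atom:entry whose xml:base is the scraped page's base URL."""
    # xml:base sits on the entry itself, so each page of results
    # can carry its own base.
    entry = ET.Element(f"{{{ATOM_NS}}}entry", {f"{{{XML_NS}}}base": base_url})
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", {"rel": "alternate", "href": link})
    # type="html" means the content is escaped markup; ElementTree escapes
    # the tag soup on output, relative references and all.
    content = ET.SubElement(entry, f"{{{ATOM_NS}}}content", {"type": "html"})
    content.text = tag_soup
    return entry
```

A consumer that honours xml:base then resolves something like href="img/x.png" against the entry's base rather than against the feed's own URL.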

e.
