On Aug 13, 2010, at 2:06am, Andrzej Bialecki wrote:

On 2010-08-13 10:34, Jukka Zitting wrote:
Hi,

On Thu, Aug 12, 2010 at 8:27 PM, Ken Krugler
<[email protected]>  wrote:
I think I'm missing something - which javadocs are you referring to here?
What I see for startDocument() is:

   /**
    * Starts an XHTML document by setting up the namespace mappings.
    * The standard XHTML prefix is generated lazily when the first
    * element is started.
    */

I guess the "standard XHTML prefix" is a bit vague here... Mea culpa.
The intention was that XHTMLContentHandler would provide everything up
to the opening <body> tag when startDocument() is called.

I saw your note on the issue in Jira:
[...]
This would work for <meta>, but not <link> or <base>.

I'd argue that we shouldn't output the <base> element. Instead we
should normalize all URLs before giving them out to the client.

Normalization rules may depend on the situation... we could provide a sensible default, but I think it's safer to delegate this decision to a component that you can override, because in the general case normalization rules may be quite complex.

Example 1: you access a page from www.ibm.com/index.html, which redirects to www-8.ibm.com/index.html for load-balancing. The retrieved page may contain <base> that points back to www.ibm.com - again, to ensure proper load-balancing. In this case, base href != page URL. Now, how do you normalize the links from the retrieved page? (at some point in time this was a real case with this real site ;) ).

Example 2: <base> is http://a.com/index.html/index.html/index.html (which is related to a known bug in some HTTP servers), and the outlink is ../services.html. How do you normalize this?

Of course, you can come up with some sensible defaults in each case, but my point is that this issue is complicated, and there should be a way to redefine this behavior.
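To make the "component that you can override" idea concrete, here is a minimal sketch. LinkNormalizer, DefaultLinkNormalizer and PageUrlLinkNormalizer are hypothetical names made up for illustration; nothing like them is claimed to exist in Tika. The default simply resolves the link against <base> when one is present, using java.net.URI, and the subclass shows one possible override for the load-balancing case in Example 1.

```java
import java.net.URI;

/** Hypothetical extension point, not part of any existing API. */
interface LinkNormalizer {
    /** baseHref may be null when the page has no <base> element. */
    String normalize(String pageUrl, String baseHref, String link);
}

/** Assumed default: resolve against <base> if present, else the page URL. */
class DefaultLinkNormalizer implements LinkNormalizer {
    @Override
    public String normalize(String pageUrl, String baseHref, String link) {
        String base = (baseHref != null) ? baseHref : pageUrl;
        // URI.create throws IllegalArgumentException on malformed input;
        // real-world links would need cleanup before reaching this call.
        return URI.create(base).resolve(link).toString();
    }
}

/** One possible override: ignore <base> and trust the fetched URL. */
class PageUrlLinkNormalizer extends DefaultLinkNormalizer {
    @Override
    public String normalize(String pageUrl, String baseHref, String link) {
        return super.normalize(pageUrl, null, link);
    }
}
```

With the default, Example 2 resolves literally: ../services.html against http://a.com/index.html/index.html/index.html gives http://a.com/index.html/services.html, which is probably not what the site intended. That is the kind of case where a caller would want to plug in its own rules.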

I think Julien's idea about pushing more/most of this down into the HtmlMapper makes sense, as that feels like the only way to really give appropriate control over this behavior in a way that can be easily subclassed.

It's a bigger architectural change than what I have time for right now, so currently I'm extending the existing architecture to work around specific issues I'm hitting.

I did take Jukka's advice and emit all metadata elements in the resulting XHTML's <head> section. This provides better support for other parsers besides HTML, though it means that the resulting HTML can look a bit funky right now - for example, you will often get two <meta> tags, one for "Content-Type" and the other for "content-type", because HtmlHandler is remapping a <meta http-equiv> element. I've got that on my list to resolve.
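For what it's worth, one way the duplicate "Content-Type" / "content-type" entries could be collapsed is to treat metadata names as case-insensitive before emitting the <meta> elements. This is only a sketch of that idea, not what HtmlHandler does today; MetaNames and mergeIgnoringCase are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

class MetaNames {
    /**
     * Merges {name, value} pairs, treating names that differ only by
     * case as the same entry. The first spelling seen is kept as the
     * key and the last value written wins.
     */
    static Map<String, String> mergeIgnoringCase(String[][] pairs) {
        Map<String, String> merged = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String[] pair : pairs) {
            merged.put(pair[0], pair[1]);
        }
        return merged;
    }
}
```

Whether the last value should win, or the two values should be reconciled some other way, is a policy decision this sketch does not try to settle.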

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



