Jason wrote:
> Hi all,
>     I have come across what I think is a curious but insidious bug with
> the java lucene hit highlighter.
[...]
> when I search for -> Acquisition Plan <-
> in my search results I get:
> <summary>(ancilliary stuff deleted)....
> attached to the <em>Acquisition</em>
> < em>Plan</em>and signed</summary>
> 
> notice the space between the < and e in the second < em>

Sorry, Jason, I don't have a solution for you, but in case there's any
question about whether "< em>" is well-formed XML/XHTML/HTML:

1. It is not well-formed XML (and thus cannot be well-formed XHTML) -
from <http://www.w3.org/TR/xml/#sec-starttags>:

  [40] STag ::= '<' Name (S Attribute)* S? '>'
   [5] Name ::= (Letter | '_' | ':') (NameChar)*

("Letter" & "NameChar" declarations omitted - suffice to say whitespace
is excluded.)
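
A quick way to confirm point 1 is to hand both forms to an XML parser. The
sketch below uses Python's standard xml.etree.ElementTree module rather than
anything from Lucene; the helper name is_well_formed and the abbreviated
summary strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if `fragment` parses as a well-formed XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Abbreviated versions of the summary from Jason's report (hypothetical text).
expected = "<summary>the <em>Acquisition</em> <em>Plan</em>and signed</summary>"
reported = "<summary>the <em>Acquisition</em> < em>Plan</em>and signed</summary>"

print(is_well_formed(expected))  # True
print(is_well_formed(reported))  # False: whitespace after '<' is not a Name
```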


2. AFAICT (IANASG), SGML (and hence the [pre-XHTML] HTML profiles of it)
disallows space chars between the '<' and the element name (a.k.a.
"generic identifier") - from
<http://www.oasis-open.org/cover/sgmlsyn/sgmlsyn.htm#C7.4>:

 [14] start-tag =
        ( stago , <
          document type specification [28] ,
          generic identifier specification [29] ,
          attribute specification list [31] ,
          s [5] *,
          tagc ) | >
          minimized start-tag [15]
 [29] generic identifier specification =
          generic identifier [30] | rank stem [120]
 [30] generic identifier = name [55]
[120] rank stem = name [55]
 [55] name = name start character [53] , name character [52] *

(Note 1: "name" & "name start character" declarations omitted - suffice
to say whitespace is excluded.)

(Note 2: "document type specification" declaration omitted, because all
HTML profiles include the "CONCUR NO" option, thus excluding this syntax.)

(Note 3: "minimized start-tag" declaration omitted, because although all
HTML profiles include the "SHORTTAG YES" option, the
element-minimization aspects of this option [as distinct from attribute
minimization, e.g. omitted and unquoted attribute values] are not
supported by mainstream browsers; in any case, whitespace is disallowed
prior to generic identifiers in all of the minimized start tag forms.)
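
Since both grammars require the name to follow '<' immediately, highlighter
output can be scanned for the malformed form with a simple pattern. This is
only a rough sketch: it assumes element names are an ASCII letter followed by
letters, digits, '.' or '-', it ignores attributes, and it will also flag
ordinary text that happens to look like "< word>". The function name
find_spaced_start_tags is hypothetical.

```python
import re

# '<', then one or more whitespace characters, then a name, then '>'.
# Assumption: ASCII-only names, no attributes.
SPACED_START_TAG = re.compile(r"<\s+[A-Za-z][A-Za-z0-9.-]*\s*>")


def find_spaced_start_tags(text: str) -> list:
    """Return every substring that looks like a start tag with space after '<'."""
    return SPACED_START_TAG.findall(text)


sample = "attached to the <em>Acquisition</em>\n< em>Plan</em>and signed"
print(find_spaced_start_tags(sample))  # ['< em>']
```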


3. Firefox 2.0.0.1 and IE 7.0 on WinXP both render "< em>...</em>" as
literal "< em>..." - the (malformed) start tag is rendered as non-markup
plain text, and the close tag is not displayed.
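
The same pattern (start tag treated as text, end tag still recognized) can be
seen with a lenient HTML tokenizer. The sketch below uses Python's standard
html.parser module, which is not a browser engine and is only assumed here to
behave similarly for this input; the TokenLog class and tokenize helper are
made up for illustration.

```python
from html.parser import HTMLParser


class TokenLog(HTMLParser):
    """Record which pieces of the input are seen as tags and which as text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def tokenize(fragment: str) -> TokenLog:
    """Feed a fragment through the tokenizer and return the filled-in log."""
    log = TokenLog()
    log.feed(fragment)
    log.close()
    return log


log = tokenize("<em>Acquisition</em> < em>Plan</em>")
print(log.start_tags)     # only the first, correctly formed <em>
print(log.end_tags)       # both </em> tags
print("".join(log.text))  # the "< em>" survives as literal text
```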


Steve
