Hi - if an HTML doc doesn't have a line like the following in the header:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

... Zend_Search_Lucene_Document_Html::loadHTML seems to mangle the title & body rather than making an 'intelligent' guess at the encoding. FWIW, the docs state:

------
Zend_Search_Lucene_Document_Html class uses the DOMDocument::loadHTML() and DOMDocument::loadHTMLFile() methods to parse the source HTML, so it doesn't need HTML to be well formed or to be XHTML. On the other hand, it's sensitive to the encoding specified by the "meta http-equiv" header tag.
------

... I would think that "doesn't need HTML to be well formed" == "doesn't need a Content-Type tag in the header", especially since DOMDocument::loadHTML (the parser used by Zend_Search_Lucene_Document_Html::loadHTML) doesn't seem to need this set in order to parse the HTML title & body correctly ...
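
A minimal sketch of the plain DOMDocument check this refers to, assuming the PHP DOM extension is available (the variable names are only for illustration). It parses a document with no Content-Type meta tag and reads the title and body text back:

```php
<?php
// Test markup with no Content-Type meta tag in the head.
$html = '<HTML>'
      . '  <HEAD>'
      . '    <TITLE>This is the title</TITLE>'
      . '  </HEAD>'
      . '  <BODY>This is the body</BODY>'
      . '</HTML>';

$dom = new DOMDocument();
// Suppress libxml warnings about sloppy markup; loadHTML() returns a bool.
@$dom->loadHTML($html);

// Look up the first <title> and <body> elements, if the parser found any.
$title = $dom->getElementsByTagName('title')->item(0);
$body  = $dom->getElementsByTagName('body')->item(0);

$titleText = $title ? trim($title->nodeValue) : null;
$bodyText  = $body ? trim($body->nodeValue) : null;

var_dump($titleText, $bodyText);
```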

Should this be considered a bug, or is what I'm seeing the intended behavior? I'm currently crawling a number of internal (poorly formed) docs, and because of this I'm going to have to check each HTML doc for the tag and slip it in when it's missing ...
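
The "slip it in" workaround could look roughly like the sketch below. ensureContentTypeMeta() is a hypothetical helper, not part of Zend Framework, and it assumes utf-8 is the right charset for documents that don't declare one, which may not hold for every crawled doc:

```php
<?php
// Hypothetical helper (not a Zend Framework API): make sure the markup
// carries a Content-Type meta tag before it is handed to loadHTML().
// Assumption: $charset is the correct encoding for undeclared documents.
function ensureContentTypeMeta($html, $charset = 'utf-8')
{
    // Leave the document alone if an http-equiv="Content-Type" meta tag exists.
    if (preg_match('/<meta[^>]+http-equiv\s*=\s*["\']?content-type/i', $html)) {
        return $html;
    }

    $meta = '<meta http-equiv="Content-Type" content="text/html; charset='
          . $charset . '">';

    // Prefer inserting right after the first opening <head> tag.
    $patched = preg_replace('/<head\b[^>]*>/i', '$0' . $meta, $html, 1, $count);
    if ($count > 0) {
        return $patched;
    }

    // No <head> at all: prepend the tag and rely on the parser being lenient
    // (an assumption, not verified against every libxml version).
    return $meta . $html;
}
```

The patched string would then be passed on as Zend_Search_Lucene_Document_Html::loadHTML(ensureContentTypeMeta($html)). Whether that alone is enough to get a clean title & body is exactly what I'm asking about.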

An example to illustrate - the following snippet:

#====================================================================
require_once 'Zend/Search/Lucene.php';
$html =  '<HTML>'
        .'  <HEAD>'
        .'    <TITLE>This is the title</TITLE>'
        .'  </HEAD>'
        .'  <BODY>This is the body</BODY>'
        .'</HTML>';
$doc = Zend_Search_Lucene_Document_Html::loadHTML($html);
print_r($doc);
#====================================================================

... results in the following output:

Zend_Search_Lucene_Document_Html Object
(
<---- snip ---->
    [_fields:protected] => Array
        (
            [title] => Zend_Search_Lucene_Field Object
                (
                    [name] => title
                    [value] =>
<---- snip ---->

            [body] => Zend_Search_Lucene_Field Object
                (
                    [name] => body
                    [value] => HTML> This is the title This is the body
<---- snip ---->



