On Fri, Mar 4, 2011 at 11:50 PM, Peter Desjardins <peter.desjardins.us@gmail.com> wrote:

> On Fri, Mar 4, 2011 at 11:02 AM, Kasun Gajasinghe
> <[email protected]> wrote:
>
> > The main issue with HTML is with the html-search feature. To properly
> > retrieve the content text excluding the html-tags, the html files should
> > be in a proper format. Strict XML is the standard way for this. That's the
> > concern here. I haven't encountered any other major issue in switching to
> > html!
> > Looking at your mail, I assume you are switching from html to xhtml,
> > right? If so, have you encountered any concerns that need some major
> > effort? If so, tell us about it, and we'll see about the possibility of
> > supporting the html format too.
>
> I switched from your default XHTML to HTML. I didn't see any problems
> and I tried searching for a few terms. The search feature seemed to
> work properly. Maybe XHTML isn't required for the webhelp format at
> all?
>

The HTML tree is not a well-formed XML tree, so there will be traversal
issues if the HTML is parsed with an XML parser. The search would still
work, but only partially: some content may not get indexed, and that
content won't appear in the search results. It's something like what you
said in the 3rd para of the first post about looking for the </a> tag for
the <a/> tag! You can't test this by searching for just a *few* queries.

But from what I have seen, the amount of un-indexed content for HTML is
fairly low, so you can depend on it with a small amount of error. On the
other hand, XHTML is completely based on XML, so the SAX XML parser has no
issue in parsing the content.
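
To make the difference concrete, here is a minimal sketch in Python (not
the webhelp indexer itself, which I'm not reproducing here). It uses the
standard-library SAX parser; the names TextCollector and extract_text and
the two sample strings are made up for illustration. The only difference
between the samples is <br/> versus <br>.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler: collects character data and ignores tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def extract_text(markup):
    """Return the text content, or None if the markup is not well-formed XML."""
    handler = TextCollector()
    try:
        xml.sax.parseString(markup.encode("utf-8"), handler)
    except xml.sax.SAXParseException:
        # A strict XML parser gives up at the first well-formedness error.
        return None
    return "".join(handler.chunks)


# Illustrative inputs (assumptions, not taken from real webhelp output).
xhtml = "<html><body><p>Install the plugin<br/>then restart.</p></body></html>"
html = "<html><body><p>Install the plugin<br>then restart.</p></body></html>"

print(extract_text(xhtml))  # text of the page
print(extract_text(html))   # None: the unclosed <br> breaks the XML parse
```

In this sketch the whole document is dropped on the first error; a real
indexer may recover part of the content, which would match the "some
contents won't get indexed" behaviour described above.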

There are some tools out there that parse dirty HTML and retrieve its
whole content, but a lot of the good ones don't have a license compatible
with DocBook. HtmlCleaner looks like a good solution for adding support for
indexing/searching *html* files though. So full support for HTML should
come!
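
As a rough idea of what a lenient parser buys us, here is a small sketch.
It does not use HtmlCleaner; it uses Python's standard-library
html.parser as a stand-in, and the names LenientTextExtractor and
extract_text_lenient are hypothetical. The point is only that a tolerant
parser still yields the text when tags are left unclosed.

```python
from html.parser import HTMLParser


class LenientTextExtractor(HTMLParser):
    """Hypothetical extractor: keeps text nodes and tolerates unclosed tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text_lenient(markup):
    """Return whitespace-normalized text from possibly dirty HTML."""
    parser = LenientTextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the text nodes with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


# Illustrative dirty input: unclosed <br>, unclosed <p>, missing </html>.
dirty = "<html><body><p>Install the plugin<br>then restart.<p>Done</body>"
print(extract_text_lenient(dirty))
```

Note that this sketch would also pick up the contents of script and style
elements as text; a real indexer would want to skip those.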

Regards,
--Kasun

-- 
~~~*******'''''''''''''*******~~~
Kasun Gajasinghe,
University of Moratuwa,
Sri Lanka.
Blog: http://kasunbg.blogspot.com
Twitter: http://twitter.com/kasunbg
