[ https://issues.apache.org/jira/browse/COR-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jan iversen updated COR-20:
---------------------------
Assignee: Peter Kelly
> Write an XML/HTML parser
> ------------------------
>
> Key: COR-20
> URL: https://issues.apache.org/jira/browse/COR-20
> Project: Corinthia
> Issue Type: Improvement
> Components: DocFormats - core, DocFormats - platform
> Reporter: Peter Kelly
> Assignee: Peter Kelly
> Fix For: 0.5
>
>
> Currently we rely on libxml2 and HTML Tidy for parsing XML and HTML,
> respectively. In both cases we use only the parsing functions of these
> libraries, not other features such as the DOM tree.
> Parsing XML is not very difficult. HTML is somewhat harder because of all
> the ambiguities that arise from the poorly defined parsing rules in earlier
> versions of the spec ("make a best effort" became "replicate what Internet
> Explorer does" because almost every site violated the rules). However, the
> HTML5 spec now defines a proper parsing algorithm that deals with these
> ambiguities. We will also need to take into account which tags
> must have a corresponding close tag and which tags do not require one.
> Having our own parser will simplify dependencies a lot, particularly with the
> somewhat awkward HTML Tidy.
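
To illustrate the point about close tags: a minimal sketch, in Python for brevity rather than the project's own language, of how a parser might tell the tags that never take a close tag from those that can. The set below is the list of void elements from the HTML specification; the helper name `is_void_element` is hypothetical and not part of any existing code. Elements whose close tag is merely optional (such as `</p>` and `</li>`) are a separate case that this sketch does not cover.

```python
# Void elements per the HTML specification: these never have a close tag.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


def is_void_element(tag_name: str) -> bool:
    """Hypothetical helper: True if the tag must not have a close tag.

    HTML tag names are case-insensitive, so the name is lowercased
    before the lookup.
    """
    return tag_name.lower() in VOID_ELEMENTS
```

A parser could consult this check when it sees an open tag: a void element is closed immediately, while any other element is pushed onto the stack of open elements until its close tag (or an implied close) is found.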
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)