[
https://issues.apache.org/jira/browse/ANY23-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236410#comment-17236410
]
Lewis John McGibbney commented on ANY23-457:
--------------------------------------------
Using rover on the master branch, I cannot replicate this. After a few hours of
debugging and writing local unit tests, I am a bit puzzled.
The [following
code|https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/html/TagSoupParsingConfiguration.java#L62]
definitely skips over the DOCTYPE declaration:
{code:java}
private static Document convert(org.jsoup.nodes.Document document) {
    Document w3cDoc = new org.apache.html.dom.HTMLDocumentImpl();
    org.jsoup.nodes.Element rootEl = document.children().first(); // SKIPS DOCTYPE
    if (rootEl != null) {
        NodeTraversor.traverse(new DocumentConverter(w3cDoc), rootEl);
    }
    return w3cDoc;
}
{code}
However, I am currently unable to reproduce the bug described below. Closing
this until I encounter it again.
> Fix error: White spaces are required between publicId and systemId
> ------------------------------------------------------------------
>
> Key: ANY23-457
> URL: https://issues.apache.org/jira/browse/ANY23-457
> Project: Apache Any23
> Issue Type: Bug
> Components: fix, rule
> Affects Versions: 2.4
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 2.5
>
>
> This problem is encountered when we attempt to parse the following HTML pages:
> https://www-robotics.jpl.nasa.gov/links/index.cfm
> https://www-robotics.jpl.nasa.gov/patents/index.cfm
> ERROR rdf.BaseRDFExtractor - Error while parsing RDF document.
> White spaces are required between publicId and systemId
> If one looks at the HTML source you will see the following
> {code:html}
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> {code}
> Reading [this article|https://stackoverflow.com/a/9225499], it looks like we
> may be able to create a rule and 'fix' that would produce the following:
> {code:html}
> <!-- Notice the addition of "" -->
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">
> <html>
> <head>
> {code}
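
For illustration only, the 'fix' proposed in the description could be sketched as a plain string rewrite along the following lines. This is a minimal sketch, not Any23 code: the class name DoctypeSystemIdFix and the method addEmptySystemId are hypothetical, and it assumes the markup is available as a String before it reaches the XML-based parser. It only touches a DOCTYPE that has a PUBLIC identifier and no system identifier.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Any23: appends an empty system identifier
// to a DOCTYPE that declares a PUBLIC identifier but no system identifier.
final class DoctypeSystemIdFix {

    // Group 1 captures everything up to and including the quoted public ID.
    // The pattern only matches when the public ID is followed directly by '>'.
    private static final Pattern PUBLIC_ONLY_DOCTYPE = Pattern.compile(
            "(<!DOCTYPE\\s+[^\\s>]+\\s+PUBLIC\\s+(?:\"[^\"]*\"|'[^']*'))\\s*>",
            Pattern.CASE_INSENSITIVE);

    private DoctypeSystemIdFix() {
    }

    // Returns the input with ' ""' inserted before the closing '>' of the
    // first matching DOCTYPE, or the input unchanged if nothing matches.
    static String addEmptySystemId(String html) {
        return PUBLIC_ONLY_DOCTYPE.matcher(html).replaceFirst("$1 \"\">");
    }

    public static void main(String[] args) {
        String html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n"
                + "<html>\n<head>";
        System.out.println(addEmptySystemId(html));
    }
}
```

A DOCTYPE that already carries a system identifier, or one without PUBLIC (such as the HTML5 form), does not match the pattern and is returned as-is. Whether such a rewrite belongs in a rule/fix pair or elsewhere in the pipeline is left open here.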
--
This message was sent by Atlassian Jira
(v8.3.4#803005)