[
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887716#action_12887716
]
Julien Nioche commented on TIKA-463:
------------------------------------
creating a LinksHtmlMapper : +1, that would be a nice intermediate between the
default mapper and the identity mapper
handling of links in mapper : mapSafeAttribute() returns a normalised
representation of the attribute names that are allowed but does not affect the
value of the attributes. Maybe we should change the method so that it returns
BOTH the normalised name (or null if the attribute must be skipped) and the
corresponding normalised value (e.g. the resolved URL) given a name/value
pair. The mapper implementation could then manage the resolution of the URLs
internally. This would also be useful for normalising the names and values of
elements in the header such as http-equiv.
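To make the idea concrete, here is a rough sketch of what such a method could
look like. It is only an illustration, not the current Tika API: the class
name, the three-argument signature, the Map.Entry return type and the two
attribute sets are all assumptions made up for this example.

```java
import java.net.URI;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical mapper: not part of Tika, names and signature are assumptions.
public class LinkAttributeMapperSketch {

    // Assumed whitelist of attributes that are let through.
    private static final Set<String> SAFE_ATTRIBUTES =
            Set.of("href", "src", "alt", "title");

    // Assumed subset of attributes whose values are URLs to be resolved.
    private static final Set<String> URL_ATTRIBUTES = Set.of("href", "src");

    private final URI base;

    public LinkAttributeMapperSketch(URI base) {
        this.base = base;
    }

    /**
     * Returns the normalised name and value of the attribute,
     * or null if the attribute must be skipped.
     */
    public Map.Entry<String, String> mapSafeAttribute(
            String elementName, String name, String value) {
        String normalisedName = name.toLowerCase(Locale.ROOT);
        if (!SAFE_ATTRIBUTES.contains(normalisedName)) {
            return null;
        }
        String normalisedValue = value;
        if (URL_ATTRIBUTES.contains(normalisedName)) {
            try {
                // Resolve relative URLs against the document base.
                normalisedValue = base.resolve(value.trim()).toString();
            } catch (IllegalArgumentException e) {
                // Malformed URL: keep the raw value rather than failing.
            }
        }
        return new SimpleImmutableEntry<>(normalisedName, normalisedValue);
    }
}
```

With this shape the handler never needs to know which attributes carry URLs;
it just emits whatever name/value pair the mapper returns, and skips the
attribute when the result is null.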
HtmlParser as an abstract class : what about following Jukka's suggestion for
Handlers in https://issues.apache.org/jira/browse/TIKA-458 and have a Factory?
As for frames, it raises another issue (see
https://issues.apache.org/jira/browse/TIKA-457) which is that anything outside
<body> and <head> is currently discarded by the HTMLMapper. This is why I
considered doing TIKA-458 but maybe we could make the HTMLHandler more generic
and delegate the decisions to the Mappers e.g. by adding a method isBody().
The body level is currently used to :
a) distinguish the elements in the header
b) determine where characters should be added to the text of the document
Do we really need (a)? Are elements such as LINK, BASE or META found anywhere
outside the HEAD? Should mapSafeElement() take into account the path of an
element as well, e.g. to allow a <link> only if its parent is <head>?
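As a rough sketch of the path-aware variant, assuming the handler passes the
name of the parent element along with the element name. The class name, the
two-argument signature and both element sets are hypothetical and only serve
to illustrate the question:

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical mapper: not part of Tika, names and signature are assumptions.
public class PathAwareElementMapperSketch {

    // Assumed whitelist of elements that are let through.
    private static final Set<String> SAFE_ELEMENTS =
            Set.of("a", "p", "img", "link", "base", "meta");

    // Assumed set of elements that are only accepted directly under <head>.
    private static final Set<String> HEAD_ONLY_ELEMENTS =
            Set.of("link", "base", "meta");

    /**
     * Returns the normalised element name, or null if the element
     * must be skipped given its parent.
     */
    public String mapSafeElement(String name, String parentName) {
        String normalisedName = name.toLowerCase(Locale.ROOT);
        if (!SAFE_ELEMENTS.contains(normalisedName)) {
            return null;
        }
        // equalsIgnoreCase(null) is false, so a missing parent is rejected too.
        if (HEAD_ONLY_ELEMENTS.contains(normalisedName)
                && !"head".equalsIgnoreCase(parentName)) {
            return null;
        }
        return normalisedName;
    }
}
```

Passing only the parent keeps the signature small; a full path would be needed
if the decision had to depend on more distant ancestors.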
> HtmlParser doesn't extract links from img, map, object, frame, iframe, area,
> link
> ---------------------------------------------------------------------------------
>
> Key: TIKA-463
> URL: https://issues.apache.org/jira/browse/TIKA-463
> Project: Tika
> Issue Type: Bug
> Reporter: Ken Krugler
> Assignee: Ken Krugler
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd
> want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges
> in the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants,
> then all of the above are valid, and thus should be emitted by the parser,