[ https://issues.apache.org/jira/browse/NUTCH-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2880:
-----------------------------------
    Labels: help-wanted  (was: )

> parse-html/tika: update/complete HTML elements to extract outlinks from
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2880
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2880
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser, plugin
>    Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Priority: Major
>              Labels: help-wanted
>             Fix For: 1.19
>
>
> The list of HTML elements from which outlinks are extracted (in
> [DOMContentUtils (parse-html)|https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java]
> and
> [DOMContentUtils (parse-tika)|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java])
> needs to be updated and completed to include elements common in HTML5. Cf.
> a [related question on Stack Overflow about the <object> element|https://stackoverflow.com/questions/68024834/nutchsolr-how-do-you-index-a-pdf-embedded-in-html].
> A (mostly?) up-to-date list of HTML elements could be taken from the
> [extractor of iipc/webarchive-commons|https://github.com/iipc/webarchive-commons/blob/26b1e7af27abec102ab36faf6a786dfedf9436fd/src/main/java/org/archive/resource/html/ExtractingParseObserver.java#L49].

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
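
To illustrate the kind of element list the issue asks for, here is a minimal sketch of a table that maps HTML element names to the attribute carrying a link, applied to a DOM tree. It is not the code in Nutch's DOMContentUtils: the class name `OutlinkElementSketch`, the table `LINK_ATTRS`, and both method names are hypothetical, and the element selection is only an assumed sample of standard HTML5 link-bearing attributes, not the list from webarchive-commons. For self-containment it parses well-formed XHTML with the JDK's XML parser rather than a real HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class OutlinkElementSketch {

  // Hypothetical table: element name -> attribute holding a link.
  // An assumed sample for illustration, NOT the list used by Nutch.
  static final Map<String, String> LINK_ATTRS = new LinkedHashMap<>();
  static {
    LINK_ATTRS.put("a", "href");
    LINK_ATTRS.put("area", "href");
    LINK_ATTRS.put("link", "href");
    LINK_ATTRS.put("img", "src");
    LINK_ATTRS.put("script", "src");
    LINK_ATTRS.put("iframe", "src");
    LINK_ATTRS.put("embed", "src");
    LINK_ATTRS.put("object", "data"); // the <object> case raised in the issue
    LINK_ATTRS.put("video", "src");
    LINK_ATTRS.put("audio", "src");
    LINK_ATTRS.put("source", "src");
    LINK_ATTRS.put("track", "src");
  }

  // Collects raw (unresolved) link targets in document order.
  static List<String> extractLinks(Node root) {
    List<String> links = new ArrayList<>();
    collect(root, links);
    return links;
  }

  private static void collect(Node node, List<String> links) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element el = (Element) node;
      String attr = LINK_ATTRS.get(el.getTagName().toLowerCase(Locale.ROOT));
      if (attr != null && el.hasAttribute(attr)) {
        String value = el.getAttribute(attr).trim();
        if (!value.isEmpty()) {
          links.add(value);
        }
      }
    }
    // Recurse into children so nested elements such as <source> are visited.
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), links);
    }
  }

  // Convenience wrapper for the sketch: input must be well-formed XHTML.
  static List<String> extractLinksFromXhtml(String xhtml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
      return extractLinks(doc.getDocumentElement());
    } catch (Exception e) {
      throw new IllegalArgumentException("not well-formed XHTML", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body><a href=\"a.html\">x</a>"
        + "<object data=\"doc.pdf\"></object></body></html>";
    System.out.println(extractLinksFromXhtml(page));
  }
}
```

Keeping the element-to-attribute mapping in one table means that supporting another element is a one-line change. A real implementation would also need to resolve relative URLs against the page base and handle elements with more than one link attribute, both of which this sketch leaves out.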