[ https://issues.apache.org/jira/browse/TIKA-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-2391:
------------------------------
Attachment: testScripts.htm, proposed_output.txt
I downloaded the HTML page suggested by [~jayesh] on TIKA-2382, and I've dumped
the proposed output in the RecursiveParserWrapper format.
There are 10 metadata objects: the first contains the main page, and the
other 9 are the scripts.
I'm not sure what we should do with the {{src=}} info when a script relies on
an external resource rather than inlining the code.
Dumb question: what other script types besides JavaScript can we have? Should
we have a mapping from {{type=}} to mimetype that we can pass in to the child's
metadata?
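To make the mapping question concrete, here is a rough sketch of what a {{type=}} to mimetype lookup might look like. Everything in it is an assumption for discussion: the class name {{ScriptTypeMapper}} is hypothetical and not part of Tika, the list of known types is incomplete, and picking {{application/javascript}} as the canonical JavaScript type (rather than {{text/javascript}}) is just one defensible choice. Unknown types fall back to {{application/octet-stream}}.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, NOT an existing Tika class: maps the value of a
// <script type="..."> attribute to a mimetype string for the child's metadata.
final class ScriptTypeMapper {
    // Assumption: application/javascript is used as the canonical JavaScript type.
    static final String JAVASCRIPT = "application/javascript";
    // Assumption: unrecognized types fall back to a generic binary type.
    static final String FALLBACK = "application/octet-stream";

    // Illustrative, incomplete list of type values seen on <script> elements.
    private static final Map<String, String> KNOWN = Map.of(
            "text/javascript", JAVASCRIPT,
            "application/javascript", JAVASCRIPT,
            "text/ecmascript", JAVASCRIPT,
            "application/ecmascript", JAVASCRIPT,
            "module", JAVASCRIPT,
            "application/json", "application/json",
            "application/ld+json", "application/ld+json",
            "text/vbscript", "text/vbscript");

    static String toMimeType(String typeAttr) {
        // A missing or empty type attribute is treated as JavaScript.
        if (typeAttr == null || typeAttr.trim().isEmpty()) {
            return JAVASCRIPT;
        }
        String t = typeAttr.trim().toLowerCase(Locale.ROOT);
        // Drop parameters such as "; charset=utf-8".
        int semi = t.indexOf(';');
        if (semi >= 0) {
            t = t.substring(0, semi).trim();
        }
        return KNOWN.getOrDefault(t, FALLBACK);
    }
}
```

With something like this, the handler could call {{ScriptTypeMapper.toMimeType(typeAttr)}} and set the result as the content type on each script's metadata object. Whether the fallback should be octet-stream or the raw {{type=}} value is an open question.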
For now, we're still ignoring {{<style>}} elements.
I'd want to require users to turn this behavior on via an HTMLParserConfig.
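As a sketch of the opt-in idea only: the class below is hypothetical (it is not an existing Tika API, and the name {{HtmlScriptExtractionConfig}} and its field are made up for illustration). The point is just that script extraction would default to off, so current behavior is unchanged unless a user flips the switch.

```java
// Hypothetical opt-in switch; the class and member names are illustrative
// and do not correspond to an existing Tika class.
final class HtmlScriptExtractionConfig {
    // Off by default so existing users see no change in output.
    private boolean extractScripts = false;

    boolean isExtractScripts() {
        return extractScripts;
    }

    void setExtractScripts(boolean extractScripts) {
        this.extractScripts = extractScripts;
    }
}
```

A caller would create the config, call {{setExtractScripts(true)}}, and hand it to the parser however we decide configs should be passed in.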
Big question: what do you think? Are there other areas for improvement?
> Extract js in html as "attachment" type MACRO like we do in the PDFParser
> -------------------------------------------------------------------------
>
> Key: TIKA-2391
> URL: https://issues.apache.org/jira/browse/TIKA-2391
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: proposed_output.txt, testScripts.htm
>
>