[ 
https://issues.apache.org/jira/browse/TIKA-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2391:
------------------------------
    Attachment: testScripts.htm
                proposed_output.txt

I downloaded the HTML page suggested by [~jayesh] on TIKA-2382 and dumped
the proposed output in the RecursiveParserWrapper format.

There are 10 metadata objects: the first contains the main page, and the
other 9 each contain a script.

I'm not sure what we should do with the {{src=}} info when a script relies on
an external resource rather than inlining the code.

Dumb question: what other types besides js can we have?  Should we have a
mapping from {{type=}} to MIME type that we can pass into the child's metadata?
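
To make the mapping idea concrete, here is a rough sketch of what such a lookup might look like. Everything in it is an assumption for discussion, not existing Tika code: the class name {{ScriptTypeMapper}}, the choice of {{application/javascript}} as the normalized value, and the list of JavaScript aliases are all hypothetical. It treats a missing or empty {{type=}} as JavaScript (the HTML default), folds the common JavaScript aliases into one value, and passes any other value through lowercased with parameters stripped.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika: maps a <script type="..."> value
// to a MIME type string that could be stored in the child's metadata.
final class ScriptTypeMapper {

    // Assumption: normalize every JavaScript variant to this one value.
    static final String JAVASCRIPT = "application/javascript";

    // Assumption: a small, non-exhaustive set of values treated as JavaScript.
    private static final Set<String> JS_TYPES = new HashSet<>(Arrays.asList(
            "text/javascript", "application/javascript",
            "application/x-javascript", "text/ecmascript",
            "application/ecmascript", "module"));

    private ScriptTypeMapper() {
    }

    static String toMimeType(String typeAttr) {
        // A missing type attribute means JavaScript in HTML.
        if (typeAttr == null) {
            return JAVASCRIPT;
        }
        String t = typeAttr.trim().toLowerCase(Locale.ROOT);
        // Drop parameters such as "; charset=utf-8".
        int semi = t.indexOf(';');
        if (semi >= 0) {
            t = t.substring(0, semi).trim();
        }
        if (t.isEmpty() || JS_TYPES.contains(t)) {
            return JAVASCRIPT;
        }
        // Anything else (e.g. application/ld+json, text/vbscript) is
        // passed through unchanged apart from lowercasing.
        return t;
    }
}
```

With this sketch, {{ScriptTypeMapper.toMimeType("Text/JavaScript; charset=utf-8")}} and {{ScriptTypeMapper.toMimeType(null)}} both give {{application/javascript}}, while {{application/ld+json}} comes back as is. Whether unknown values should be passed through or rejected is an open design choice.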

For now, we're still ignoring {{<style>}} elements.

I'd want to require users to turn this behavior on via an HTMLParserConfig.

Big question: what do you think?  Are there other areas for improvement?

> Extract js in html as "attachment" type MACRO like we do in the PDFParser
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2391
>                 URL: https://issues.apache.org/jira/browse/TIKA-2391
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: proposed_output.txt, testScripts.htm
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
