[ 
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15487662#comment-15487662
 ] 

Nick Burch commented on TIKA-2069:
----------------------------------

I think the idea of a macro is probably general enough across a range of file 
formats that we could add it as an embedded type.

However, there are actually two levels to an OOXML macro. The OOXML file contains a 
binary {{vbaProject.bin}} file, and within that is the actual macro text plus its 
properties. Maybe we should have the OOXML extractor first expose an 
{{application/vnd.ms-office.vbaProject}} embedded resource, then use a second 
parser which extracts the body of the macro VBA source as {{text/x-vbasic}}, with 
the other macro properties/attributes (name, sid, various boolean flags) as 
metadata?

e.g. {{application/vnd.ms-excel.sheet.macroenabled.12}} -> 
{{application/vnd.ms-office.vbaProject}} -> {{text/x-vbasic}} + metadata
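
A rough sketch of that nesting, in plain Java with no Tika classes involved. The class name, the lookup table and the {{chain}} method are all hypothetical and only model which media type would expose which embedded type; this is not how any existing Tika parser is wired up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MacroChainSketch {
    // Hypothetical table: container media type -> media type of the
    // embedded resource that a parser for the container would expose.
    static final Map<String, String> EMBEDDED = new LinkedHashMap<>();
    static {
        EMBEDDED.put("application/vnd.ms-excel.sheet.macroenabled.12",
                     "application/vnd.ms-office.vbaProject");
        EMBEDDED.put("application/vnd.ms-office.vbaProject",
                     "text/x-vbasic");
    }

    // Follow the table from the outer document type down to the innermost
    // embedded type, stopping if a type repeats so a cycle cannot loop forever.
    static List<String> chain(String start) {
        List<String> out = new ArrayList<>();
        String cur = start;
        while (cur != null && !out.contains(cur)) {
            out.add(cur);
            cur = EMBEDDED.get(cur);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" -> ",
            chain("application/vnd.ms-excel.sheet.macroenabled.12")));
    }
}
```

The point of keeping the two steps separate is that the second step would only need to understand the VBA project, so it could in principle be reused by any container format that exposes one.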

> Extract Macro text from Microsoft Office documents
> --------------------------------------------------
>
>                 Key: TIKA-2069
>                 URL: https://issues.apache.org/jira/browse/TIKA-2069
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector, parser
>    Affects Versions: 1.13
>         Environment: RHEL 5.x, Apache Tomcat
>            Reporter: Jeff Swindle
>              Labels: features
>         Attachments: excel-macro.PNG, test-macro-doc.docm, 
> test-macro-doc.docm-tika-app-output.txt, word-macro.PNG, xlsmacro.xlsm, 
> xlsmacro.xlsm.tika-app-output.txt
>
>
> Tika supports macro-enabled Microsoft Office documents by extracting metadata 
> and contents; however, macros within the document are not included in the 
> metadata or content output.
> The desire is to have the macro text extracted as well.
> Info regarding macro extraction: http://www.decalage.info/vba_tools



