[
https://issues.apache.org/jira/browse/TIKA-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15509825#comment-15509825
]
Tim Allison commented on TIKA-2069:
-----------------------------------
I think I get it. One challenge is that we're currently getting a
{{Map<String, String>}} from POI, so there doesn't currently seem to be an
obvious way to link metadata to the actual text. On POI's test doc,
with this code:
{noformat}
VBAMacroReader reader = new VBAMacroReader(fs);
// readMacros() returns one entry per VBA module; the value is the macro source
for (Map.Entry<String, String> e : reader.readMacros().entrySet()) {
    Metadata m = new Metadata();
    m.set(Metadata.EMBEDDED_RESOURCE_TYPE,
            TikaCoreProperties.EmbeddedResourceType.MACRO.toString());
    m.set(Metadata.CONTENT_TYPE, "text/x-vbasic");
    // fall back to the default extractor if none is set on the context
    EmbeddedDocumentExtractor ex =
            context.get(EmbeddedDocumentExtractor.class);
    if (ex == null) {
        ex = new ParsingEmbeddedDocumentExtractor(context);
    }
    if (ex.shouldParseEmbedded(m)) {
        // hand the macro source to the embedded parser as a UTF-8 stream
        ex.parseEmbedded(
                new ByteArrayInputStream(e.getValue().getBytes(StandardCharsets.UTF_8)),
                xhtml, m, true);
    }
}
{noformat}
we get:
{noformat}
1: X-Parsed-By : org.apache.tika.parser.DefaultParser
1: X-Parsed-By : org.apache.tika.parser.txt.TXTParser
1: embeddedResourceType : MACRO
1: Content-Encoding : windows-1252
1: X-TIKA:parse_time_millis : 27
1: X-TIKA:content : <html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.txt.TXTParser" />
<meta name="embeddedResourceType" content="MACRO" />
<meta name="Content-Encoding" content="windows-1252" />
<meta name="X-TIKA:embedded_resource_path" content="/embedded-1" />
<meta name="Content-Type" content="text/plain; charset=windows-1252" />
<title></title>
</head>
<body><p>Attribute VB_Name = "Module1"
Sub TestMacro()
'
' TestMacro Macro
' This is a test macro
'
'
ActiveDocument.Paragraphs(1).Range.Text = "This is a macro word processing document"
End Sub
</p>
</body></html>
1: X-TIKA:embedded_resource_path : /embedded-1
1: Content-Type : text/plain; charset=windows-1252
2: X-Parsed-By : org.apache.tika.parser.DefaultParser
2: X-Parsed-By : org.apache.tika.parser.txt.TXTParser
2: embeddedResourceType : MACRO
2: Content-Encoding : windows-1252
2: X-TIKA:parse_time_millis : 4
2: X-TIKA:content : <html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.txt.TXTParser" />
<meta name="embeddedResourceType" content="MACRO" />
<meta name="Content-Encoding" content="windows-1252" />
<meta name="X-TIKA:embedded_resource_path" content="/embedded-2" />
<meta name="Content-Type" content="text/plain; charset=windows-1252" />
<title></title>
</head>
<body><p>Attribute VB_Name = "ThisDocument"
Attribute VB_Base = "1Normal.ThisDocument"
Attribute VB_GlobalNameSpace = False
Attribute VB_Creatable = False
Attribute VB_PredeclaredId = True
Attribute VB_Exposed = True
Attribute VB_TemplateDerived = True
Attribute VB_Customizable = True
</p>
</body></html>
2: X-TIKA:embedded_resource_path : /embedded-2
2: Content-Type : text/plain; charset=windows-1252
{noformat}
Is this good enough for now?
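
If we do later want each macro tied back to its module, one possibility is to carry the map key through as a metadata value. The following is only a rough, standalone sketch of that idea: it uses plain {{java.util}} maps as a stand-in for Tika's {{Metadata}} so it runs without Tika or POI on the classpath, and it assumes the keys of the {{Map<String, String>}} are the module names. The class {{MacroNameSketch}}, the holder {{NamedMacro}}, the method {{toNamedMacros}}, and the choice of "resourceName" as the key are all hypothetical and not part of the code above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MacroNameSketch {

    // Hypothetical holder: per-macro metadata plus the macro source as UTF-8 bytes.
    static final class NamedMacro {
        final Map<String, String> metadata;
        final byte[] utf8Source;

        NamedMacro(Map<String, String> metadata, byte[] utf8Source) {
            this.metadata = metadata;
            this.utf8Source = utf8Source;
        }

        // Stream that could be handed to an embedded-document extractor.
        InputStream openStream() {
            return new ByteArrayInputStream(utf8Source);
        }
    }

    // Assumption: each map key is a module name and each value is that module's source.
    static List<NamedMacro> toNamedMacros(Map<String, String> macros) {
        List<NamedMacro> out = new ArrayList<>();
        for (Map.Entry<String, String> e : macros.entrySet()) {
            // Plain map standing in for a Tika Metadata object.
            Map<String, String> m = new LinkedHashMap<>();
            m.put("resourceName", e.getKey()); // assumed key for the module name
            m.put("embeddedResourceType", "MACRO");
            m.put("Content-Type", "text/x-vbasic");
            out.add(new NamedMacro(m, e.getValue().getBytes(StandardCharsets.UTF_8)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up input shaped like the map described above.
        Map<String, String> macros = new LinkedHashMap<>();
        macros.put("Module1", "Sub TestMacro()\nEnd Sub\n");
        macros.put("ThisDocument", "Attribute VB_Name = \"ThisDocument\"\n");
        for (NamedMacro nm : toNamedMacros(macros)) {
            System.out.println(nm.metadata.get("resourceName")
                    + " (" + nm.utf8Source.length + " bytes)");
        }
    }
}
```

In the real parser the same loop would set the name on the {{Metadata}} object before calling {{parseEmbedded}}; whether that is the right property to use is an open question.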
> Extract Macro text from Microsoft Office documents
> --------------------------------------------------
>
> Key: TIKA-2069
> URL: https://issues.apache.org/jira/browse/TIKA-2069
> Project: Tika
> Issue Type: Improvement
> Components: detector, parser
> Affects Versions: 1.13
> Environment: RHEL 5.x, Apache Tomcat
> Reporter: Jeff Swindle
> Labels: features
> Attachments: excel-macro.PNG, test-macro-doc.docm,
> test-macro-doc.docm-tika-app-output.txt, word-macro.PNG, xlsmacro.xlsm,
> xlsmacro.xlsm.tika-app-output.txt
>
>
> Tika supports macro-enabled Microsoft Office documents by extracting metadata
> and contents; however, macros within the document are not in the metadata or
> content output.
> The desire is to have the macro text extracted as well.
> Info regarding macro extraction: http://www.decalage.info/vba_tools