[ https://issues.apache.org/jira/browse/TIKA-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041639#comment-14041639 ]
Steve R commented on TIKA-1353:
-------------------------------
Ignore my suggested code example; it clearly doesn't work.
My question now is: why is the following code commented out? It seems to
work.
/*
 * ZipFile zipFile;
 * if (stream instanceof TikaInputStream) {
 *     TikaInputStream tis = (TikaInputStream) stream;
 *     Object container = ((TikaInputStream) stream).getOpenContainer();
 *     if (container instanceof ZipFile) {
 *         zipFile = (ZipFile) container;
 *     } else if (tis.hasFile()) {
 *         zipFile = new ZipFile(tis.getFile());
 *     }
 * }
 */
// TODO: if incoming IS is a TIS with a file
// associated, we should open ZipFile so we can
// visit metadata, mimetype first; today we lose
// all the metadata if meta.xml is hit after
// content.xml in the stream. Then we can still
// read-once for the content.xml.
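For reference, here is a minimal sketch of what the TODO seems to be asking for, using only java.util.zip and no Tika classes. With a ZipFile, entries can be looked up by name, so meta.xml can be visited before content.xml regardless of where each sits in the archive. The class and method names (MetaFirstSketch, visitMetaFirst, writeSampleArchive) are hypothetical and only for illustration; a real parser would hand each stream to its metadata or content handler instead of just recording the name.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class MetaFirstSketch {

    /** Visits meta.xml first, then content.xml, and returns the visiting order. */
    static List<String> visitMetaFirst(File archive) throws IOException {
        List<String> visited = new ArrayList<>();
        try (ZipFile zipFile = new ZipFile(archive)) {
            // Random access by name: the physical order of entries does not matter.
            for (String name : new String[] {"meta.xml", "content.xml"}) {
                ZipEntry entry = zipFile.getEntry(name);
                if (entry == null) {
                    continue; // entry not present in this archive
                }
                try (InputStream in = zipFile.getInputStream(entry)) {
                    // Hypothetical hook: a real parser would consume `in` here.
                    visited.add(name);
                }
            }
        }
        return visited;
    }

    /** Writes a small test archive with content.xml stored BEFORE meta.xml. */
    static File writeSampleArchive() throws IOException {
        File file = File.createTempFile("sample", ".odt");
        file.deleteOnExit();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(file))) {
            out.putNextEntry(new ZipEntry("content.xml"));
            out.write("<content/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("meta.xml"));
            out.write("<meta/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return file;
    }
}
```

This only applies when the input is backed by a file; for a plain stream there is no way to look ahead by name.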
> OpenDocumentParser doesn't correctly process metadata
> -----------------------------------------------------
>
> Key: TIKA-1353
> URL: https://issues.apache.org/jira/browse/TIKA-1353
> Project: Tika
> Issue Type: Bug
> Components: metadata, parser
> Affects Versions: 1.5
> Reporter: Steve R
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> When using OpenDocumentParser, the metadata isn't set correctly. When using
> it to write an HTML file, the only metadata that it knows about is the content
> type, because that is set ahead of time.
> The problem is that when iterating over the zip contents, meta.xml isn't
> processed before content.xml. The metadata set on the parse object is correct
> after parse() returns; however, the resulting HTML file is
> missing all of the metadata.
> Changing the code to be
>     boolean parsedMetaData = false;
>     boolean delayLoadContent = false;
>     while (entry != null) {
>         ...
>         } else if (entry.getName().equals("meta.xml")) {
>             meta.parse(zip, new DefaultHandler(), metadata, context);
>             parsedMetaData = true;
>             if (delayLoadContent) {
>                 if (content instanceof OpenDocumentContentParser) {
>                     ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
>                 } else {
>                     // Foreign content parser was set:
>                     content.parse(zip, handler, metadata, context);
>                 }
>             }
>         } else if (entry.getName().endsWith("content.xml")) {
>             if (!parsedMetaData) {
>                 delayLoadContent = true;
>             } else {
>                 if (content instanceof OpenDocumentContentParser) {
>                     ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
>                 } else {
>                     // Foreign content parser was set:
>                     content.parse(zip, handler, metadata, context);
>                 }
>             }
>         }
> works as expected.
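
A note on the ordering problem described above when the input is a read-once stream: a ZipInputStream cannot go back to content.xml once it has moved on to a later entry, so deferring content requires keeping the bytes somewhere. The following is only a sketch of one possible approach, using java.util.zip alone and buffering content.xml in memory until meta.xml has been seen. The names (BufferedContentSketch, processMetaFirst, sampleArchive) are hypothetical, the entry-name matching is simplified to exact names, and memory cost for large documents is ignored.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class BufferedContentSketch {

    /** Reads the current zip entry (or any stream) to its end. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns the order in which entries were processed (meta.xml before content.xml). */
    static List<String> processMetaFirst(InputStream stream) throws IOException {
        List<String> processed = new ArrayList<>();
        byte[] pendingContent = null; // content.xml bytes held back until meta.xml is seen
        boolean metaSeen = false;
        try (ZipInputStream zip = new ZipInputStream(stream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("meta.xml")) {
                    readFully(zip); // stand-in for the metadata parser
                    processed.add(name);
                    metaSeen = true;
                } else if (name.equals("content.xml")) {
                    if (metaSeen) {
                        readFully(zip); // stand-in for the content parser
                        processed.add(name);
                    } else {
                        pendingContent = readFully(zip); // defer: copy the bytes now
                    }
                }
            }
        }
        if (pendingContent != null) {
            // A real parser would read from new ByteArrayInputStream(pendingContent) here.
            processed.add("content.xml");
        }
        return processed;
    }

    /** Builds an in-memory archive with content.xml stored BEFORE meta.xml. */
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("content.xml"));
            out.write("<content/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("meta.xml"));
            out.write("<meta/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }
}
```

If meta.xml never appears, the buffered content is still processed once the stream ends, so no content is lost.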