[ https://issues.apache.org/jira/browse/TIKA-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528305#comment-17528305 ]
Tim Allison commented on TIKA-3738:
-----------------------------------
OK, what's happening is that the ForkParser in this mode effectively returns
only the XHTML. Any metadata written after the first body character is
therefore not written to the XHTML, and so is not reconstituted/extracted by
the MetadataContentHandler on the client side.
The only fix for this would be to return the Metadata object as well as the
ContentHandler.
If you use the RecursiveParserWrapper with a RecursiveParserWrapperHandler,
you'll get all of the metadata.
{noformat}
@Test
public void testForkedM4AParsing() throws Exception {
    RecursiveParserWrapper wrapper = new RecursiveParserWrapper(tika.getParser());
    int numThreads = 5;
    try (ForkParser parser = new ForkParser(
            ForkParserIntegrationTest.class.getClassLoader(), wrapper)) {
        RecursiveParserWrapperHandler contentHandler =
                new RecursiveParserWrapperHandler(new BasicContentHandlerFactory(
                        BasicContentHandlerFactory.HANDLER_TYPE.BODY, 100000));
        InputStream stream = getResourceAsStream("/test-documents/testMP4.m4a");
        ParseContext context = new ParseContext();
        context.set(Parser.class, new EmptyParser());
        Metadata metadata = new Metadata();
        parser.parse(stream, contentHandler, metadata, context);
        debug(contentHandler.getMetadataList());
        metadata = contentHandler.getMetadataList().get(0);
        assertEquals("Test Title", metadata.get(TikaCoreProperties.TITLE));
        assertEquals("Test Album Artist", metadata.get(XMPDM.ALBUM_ARTIST));
        assertEquals("Test Album", metadata.get(XMPDM.ALBUM));
        assertEquals("Test Genre", metadata.get(XMPDM.GENRE));
    }
}
{noformat}
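To make the failure mode described above concrete, here is a toy model in plain Java. It is not Tika code: the class LateMetadataDemo, the StreamedDoc type, and the "title" key are all made up for illustration. It only assumes what the comment states, namely that the head of the streamed document is fixed once the first body character is written, and that the client rebuilds metadata from that head alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only (hypothetical names, not Tika's implementation). */
public class LateMetadataDemo {

    /** Stand-in for the streamed XHTML: a head of meta entries plus body text. */
    static final class StreamedDoc {
        final Map<String, String> head = new LinkedHashMap<>();
        final StringBuilder body = new StringBuilder();
    }

    /** Stand-in for the forked-side parse; liveMetadata plays the role of the Metadata object. */
    static StreamedDoc parse(Map<String, String> liveMetadata) {
        StreamedDoc doc = new StreamedDoc();
        // Metadata known before any body output makes it into the head.
        liveMetadata.put("Content-Type", "audio/mp4");
        // The head is snapshotted when the first body character is written.
        doc.head.putAll(liveMetadata);
        doc.body.append("body text");
        // Metadata set after the body has started never reaches the head.
        liveMetadata.put("title", "Test Title");
        return doc;
    }

    /** Stand-in for the client side, which sees only the streamed document. */
    static Map<String, String> reconstitute(StreamedDoc doc) {
        return new LinkedHashMap<>(doc.head);
    }

    public static void main(String[] args) {
        Map<String, String> forkedSide = new LinkedHashMap<>();
        StreamedDoc doc = parse(forkedSide);
        System.out.println("rebuilt from head only: " + reconstitute(doc));
        System.out.println("metadata object itself: " + forkedSide);
    }
}
```

In this sketch the map rebuilt from the head lacks the late "title" entry, while the map that was passed into the parse has it. That is the sense in which returning the Metadata object alongside the handler output would close the gap.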
> ForkParser missing metadata for some document formats
> -----------------------------------------------------
>
> Key: TIKA-3738
> URL: https://issues.apache.org/jira/browse/TIKA-3738
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0
> Environment: Java 11.0.14.
> Reporter: Stephen H
> Priority: Major
> Attachments: ForkParserIntegrationTest.java.diff,
> testVideoMetadataMp4.mp4
>
>
> When using ForkParser, metadata from some parsers is not being returned in
> the Metadata object or in the head of the returned XML. These include
> OpenDocument Presentation (ODP), OpenDocument Spreadsheet (ODS), Microsoft
> Word 2006 XML, MP4 Audio (M4A) and MP4 Video (MP4).
> Patch for ForkParserIntegrationTest showing the issue for these file types is
> attached, along with an MP4 video file containing metadata as there doesn't
> appear to be one currently in the test set.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)