[ https://issues.apache.org/jira/browse/TIKA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492536#comment-15492536 ]
Tim Barrett commented on TIKA-2058:
-----------------------------------
private void processFileEmbeddedInMsg(InformationGranule msgGranule, Path resourceFilePath,
        ResourceSet parentResourceSet, AttachmentChunks attachment)
        throws IOException, Throwable, SAXException, TikaException {
    ByteArrayInputStream byteInputStream = null;
    try {
        if (attachment.attachData != null) {
            boolean isEmbeddedInMessage = true;
            Path msgGranuleParentPath = resourceFilePath.getParent();
            byteInputStream = new ByteArrayInputStream(attachment.attachData.getValue());
            String embeddedFileName = null;
            if (attachment.attachLongFileName != null
                    && !attachment.attachLongFileName.toString().isEmpty()) {
                embeddedFileName = attachment.attachLongFileName.toString();
            } else if (attachment.attachFileName != null
                    && !attachment.attachFileName.toString().isEmpty()) {
                embeddedFileName = attachment.attachFileName.toString();
            }
            if (embeddedFileName != null) {
                if (embeddedFileName.length() > 200) {
                    logger.warn("Embedded attachment has filename longer than 200 characters: "
                            + embeddedFileName);
                    String embeddedFileExtension =
                            NalandaStringUtilities.getTailLastOccurrence('.', embeddedFileName);
                    StringBuilder strBldrEmbeddedFileName = new StringBuilder();
                    strBldrEmbeddedFileName.append(UUID.randomUUID().toString());
                    strBldrEmbeddedFileName.append(".");
                    strBldrEmbeddedFileName.append(embeddedFileExtension);
                    embeddedFileName = strBldrEmbeddedFileName.toString();
                    logger.warn("Embedded attachment has filename with long name saved as "
                            + embeddedFileName);
                }
                NalandaResourceHandler attachmentResourceHandler =
                        new NalandaResourceHandler(this.parentResourceSet, this.jsonParseFailures,
                                this.jsonPasswordFailures, this.filesCouldNotParseList);
                Path embeddedResourcePath = this.writeAttachmentToAttachmentsFolder(
                        (Resource) msgGranule, embeddedFileName, byteInputStream,
                        parentResourceSet, msgGranuleParentPath, false, null);
                if (ResourceSetAccessor.getResourceType(new File(embeddedFileName))
                        .equals(RESOURCE_TYPE.ZIP)) {
                    embeddedResourcePath = embeddedResourcePath.getParent();
                }
                attachmentResourceHandler.processEmbeddedResource(msgGranule, embeddedFileName,
                        null, parentResourceSet, embeddedResourcePath,
                        null, null, null, isEmbeddedInMessage);
            }
        }
    } finally {
        if (byteInputStream != null) {
            byteInputStream.close();
        }
    }
}
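
For anyone who wants to try the file-name handling on its own, here is a minimal standalone sketch of that part of the method. The class name AttachmentNames, its two helper methods and the MAX_NAME_LENGTH constant are hypothetical and are not part of our application or of Tika/POI. It also assumes that NalandaStringUtilities.getTailLastOccurrence('.', name) returns the text after the last '.', which is approximated here with String.lastIndexOf.

```java
import java.util.UUID;

public class AttachmentNames {
    // Hypothetical constant mirroring the 200-character check in the method above.
    static final int MAX_NAME_LENGTH = 200;

    /** Prefer the long file name, fall back to the short one; null if neither is usable. */
    public static String chooseName(String longName, String shortName) {
        if (longName != null && !longName.isEmpty()) {
            return longName;
        }
        if (shortName != null && !shortName.isEmpty()) {
            return shortName;
        }
        return null;
    }

    /** Replace an over-long name with "<random UUID>.<extension>"; short names pass through. */
    public static String shorten(String name) {
        if (name.length() <= MAX_NAME_LENGTH) {
            return name;
        }
        // Assumption: the extension is whatever follows the last '.' in the name.
        int dot = name.lastIndexOf('.');
        String extension = dot >= 0 ? name.substring(dot + 1) : "";
        String uuid = UUID.randomUUID().toString();
        return extension.isEmpty() ? uuid : uuid + "." + extension;
    }

    public static void main(String[] args) {
        StringBuilder longName = new StringBuilder();
        for (int i = 0; i < 250; i++) {
            longName.append('x');
        }
        longName.append(".docx");
        String chosen = chooseName(longName.toString(), "short.docx");
        // Prints a random UUID followed by ".docx"; the UUID differs on every run.
        System.out.println(shorten(chosen));
    }
}
```

Unlike the original method, this sketch omits the trailing "." when the name has no extension at all; that is a choice made for the sketch, not what the code above does.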
> Memory Leak in Tika version 1.13 when parsing millions of files
> ---------------------------------------------------------------
>
> Key: TIKA-2058
> URL: https://issues.apache.org/jira/browse/TIKA-2058
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.13
> Reporter: Tim Barrett
> Attachments: Yourkit screenshot.png, poi-3.15-beta1-p1.jar,
> poi-3.15-beta1-p1.pom, prevents-OOM-when-writable-is-false.patch,
> screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> We have an application using Tika that parses roughly 7,000,000 files of
> different types; many of them are MSG files with attachments. It works
> correctly with Tika 1.9 and has been in production for over a year, with
> parsing runs taking place every few weeks. With Tika 1.13, the same
> application runs out of memory (Java heap).
> I have used lsof and File Leak Detector to look for open files, but
> neither shows any open files while the application is running. I did find an
> issue with open files (https://issues.apache.org/jira/browse/TIKA-2015),
> but there was a workaround for it, so it is not the cause here.
> I am sorry to report this so vaguely, but with lsof turning up nothing
> I am a bit stuck as to how to investigate further. We are more than
> willing to help by testing any ideas provided.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)