Tim Allison created TIKA-4713:
---------------------------------

             Summary: Make resourcename and internal path consistent across parsers for embedded files
                 Key: TIKA-4713
                 URL: https://issues.apache.org/jira/browse/TIKA-4713
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


On TIKA-4705, thanks to [~iachimoe], I noticed that the package parsers 
include path information in the resource name.

I asked an agent to go through the parsers in tika-parsers-standard, report on 
how each parser pre-processes names before setting the resource name, and 
check which parsers use INTERNAL_PATH (TIKA-4630).

We should do a sweep of our parsers, or better yet, centralize the processing 
of resource name and internal path for embedded files.
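
As a rough illustration of what a centralized helper could look like, here is a minimal sketch. Everything in it is hypothetical: the class name, the method names, and the two string keys are placeholders, not Tika's API, and the policy it applies (name-only resource name, full normalized path as internal path) is only one possible choice that this issue has not settled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika.
class EmbeddedNameSketch {
    // Placeholder keys (assumption); real code would use Tika's metadata properties.
    static final String RESOURCE_NAME = "resourceName";
    static final String INTERNAL_PATH = "internalPath";

    /** Normalizes separators so Windows-style and Unix-style entry names look the same. */
    static String normalizePath(String rawEntryName) {
        return rawEntryName.replace('\\', '/');
    }

    /** Returns the last path segment, ignoring any trailing separators. */
    static String nameOnly(String normalizedPath) {
        int end = normalizedPath.length();
        while (end > 0 && normalizedPath.charAt(end - 1) == '/') {
            end--;
        }
        int start = normalizedPath.lastIndexOf('/', end - 1) + 1;
        return normalizedPath.substring(start, end);
    }

    /** Builds both values from one raw entry name so every caller applies the same rule. */
    static Map<String, String> describe(String rawEntryName) {
        String path = normalizePath(rawEntryName);
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(RESOURCE_NAME, nameOnly(path));
        metadata.put(INTERNAL_PATH, path);
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(describe("docs\\sub/report.pdf"));
        System.out.println(describe("report.pdf"));
    }
}
```

If each parser passed its raw entry name through a single method like describe(), the differences listed in the tables below would reduce to one decision made in one place.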

 

These are the tables from the agent:
h3. Parsers that include path in RESOURCE_NAME_KEY
||Module||Class||Sets INTERNAL_PATH?||Sets ORIGINAL_RESOURCE_NAME?||
|tika-parser-pkg-module|AbstractArchiveParser (tar, 7z, ar, cpio)|Yes (same value)|No|
|tika-parser-pkg-module|ZipParser|Yes (same value)|No|
|tika-parser-miscoffice-module|EpubParser|No|No|
|tika-parser-microsoft-module|AbstractPOIFSExtractor (OLE10Native case)|No|No|
h3. Parsers that use name-only (majority)
||Module||Class||Uses FilenameUtils.getName()?||Sets ORIGINAL_RESOURCE_NAME?||
|tika-parser-pkg-module|UnrarParser|Yes|Yes|
|tika-parser-pkg-module|CompressorParser|No (already name-only from gzip header)|No|
|tika-parser-pdf-module|AbstractPDF2XHTML|No|Yes|
|tika-parser-microsoft-module|AbstractOOXMLExtractor (thumbnails)|Yes|No|
|tika-parser-microsoft-module|AbstractOOXMLExtractor (OLE embedded)|No|Yes|
|tika-parser-microsoft-module|AbstractPOIFSExtractor (non-OLE10 cases)|No|No|
|tika-parser-microsoft-module|EmailVisitor (libpst)|No (uses getFileName())|Yes (via INTERNAL_PATH)|
|tika-parser-microsoft-module|OutlookPSTParser|No|No (but sets INTERNAL_PATH with folder path)|
|tika-parser-microsoft-module|OfficeParser (VBA macros)|No|No|
|tika-parser-microsoft-module|TNEFParser|No|No|
|tika-parser-microsoft-module|WordMLParser|No|Yes|
|tika-parser-microsoft-module|RTFEmbObjHandler|Yes|Yes|
|tika-parser-microsoft-module|RTFObjDataParser|Yes|Yes|
|tika-parser-microsoft-module|RTFObjDataStreamParser|Yes|Yes|
|tika-parser-microsoft-module|RTFEmbeddedHandler (jflex)|Yes|Yes|
|tika-parser-mail-module|MailContentHandler|No|No|
|tika-parser-miscoffice-module|OpenDocumentParser|No|Yes (via INTERNAL_PATH)|
|tika-parser-miscoffice-module|FlatOpenDocumentMacroHandler|No|No|
|tika-parser-webarchive-module|WARCParser|No|No|
|tika-parser-webarchive-module|WACZParser|No|Yes (via INTERNAL_PATH)|
|tika-parser-xml-module|FictionBookParser|No|No|
|tika-parser-apple-module|IWork13PackageParser|No|No|
h3. Parsers that generate synthetic names
||Module||Class||Pattern||
|tika-parser-pdf-module|ImageGraphicsEngine|image-N.ext|
|tika-parser-microsoft-module|PSTMailItemParser|subject.msg|
|tika-parser-mail-module|MailContentHandler|subject.eml|
|tika-parser-code-module|XHTMLClassVisitor|className.class|
|tika-parser-jdbc-commons|JDBCTableReader|column_row.txt|
|tika-core|EmbeddedDocumentUtil|embedded-N.ext or thumbnail-N.ext|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
