[ https://issues.apache.org/jira/browse/TIKA-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289426#comment-16289426 ]
Hudson commented on TIKA-2524:
------------------------------
SUCCESS: Integrated in Jenkins build Tika-trunk #1415 (See
[https://builds.apache.org/job/Tika-trunk/1415/])
TIKA-2524 -- add an XPS parser (tallison:
[https://github.com/apache/tika/commit/533354d1a0e7e0b3b4ced7b7958f6e34041bf502])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xps/XPSPageContentHandler.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xps/XPSExtractorDecorator.java
* (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xps/XPSTextExtractor.java
* (edit) CHANGES.txt
* (add) tika-parsers/src/test/resources/test-documents/testXPS_various.xps
* (add) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xps/XPSParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.java
> Create/integrate a parser for XPS
> ---------------------------------
>
> Key: TIKA-2524
> URL: https://issues.apache.org/jira/browse/TIKA-2524
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.16
> Reporter: Peter Davies
> Assignee: Tim Allison
> Labels: features
> Fix For: 2.0, 1.18
>
> Attachments: A3S3TDRXL6DN2AN3NU2OE5L7KGFY6DZA.xps, WithBiDi.xps,
> doc_xps.xps
>
>
> When we parse XPS files with AutoDetectParser, we always get an empty string.
> DefaultDetector.detect() correctly detects the MediaType as
> "application/vnd.ms-xpsdocument".
> However, this page
> https://tika.apache.org/1.16/formats.html
> suggests that XPS (application/vnd.ms-xpsdocument) is supported.
> Our code:
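> import java.io.InputStream;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.sax.BodyContentHandler;
>
> // Load the test document from the classpath.
> InputStream bis = this.getClass().getResourceAsStream("/" + EXPECTED_LOCATION + "doc_xps.xps");
> Metadata metadata = new Metadata();
> BodyContentHandler handler = new BodyContentHandler();
> AutoDetectParser parser = new AutoDetectParser();
> TikaInputStream tikaStream = TikaInputStream.get(bis);
> parser.parse(tikaStream, handler, metadata);
> // For XPS input, parsedText is always an empty string.
> String parsedText = handler.toString();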
> I will attach doc_xps.xps if I can.
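
For context on what an XPS text extractor has to do: an XPS file is a ZIP-based package whose pages are stored as FixedPage parts (entries ending in .fpage), and the visible text sits in the UnicodeString attribute of Glyphs elements. The following is a minimal sketch using only the JDK, not the code from the commit above. The class name XpsTextSketch and the helper samplePackage are hypothetical, and the sketch assumes page order equals ZIP entry order and ignores glyph Indices, escaping rules, and metadata.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Tika.
class XpsTextSketch {

    /** Returns the UnicodeString of every Glyphs element in each .fpage part, one per line. */
    static String extractText(InputStream xps) throws Exception {
        final StringBuilder out = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Page markup has no DOCTYPE; refuse one to avoid external entity tricks.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("Glyphs".equals(localName)) {
                    // Unprefixed attributes live in the empty namespace.
                    String text = atts.getValue("", "UnicodeString");
                    if (text != null) {
                        out.append(text).append('\n');
                    }
                }
            }
        };

        try (ZipInputStream zip = new ZipInputStream(xps)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().toLowerCase(Locale.ROOT).endsWith(".fpage")) {
                    continue;
                }
                // Buffer the entry so the SAX parser cannot close the ZIP stream.
                ByteArrayOutputStream page = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = zip.read(buf)) != -1) {
                    page.write(buf, 0, n);
                }
                factory.newSAXParser().parse(
                        new InputSource(new ByteArrayInputStream(page.toByteArray())), handler);
            }
        }
        return out.toString();
    }

    /** Builds a tiny in-memory package with one page; text must not contain XML special characters. */
    static byte[] samplePackage(String text) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("Documents/1/Pages/1.fpage"));
            String page = "<FixedPage xmlns=\"http://schemas.microsoft.com/xps/2005/06\">"
                    + "<Glyphs UnicodeString=\"" + text + "\"/></FixedPage>";
            zip.write(page.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] xps = samplePackage("Hello XPS");
        System.out.print(extractText(new ByteArrayInputStream(xps)));
    }
}
```

The in-memory sample package is not a complete, valid XPS document (it lacks content types and relationships); it only serves to exercise the page-text path. On a real file, the same extractText call could be fed a FileInputStream.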