[
https://issues.apache.org/jira/browse/TIKA-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nick Burch updated TIKA-2524:
-----------------------------
Attachment: WithBiDi.xps
I've used PowerPoint 2010 to create an XPS file containing some English and
Arabic text (various ways of saying hello). Hopefully that helps with
understanding LTR vs RTL text! It should be fine to use for unit testing too.
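For anyone who wants to see how LTR and RTL runs show up inside an XPS package,
here is a minimal sketch using only the JDK. It is not Tika's parser and not a
proposal for how the real one should work. The class name XpsGlyphSketch and the
Run type are made up for illustration. It assumes the page parts are ZIP entries
ending in .fpage, that text sits in the UnicodeString attribute of Glyphs
elements, and that an odd BidiLevel (default 0) marks a right-to-left run. It
ignores the Indices attribute, glyph positioning, and interleaved package parts.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only; not part of Tika.
class XpsGlyphSketch {

    /** One text run taken from a single Glyphs element. */
    static final class Run {
        final String text;
        final int bidiLevel;

        Run(String text, int bidiLevel) {
            this.text = text;
            this.bidiLevel = bidiLevel;
        }

        /** Odd embedding levels are right-to-left. */
        boolean isRightToLeft() {
            return (bidiLevel & 1) == 1;
        }
    }

    /** Collects the text runs of every *.fpage part, in ZIP entry order. */
    static List<Run> extractRuns(InputStream xps) throws Exception {
        final List<Run> runs = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot trigger XXE.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (!"Glyphs".equals(localName)) {
                    return;
                }
                String text = atts.getValue("", "UnicodeString");
                if (text == null || text.isEmpty()) {
                    return;
                }
                String level = atts.getValue("", "BidiLevel");
                runs.add(new Run(text, level == null ? 0 : Integer.parseInt(level.trim())));
            }
        };

        try (ZipInputStream zip = new ZipInputStream(xps)) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().toLowerCase(Locale.ROOT).endsWith(".fpage")) {
                    continue;
                }
                // Copy the entry first so the XML parser never touches the ZIP stream itself.
                ByteArrayOutputStream page = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buf)) != -1) {
                    page.write(buf, 0, n);
                }
                SAXParser parser = factory.newSAXParser();
                parser.parse(new ByteArrayInputStream(page.toByteArray()), handler);
            }
        }
        return runs;
    }
}
```

Passing a FileInputStream for an XPS file such as the attached WithBiDi.xps to
XpsGlyphSketch.extractRuns(...) would list each run together with its level.
The actual output for that file is not shown here.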
> Create/integrate a parser for XPS
> ---------------------------------
>
> Key: TIKA-2524
> URL: https://issues.apache.org/jira/browse/TIKA-2524
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Peter Davies
> Labels: features
> Attachments: A3S3TDRXL6DN2AN3NU2OE5L7KGFY6DZA.xps, WithBiDi.xps,
> doc_xps.xps
>
>
> When we parse XPS files using the AutoDetectParser we always get an empty string.
> If we use DefaultDetector.detect() it correctly detects the MediaType as
> "application/vnd.ms-xpsdocument".
> However, this page
> https://tika.apache.org/1.16/formats.html
> suggests that XPS (application/vnd.ms-xpsdocument) is supported.
> Our code:
> InputStream bis = this.getClass().getResourceAsStream("/" +
>     EXPECTED_LOCATION + "doc_xps.xps");
> Metadata metadata = new Metadata();
> BodyContentHandler handler = new BodyContentHandler();
> AutoDetectParser parser = new AutoDetectParser();
> TikaInputStream tikaStream = TikaInputStream.get(bis);
> parser.parse(tikaStream, handler, metadata);
> String parsedText = handler.toString();
> I will attach doc_xps.xps if I can.