[ https://issues.apache.org/jira/browse/TIKA-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Davies updated TIKA-2524:
-------------------------------
Description:
When we parse XPS files using AutoDetectParser, the extracted text is always
an empty string. DefaultDetector.detect() correctly detects the MediaType as
"application/vnd.ms-xpsdocument", and this page
https://tika.apache.org/1.16/formats.html
suggests that XPS (application/vnd.ms-xpsdocument) is supported.
Our code:
    InputStream bis = this.getClass().getResourceAsStream(
            "/" + EXPECTED_LOCATION + "doc_xps.xps");
    Metadata metadata = new Metadata();
    BodyContentHandler handler = new BodyContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    TikaInputStream tikaStream = TikaInputStream.get(bis);
    parser.parse(tikaStream, handler, metadata);
    String parsedText = handler.toString();
I will attach doc_xps.xps if I can.
was:
When we parse XPS files using AutoDetectParser, the extracted text is always
an empty string. DefaultDetector.detect() correctly detects the MediaType as
"application/vnd.ms-xpsdocument", and this page
https://tika.apache.org/1.16/formats.html
suggests that XPS (application/vnd.ms-xpsdocument) is supported.
Our code:
    InputStream bis = new BufferedInputStream(
            this.getClass().getResourceAsStream(
                    "/" + EXPECTED_LOCATION + "doc_xps.xps"));
    Metadata metadata = new Metadata();
    BodyContentHandler handler = new BodyContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    TikaInputStream tikaStream = TikaInputStream.get(bis);
    parser.parse(tikaStream, handler, metadata);
    String parsedText = handler.toString();
I will attach doc_xps.xps if I can.
> Apache Tika returns empty string when parsing text from XPS files
> -----------------------------------------------------------------
>
> Key: TIKA-2524
> URL: https://issues.apache.org/jira/browse/TIKA-2524
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.16
> Reporter: Peter Davies
> Labels: features
> Attachments: doc_xps.xps
>
>
> When we parse XPS files using AutoDetectParser, the extracted text is always
> an empty string. DefaultDetector.detect() correctly detects the MediaType as
> "application/vnd.ms-xpsdocument", and this page
> https://tika.apache.org/1.16/formats.html
> suggests that XPS (application/vnd.ms-xpsdocument) is supported.
> Our code:
>     InputStream bis = this.getClass().getResourceAsStream(
>             "/" + EXPECTED_LOCATION + "doc_xps.xps");
>     Metadata metadata = new Metadata();
>     BodyContentHandler handler = new BodyContentHandler();
>     AutoDetectParser parser = new AutoDetectParser();
>     TikaInputStream tikaStream = TikaInputStream.get(bis);
>     parser.parse(tikaStream, handler, metadata);
>     String parsedText = handler.toString();
> I will attach doc_xps.xps if I can.
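
One way to narrow this down, independent of Tika, might be to check whether
the file carries any extractable text at all. The sketch below is not from
the report and not part of Tika. It assumes that an XPS file is a ZIP
package, that page parts have names ending in ".fpage", and that visible
text is stored in the UnicodeString attribute of Glyphs elements. The class
name XpsTextProbe and its method are hypothetical, and readAllBytes()
requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic helper; it only illustrates the assumptions above.
class XpsTextProbe {

    // Returns the UnicodeString value of every Glyphs element found in the
    // ".fpage" entries of the given XPS (ZIP) stream, in entry order.
    static List<String> extractUnicodeStrings(InputStream xps) throws Exception {
        List<String> runs = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        try (ZipInputStream zip = new ZipInputStream(xps)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (entry.isDirectory() || !name.endsWith(".fpage")) {
                    continue;
                }
                // Copy the current entry so the XML parser cannot close the ZIP stream.
                byte[] page = zip.readAllBytes();
                factory.newSAXParser().parse(new ByteArrayInputStream(page), new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if ("Glyphs".equals(localName)) {
                            String text = atts.getValue("UnicodeString");
                            if (text != null && !text.isEmpty()) {
                                runs.add(text);
                            }
                        }
                    }
                });
            }
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: XpsTextProbe <file.xps>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            List<String> runs = extractUnicodeStrings(in);
            System.out.println(runs.size() + " text run(s) found");
            runs.forEach(System.out::println);
        }
    }
}
```

If this prints text runs for doc_xps.xps while the parse call above still
returns an empty string, the text is present in the file and the problem is
more likely on the parsing side. If it prints none, the document may store
its content in a form this simple probe does not cover.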