[ 
https://issues.apache.org/jira/browse/PDFBOX-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983086#comment-14983086
 ] 

Tilman Hausherr commented on PDFBOX-3058:
-----------------------------------------

The govdocs file 004486.pdf has this in the JSON output for 2.0:
{code}
"X-TIKA::xmp_exception": "java.io.IOException: Element type \"http:\" must be 
followed by either attribute specifications, \">\" or \"/>\".\n\tat 
org.apache.jempbox.impl.XMLUtil.parse(XMLUtil.java:81)\n\tat 
org.apache.jempbox.xmp.XMPMetadata.load(XMPMetadata.java:554)\n\tat 
org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:201)\n\tat 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:131)\n\tat 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)\n\tat 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)\n\tat 
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74)\n\tat 
org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)\n\tat 
org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:410)\n\tat 
org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)\n\tat 
org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:182)\n\tat 
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)\n\tat 
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:49)\n\tat 
java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat 
java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat 
java.lang.Thread.run(Thread.java:745)\n",
{code}
I suspect that other files have the same problem; I noticed that many govdocs 
files have one more meta element.
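
One way to check that suspicion would be to scan the 2.0 JSON output for the 
same key. The following is only a rough sketch: the class name, the method 
names and the assumption that all JSON files sit in one directory are 
hypothetical, and it uses a plain substring match instead of a real JSON 
parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of Tika or PDFBox.
class XmpExceptionScan {
    // Key exactly as it appears in the JSON excerpt above, including the quotes.
    static final String KEY = "\"X-TIKA::xmp_exception\"";

    // Plain substring check; assumes the key is written unescaped, as in the excerpt.
    static boolean hasXmpException(String json) {
        return json.contains(KEY);
    }

    // Returns the *.json files directly inside dir (assumed layout) that contain the key.
    static List<Path> findAffected(Path dir) throws IOException {
        List<Path> hits = new ArrayList<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (!Files.isRegularFile(p) || !p.toString().endsWith(".json")) {
                    continue;
                }
                String json = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                if (hasXmpException(json)) {
                    hits.add(p);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findAffected(dir)) {
            System.out.println(p);
        }
    }
}
```

Running it over the 1.8 and the 2.0 output directories and comparing the two 
lists would show whether the exception entry is the extra element.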

> Support TIKA Migration to PDFBox 2.0
> ------------------------------------
>
>                 Key: PDFBOX-3058
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3058
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Maruan Sahyoun
>         Attachments: NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_1_8_10.json, 
> NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_2_0.json, content_diffs-1.8-to-2.0.xlsx
>
>
> This issue is to track fixing issues which came up as part of TIKA-1285 
> (Upgrade to PDFBox 2.0.0 when available) mainly
> - new exceptions compared to PDFBox 1.8.x
> - regressions in text extraction
> - lower quality text extraction
> There should be individual issues to track tasks/bugs arising from that.



