Hi,

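The current trunk (0.8 RC) seems to concatenate the PDF body text from
PDFBox into a single line, dropping the line breaks. When I last tested
trunk, about a month ago, it did not. Compare the following command-line
output from PDFBox 1.3.1, a recent Tika snapshot, and Tika 0.7: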

$> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
1   ·   untitled 3   ·   2010-02-13 09:52   ·   Staffan Olsson
PDF Title For Short Document
veryshortpdfcontents

$> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="27166"/>
<meta name="subject" content="The PDF subject"/>
<meta name="Author" content="The PDF Author"/>
<meta name="Last-Modified" content="2010-02-13T08:52:56Z"/>
<meta name="AAPL:Keywords" content="keywordinsaveaspdf someotherkeyword"/>
<meta name="creator" content="Smultron"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2010-02-13T08:52:56Z"/>
<meta name="created" content="Sat Feb 13 09:52:56 CET 2010"/>
<meta name="producer" content="Mac OS X 10.6.2 Quartz PDFContext"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="resourceName" content="shortpdf.pdf"/>
<meta name="Keywords" content="keywordinsaveaspdf someotherkeyword"/>
<title>PDF Title For Short Document</title>
</head>
<body>
<div class="page">
<p>1   ·   untitled 3   ·   2010-02-13 09:52   ·   Staffan OlssonPDF Title For Short Documentveryshortpdfcontents</p>
</div>
</body>
</html>

$> java -jar tika-app-0.7.jar docs/shortpdf.pdf
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>PDF Title For Short Document</title>
</head>
<body>
<div class="page">
<p>1      untitled 3      2010-02-13 09:52      Staan Olsson
PDF Title For Short Document
veryshortpdfcontents</p>
</div>
</body>
</html>

Should I report a bug?

/Staffan