Hi,

Current trunk/0.8RC seems to concatenate the PDF body text from PDFBox into a single line. When I last tested trunk, about a month ago, it did not. See the following command-line output:
$> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
1 · untitled 3 · 2010-02-13 09:52 · Staffan Olsson
PDF Title For Short Document
veryshortpdfcontents

$> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="27166"/>
<meta name="subject" content="The PDF subject"/>
<meta name="Author" content="The PDF Author"/>
<meta name="Last-Modified" content="2010-02-13T08:52:56Z"/>
<meta name="AAPL:Keywords" content="keywordinsaveaspdf someotherkeyword"/>
<meta name="creator" content="Smultron"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2010-02-13T08:52:56Z"/>
<meta name="created" content="Sat Feb 13 09:52:56 CET 2010"/>
<meta name="producer" content="Mac OS X 10.6.2 Quartz PDFContext"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="resourceName" content="shortpdf.pdf"/>
<meta name="Keywords" content="keywordinsaveaspdf someotherkeyword"/>
<title>PDF Title For Short Document</title>sols...@mackou:~/disk1/workspace/search/test$
</head>
<body>
<div class="page">
<p>1 · untitled 3 · 2010-02-13 09:52 · Staffan OlssonPDF Title For Short Documentveryshortpdfcontents</p>
</div>
</body>
</html>

$> java -jar tika-app-0.7.jar docs/shortpdf.pdf
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>PDF Title For Short Document</title>
</head>
<body>
<div class="page">
<p>1 untitled 3 2010-02-13 09:52 Staan Olsson
PDF Title For Short Document
veryshortpdfcontents</p>
</div>
</body>
</html>

Should I report a bug?

/Staffan
