Hi,

I'm using Tika 0.7 from C# (.NET) to extract text from PDF files. It works fine in general, but it has problems with some files, for example the PDF file in the attachment. That file contains some text written vertically (without any line breaks or similar). When the text is extracted, Tika doesn't return the whole words; instead it splits them into single letters or short fragments and outputs each one as a separate 'word' (as you can see below).

Output from Tika:
################################################
Hallo das ist die ÜBERSCHRIFTHallo das ist die ÜBERSCHRIFT!! Ha llo da s is t ei n v ert ika les TE XT FE LD Hallo das ist ein anderes vertikales TEXTFELD Hallo das ist ein horizontales TEXTFELD H a ll o H al lo H a l l o
...
################################################

If anyone knows how to avoid this, please let me know. My source code follows the example shown on this page:
http://blogs.dovetailsoftware.com/blogs/kmiller/archive/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm.aspx

With best regards
Sandor Djarmati

Sandor Djarmati
Information Engineering
University of Cooperative Education Karlsruhe Student
Phone: +49 721 95018-0
Fax: +49 721 503266
www.roesberg.com <http://www.roesberg.com/>

Roesberg Engineering - Ingenieurgesellschaft mbH für Automation
Industriestr.9, 76189 Karlsruhe, Germany
Registered office: 76189 Karlsruhe
Managing directors: Ute Heimann, Ralph Roesberg
Register court Mannheim, HRB 104689
