Tika

Djarmati, Sandor Tue, 17 Aug 2010 06:12:10 -0700

Hi,
 
I'm using Tika 0.7 from C# .NET to extract text from PDF files.
It works fine in general, but it has problems with some files, for example the 
PDF file in the attachment.
This PDF file contains some text written vertically (without any line breaks 
or similar).
When the text is extracted, Tika doesn't return the whole words;
instead it takes single letters or small groups of letters and emits each as a 'word' (as you can see below).
 
Output from Tika:
 
################################################
 
Hallo das ist die ÜBERSCHRIFTHallo das ist die 
ÜBERSCHRIFT!! 
Ha
llo
 da
s is
t ei
n v
ert
ika
les
 TE
XT
FE
LD
 
 
Hallo das ist ein anderes vertikales TEXTFELD 
Hallo das ist ein horizontales TEXTFELD 
H
a
ll
o 
H
al
lo H
a
l
l
o


...
################################################
 
If anyone knows how to avoid this, please let me know.
My source code follows the example shown on this page:
http://blogs.dovetailsoftware.com/blogs/kmiller/archive/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm.aspx
 
 



With best regards 

Sandor Djarmati 



Sandor Djarmati
Information Engineering
University of Cooperative
Education Karlsruhe
Student 


Phone:   +49 721 95018-0        
Fax:     +49 721 503266 
[email protected]    
www.roesberg.com


Roesberg Engineering - Ingenieurgesellschaft mbH für Automation
Industriestr.9, 76189 Karlsruhe, Germany 

Registered office: 76189 Karlsruhe
Managing directors: Ute Heimann, Ralph Roesberg
Register court: Mannheim, HRB 104689
