FWD’ing to the Tika list (note TO: address change)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

From: Ravi Gadapa <[email protected]>
Date: Monday, June 19, 2017 at 8:56 PM
To: "[email protected]" <[email protected]>
Subject: Tesseract - OCR and Tika

I have been using Tesseract OCR with Tika for our project, and I seem to have a problem extracting data from PDF documents. Below is a sample of what it extracts:

'EldAJ. iNEIWEI‘IEI ‘IVHG El‘c'l TIVHS SEIHOJJMS TIV "8 'NOILVGNEIWINOOEIEI ElElElfliOVdflNVW iNEIWdIflOEI ElElcl SV 3|in EIWVN S.J_NE|V\ld|flOE| NO GEISVEI EIEI TIVHS HOJJMS iOEINNOOSIG iNEIWdIflOEI HO:| EIZIS ElSflzl TIV 'Z 'GEliON EISIMEIEIHLO SSEI‘INH ‘EldAJ. EltlflSO‘IONEI HS VINEIN NI EIEI TIVHS SEIHOJJMS iOEINNOOSIG HOOGiflO TIV 'L

Any suggestions?

Thanks
