Re: [CODE4LIB] pdf2txt [tesseract]

2013-10-18 Thread Christian Pietsch
Hi Padraic, I have uploaded a shell script which happens to implement Robert Haschart's recipe: https://github.com/pietsch/Data-Munging/blob/master/ocr4pdf.sh Enjoy! Christian On Fri, Oct 18, 2013 at 10:22:17AM +0100, Padraic Stack wrote: > I would love to see that bash script if you could uplo

2013-10-17 Thread Robert Haschart
On 10/17/2013 9:43 AM, Eric Lease Morgan wrote: On Oct 16, 2013, at 10:56 AM, Robert Haschart wrote: The abstract extraction routine I have been working on does use tesseract internally for doing OCR when it encounters a document that doesn't have usable full-text. I agree that tesseract is n

2013-10-17 Thread Christian Pietsch
Hi Eric, On Thu, Oct 17, 2013 at 09:43:04AM -0400, Eric Lease Morgan wrote: > Robert, can you outline the process you used to get Tesseract to do > OCR against PDF documents? I installed Tesseract a few months ago, > but I couldn't figure out how to get it to work against PDF, only some > image files.
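
For readers with the same question, here is a minimal sketch of one common approach: rasterize each PDF page to an image, then run Tesseract on every image. It is not Robert Haschart's recipe and not the contents of ocr4pdf.sh, neither of which is reproduced in this thread. It assumes the `pdftoppm` tool from poppler-utils and the `tesseract` command are installed and on the PATH; the function names, the 300 dpi default, and the `eng` language code are illustrative choices only.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    """Build the pdftoppm command that writes one PNG per page as <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, out_base, lang="eng"):
    """Build the tesseract command; tesseract appends .txt to out_base itself."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def page_number(image):
    """Extract the page number from a file name such as page-07.png."""
    return int(Path(image).stem.rsplit("-", 1)[1])


def pdf_to_text(pdf, lang="eng", dpi=300):
    """OCR every page of a PDF and return the concatenated plain text."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        # Step 1: PDF pages -> PNG images in a scratch directory.
        subprocess.run(rasterize_cmd(pdf, tmp / "page", dpi), check=True)
        pages = []
        # Step 2: OCR each image in numeric page order.
        for image in sorted(tmp.glob("page-*.png"), key=page_number):
            out_base = image.with_suffix("")
            subprocess.run(ocr_cmd(image, out_base, lang), check=True)
            pages.append(Path(f"{out_base}.txt").read_text(encoding="utf-8"))
        return "\n".join(pages)
```

Calling `pdf_to_text("scan.pdf")` (a hypothetical file name) would return the recognized text of all pages. The intermediate image step is needed because the `tesseract` command takes image files as input rather than PDFs.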

2013-10-17 Thread Eric Lease Morgan
On Oct 16, 2013, at 10:56 AM, Robert Haschart wrote: > The abstract extraction routine I have been working on does use > tesseract internally for doing OCR when it encounters a document that > doesn't have usable full-text. I agree that tesseract is not that easy > to install, especially if (