On Tue, Dec 02, 2008 at 02:22:27PM -0500, Chris Shenton wrote:
> Gary Kline <[EMAIL PROTECTED]> writes:
>
> > pdftotext fails on the large [32MB] file I've got. Is there any other
> > way I can translate this huge file to ASCII or HTML or text?
>
> I wrote some code using the Python PDF library 'pypdf' to split a
> multipage PDF scan into individual pages, then used the tesseract OCR
> engine to convert them to text. Not 100% accurate, of course, and it
> really got confused by pages that were not right-side-up, but not a bad
> start for pages that are really scans -- images -- rather than a PDF
> representation of text.
>
> Sadly, I haven't gotten it into a suitable state to release.
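
A minimal sketch of the split-then-OCR approach described above might look
like the following. It is not Chris's unreleased code. It assumes the
current pypdf API (PdfReader, PdfWriter), and it assumes poppler's pdftoppm
is installed to turn each single-page PDF into a PNG, since tesseract reads
images rather than PDFs. The helper names (page_pdf_name, rasterize_command,
ocr_command, split_pdf, ocr_pdf) are made up for illustration, and nothing
here corrects pages that are not right-side-up.

```python
import subprocess
from pathlib import Path


def page_pdf_name(stem: str, index: int) -> str:
    """File name for the single-page PDF holding page `index` (0-based)."""
    return f"{stem}-p{index + 1:04d}.pdf"


def rasterize_command(page_pdf: Path, dpi: int = 300) -> list:
    """Build a pdftoppm call that writes <page>.png next to <page>.pdf.

    Assumption: poppler's pdftoppm is on PATH. -singlefile stops it from
    appending a page number to the output name.
    """
    prefix = page_pdf.with_suffix("")
    return ["pdftoppm", "-r", str(dpi), "-png", "-singlefile",
            str(page_pdf), str(prefix)]


def ocr_command(image: Path) -> list:
    """Build a tesseract call; tesseract writes <output base>.txt."""
    return ["tesseract", str(image), str(image.with_suffix(""))]


def split_pdf(pdf_path: str, out_dir: str) -> list:
    """Write every page of pdf_path to its own PDF inside out_dir."""
    # Imported here so the command helpers work without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    pages = []
    for index, page in enumerate(PdfReader(pdf_path).pages):
        writer = PdfWriter()
        writer.add_page(page)
        target = out / page_pdf_name(stem, index)
        with open(target, "wb") as handle:
            writer.write(handle)
        pages.append(target)
    return pages


def ocr_pdf(pdf_path: str, out_dir: str) -> None:
    """Split, rasterize, and OCR one page at a time."""
    for page_pdf in split_pdf(pdf_path, out_dir):
        subprocess.run(rasterize_command(page_pdf), check=True)
        subprocess.run(ocr_command(page_pdf.with_suffix(".png")), check=True)
```

Calling ocr_pdf("scan.pdf", "pages") would leave one .pdf, one .png, and one
.txt per page in the pages directory. Working a page at a time also keeps
memory use modest on a large scan.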

Well, that sounds hopeful for when I scan around 200 pages of pre-1923
journal articles. These are in columnar form, IIRC.

It would be WONDERFUL if there were some kind of hardware to translate old
books and journals automagically.

gary
--
Gary Kline [EMAIL PROTECTED] http://www.thought.org Public Service Unix
http://jottings.thought.org http://transfinite.thought.org
Flash: The alpha release of Jottings is available:
http://jottings.thought.org/index.php