On Tue, Dec 02, 2008 at 02:22:27PM -0500, Chris Shenton wrote:
> Gary Kline <[EMAIL PROTECTED]> writes:
> 
> >     pdftotext fails on the large [32MB] file I've got.  Is there any
> >     other way I can translate this huge textfile to ascii or html or text?
> 
> I wrote some code using Python PDF library 'pypdf' to split a multipage
> PDF scan into individual pages, then used the tesseract OCR to convert
> to text.  It's not 100% accurate, of course, and it really got confused by
> pages that were not right-side-up, but it's not a bad start for pages that
> are really scans (images) rather than a PDF representation of text.
> 
> Sadly, I haven't gotten it into a suitable state to release. 
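
For anyone wanting to try the approach described above, here is a minimal
sketch, not the unreleased code mentioned in the quote.  It assumes a recent
'pypdf' release (with Pillow installed so that page.images works), that the
'tesseract' command-line tool is on the PATH, and that each scanned page
stores its scan as an embedded image.  The function names and the "pages"
output directory are made up for illustration.

```python
import subprocess
from pathlib import Path


def page_image_name(page_number, image_name):
    """Build a sortable file name such as 'page-0007-Im0.jpg'."""
    return f"page-{page_number:04d}-{Path(image_name).name}"


def extract_page_images(pdf_path, out_dir):
    """Write the images embedded in each page to out_dir, page by page."""
    # Third-party import kept local so the helper above works without it.
    from pypdf import PdfReader

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    reader = PdfReader(pdf_path)
    for number, page in enumerate(reader.pages, start=1):
        for image in page.images:
            target = out_dir / page_image_name(number, image.name)
            target.write_bytes(image.data)
            written.append(target)
    return written


def ocr_image(image_path):
    """Run the tesseract CLI on one image and return the recognized text."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_pdf(pdf_path, out_dir="pages"):
    """OCR every extracted page image; pages are joined with form feeds."""
    images = extract_page_images(pdf_path, out_dir)
    return "\n\f\n".join(ocr_image(path) for path in images)
```

Calling ocr_pdf("scan.pdf") (a hypothetical file name) would return the text
of all pages as one string.  Working one page at a time keeps memory use
modest on a large file, and pages that are upside-down would still need to be
rotated before OCR.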


        Well, that sounds hopeful for when I scan around 200 pages of
        pre-1923 journal articles.  These are in columnar form, IIRC.

        It would be WONDERFUL if there were some kind of hardware to
        translate old books and journals automagically.

        gary



-- 
 Gary Kline  [EMAIL PROTECTED]  http://www.thought.org  Public Service Unix
        http://jottings.thought.org   http://transfinite.thought.org
 Flash: The alpha release of Jottings is available: http://jottings.thought.org/index.php

_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
