Hi,

Just to update: I got in touch with Mr. PG and we have set up a workflow on a cloud server, and it's chugging along nicely.
What the program does is cool: it runs a Python library, ocrmypdf, in bulk mode. This description from their docs is what it's mainly doing:

"OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF files, allowing them to be searched."

I made some tweaks to PG's program and have put it on GitHub here:
https://github.com/answerquest/bulk_pdf_OCR/

I think it may be useful in other places too.

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in

On Tue, Nov 24, 2020 at 12:59 AM Anirudh K <[email protected]> wrote:

> Hi all,
>
> The Chief Electoral Officer - Karnataka has published a new version of
> Electoral Rolls. These are image based PDFs that have to be converted to
> text based PDFs.
>
> There is a need for additional compute resources to convert these large
> files. If anyone would like to help with this, the process would entail
> running a python script (already made) on Google Colab and sharing the
> output folder on Google Drive. A more technical description of the process
> is detailed below.
>
> Please reach out to [email protected] (or call PG Bhat - 9900141232) to
> help out with this project, or in case of any queries.
>
> The full process:
>
> 1. Create a shared folder on Drive called 'ERMS' and give edit access
>    to [email protected].
> 2. He will create 3 subfolders:
>    - *Code* - This will contain the script. There is no need for any
>      software to be installed locally.
>    - *Image files* - This houses the image files
>    - *Text files* - where the script will write the results
> 3. Run the script on Colab (free account). The text files can then be
>    downloaded from the drive folder
>
> Thank you for considering this request.
>
> Regards,
> Anirudh
>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/datameet/be9e4621-03a6-4e7e-8dfd-51ab93478b4en%40googlegroups.com
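
P.S. For anyone wondering what "ocrmypdf in bulk mode" amounts to, here is a minimal sketch of the idea. It is not the code in the repository linked above. The function names (bulk_ocr, ocr_one), the folder arguments and the choice of skip_text=True are my own assumptions for illustration; the only library call used is ocrmypdf.ocr() from OCRmyPDF's Python API.

```python
from pathlib import Path
from typing import Callable, List, Optional


def _ocr_with_ocrmypdf(src: Path, dst: Path) -> None:
    # Imported lazily so the loop below can be tried out without OCRmyPDF installed.
    import ocrmypdf

    # skip_text=True (an assumed setting) leaves pages that already have text untouched.
    ocrmypdf.ocr(src, dst, skip_text=True)


def bulk_ocr(
    input_dir: str,
    output_dir: str,
    ocr_one: Optional[Callable[[Path, Path], None]] = None,
) -> List[Path]:
    """OCR every PDF in input_dir into output_dir; return the files written this run."""
    ocr_one = ocr_one or _ocr_with_ocrmypdf
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for src in sorted(Path(input_dir).glob("*.pdf")):
        dst = out / src.name
        if dst.exists():
            continue  # converted on an earlier run, so the job can be resumed
        try:
            ocr_one(src, dst)
        except Exception as exc:
            # One bad file should not stop the whole batch.
            print(f"failed: {src.name}: {exc}")
            continue
        written.append(dst)
    return written
```

Calling bulk_ocr("Image files", "Text files") would then walk the input folder once. Skipping outputs that already exist means an interrupted run can simply be restarted.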
