Re: [gentoo-user] multi-region OCR
On Thu, 01 Dec 2016 10:51:20 +0100, Helmut Jarausch wrote:

> > There's an ebuild on bgo that I've kept updated to the latest release,
> > I've attached it. However, it uses tesseract as the OCR engine, so I
> > would expect similar results.
>
> the ebuild you've shared has a dependency on
> perl-gcpan/Linux-Distribution which I don't have
> in my Gentoo tree. Have you got a fix?

Oh yes, I added that after later releases started needing that module.
You need to install app-portage/g-cpan and then use that to add the
Linux-Distribution module to portage.

Or you can just install the module with gpan and remove the dependency
from the ebuild, but that's even more kludgy.

--
Neil Bothwick

God said, "div D = rho, div B = 0, curl E = - @B/@t, curl H = J + @D/@t,"
and there was light.
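The two steps Neil describes could be sketched roughly as below. This is only a dry-run helper that builds the command lines and prints them; it executes nothing. The function name gcpan_commands is made up for illustration, and the CPAN module name Linux::Distribution is an assumption inferred from the perl-gcpan/Linux-Distribution dependency. The -i (generate and install) and -g (generate the ebuild only) switches are taken from the g-cpan man page; check them against your installed version.

```python
import shlex


def gcpan_commands(module="Linux::Distribution", generate_only=False):
    """Return the two command lines as argument lists; nothing is run.

    The module name is an assumption based on the
    perl-gcpan/Linux-Distribution dependency named in the ebuild.
    """
    # Step 1: install the g-cpan tool itself from portage.
    install_tool = ["emerge", "--ask", "app-portage/g-cpan"]
    # Step 2: let g-cpan create an ebuild for the CPAN module.
    # -g only generates the ebuild; -i generates it and emerges it.
    flag = "-g" if generate_only else "-i"
    add_module = ["g-cpan", flag, module]
    return [install_tool, add_module]


if __name__ == "__main__":
    # Print the commands so they can be reviewed and run by hand as root.
    for cmd in gcpan_commands():
        print(shlex.join(cmd))
```

Printing rather than running keeps the sketch harmless on a machine that is not Gentoo; the commands are meant to be copied into a root shell once they look right.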
Re: [gentoo-user] multi-region OCR
On 11/30/2016 07:37:20 PM, Neil Bothwick wrote:
> On Wed, 30 Nov 2016 13:28:15 -0500, Michael Mol wrote:
>
> > The next tool that looked like it might work, gscan2pdf, wasn't in
> > portage, and with the semi-garbled output from tesseract suggesting the
> > scans were too poor quality, I didn't pursue further.
>
> There's an ebuild on bgo that I've kept updated to the latest release,
> I've attached it. However, it uses tesseract as the OCR engine, so I
> would expect similar results.
>
> --
> Neil Bothwick

Hi Neil,

the ebuild you've shared has a dependency on
perl-gcpan/Linux-Distribution which I don't have
in my Gentoo tree. Have you got a fix?

Thanks for this ebuild,
Helmut
Re: [gentoo-user] multi-region OCR
Did you train tesseract, perchance? And could I get some sample images?

Landis

On 11/30/2016 12:28 PM, Michael Mol wrote:
> On Wednesday, November 30, 2016 05:34:25 PM J. Roeleveld wrote:
> > On November 30, 2016 6:03:36 PM GMT+01:00, Michael Mol wrote:
> > > On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
> > > > On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> > > > > Michael Mol:
> > > > > ...
> > > > > > xsane would have let me do it during the scan process if I'd
> > > > > > thought of it then, but the scans are done, drives aren't
> > > > > > there any more. Something
> > > > > ...
> > > > > If xsane solves your need why don't you just print your scans
> > > > > so xsane can do its job?
> > > >
> > > > There has to be a way to do this without killing an entire
> > > > forest...
> > >
> > > And big chunks of ink cartridges. The scans stretched the contrast
> > > so I can clearly read the drive labels through the translucent
> > > anti-static bags, which means a huge chunk of the image (what's
> > > outside the labels) is pure black.
> > >
> > > Which I could get around by spending fifteen minutes munging things
> > > in the Gimp before printing, but at that point, I may as well just
> > > transcribe things manually at that point.
> > >
> > > Looking for something reasonably simple to improve the general
> > > workflow. I'd have hoped something would have already been
> > > available on Linux; it'd be easy enough to copy the scans to my
> > > phone and feed them through Google Goggles for the desired output,
> > > but then I'm deliberately filtering company data through an outside
> > > entity.
> >
> > Did you manage to use that link I sent?
>
> I did. tesseract almost worked, even separating the regions cleanly in
> its output, but it seems, sadly, that the 300dpi scans were insufficient
> to get a good read; lots of clear corruption of the text, so things like
> serial numbers, model numbers, version numbers--everything you'd care
> about--would be highly suspect.
>
> The next tool that looked like it might work, gscan2pdf, wasn't in
> portage, and with the semi-garbled output from tesseract suggesting the
> scans were too poor quality, I didn't pursue further.
Re: [gentoo-user] multi-region OCR
2016-11-30 16:28 GMT-02:00 Michael Mol:
> On Wednesday, November 30, 2016 05:34:25 PM J. Roeleveld wrote:
> > On November 30, 2016 6:03:36 PM GMT+01:00, Michael Mol
> > <mike...@gmail.com> wrote:
> > > On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
> > > > On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> > > > > Michael Mol:
> > > > > ...
> > > > > > xsane would have let me do it during the scan process if I'd
> > > > > > thought of it then, but the scans are done, drives aren't
> > > > > > there any more. Something
> > > > > ...
> > > > > If xsane solves your need why don't you just print your scans
> > > > > so xsane can do its job?
> > > >
> > > > There has to be a way to do this without killing an entire
> > > > forest...
> > >
> > > And big chunks of ink cartridges. The scans stretched the contrast
> > > so I can clearly read the drive labels through the translucent
> > > anti-static bags, which means a huge chunk of the image (what's
> > > outside the labels) is pure black.
> > >
> > > Which I could get around by spending fifteen minutes munging things
> > > in the Gimp before printing, but at that point, I may as well just
> > > transcribe things manually at that point.
> > >
> > > Looking for something reasonably simple to improve the general
> > > workflow. I'd have hoped something would have already been
> > > available on Linux; it'd be easy enough to copy the scans to my
> > > phone and feed them through Google Goggles for the desired output,
> > > but then I'm deliberately filtering company data through an outside
> > > entity.
> >
> > Did you manage to use that link I sent?
>
> I did. tesseract almost worked, even separating the regions cleanly in
> its output, but it seems, sadly, that the 300dpi scans were insufficient
> to get a good read; lots of clear corruption of the text, so things like
> serial numbers, model numbers, version numbers--everything you'd care
> about--would be highly suspect.
>
> The next tool that looked like it might work, gscan2pdf, wasn't in
> portage, and with the semi-garbled output from tesseract suggesting the
> scans were too poor quality, I didn't pursue further.
>
> --
> :wq

Well, I've had a similar issue. I used GIMP to scale the image to double
its size (width and height, of course), filtered it a bit (edge
enhancement) and split it into several images, one per region of
interest.

Of course, there might be an easier way ;-)

Francisco
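Francisco's GIMP steps (double the size, enhance edges, split into regions) could probably be scripted along the following lines. This is a rough sketch, not what he actually ran: it assumes the third-party Pillow library is installed, the function names are invented for illustration, and the region boxes are whatever pixel coordinates you read off the original scan. Converting to grayscale first is an added assumption so that the edge filter applies to any input mode.

```python
def scale_box(box, factor):
    """Scale a (left, top, right, bottom) pixel box by an integer factor."""
    left, top, right, bottom = box
    return (left * factor, top * factor, right * factor, bottom * factor)


def prepare_regions(path, boxes, factor=2):
    """Upscale a scan, enhance edges, and return one cropped image per box.

    Boxes are given in the coordinates of the ORIGINAL scan; they are
    scaled to match the enlarged image before cropping.
    """
    # Pillow is a third-party dependency (an assumption of this sketch),
    # imported here so the pure helper above works without it.
    from PIL import Image, ImageFilter

    with Image.open(path) as img:
        gray = img.convert("L")  # grayscale, so the filter accepts it
        big = gray.resize(
            (gray.width * factor, gray.height * factor), Image.LANCZOS
        )
    sharp = big.filter(ImageFilter.EDGE_ENHANCE)
    return [sharp.crop(scale_box(box, factor)) for box in boxes]


if __name__ == "__main__":
    # Hypothetical file name and region; replace with real values.
    regions = prepare_regions("scan1.png", [(100, 200, 900, 600)])
    for i, region in enumerate(regions):
        region.save(f"scan1-region{i}.png")
```

Each saved region can then be fed to tesseract on its own, which sidesteps the large black areas around the labels.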
Re: [gentoo-user] multi-region OCR
On Wed, 30 Nov 2016 13:28:15 -0500, Michael Mol wrote:

> The next tool that looked like it might work, gscan2pdf, wasn't in
> portage, and with the semi-garbled output from tesseract suggesting the
> scans were too poor quality, I didn't pursue further.

There's an ebuild on bgo that I've kept updated to the latest release,
I've attached it. However, it uses tesseract as the OCR engine, so I
would expect similar results.

--
Neil Bothwick

Do Roman paramedics refer to IV's as "4's"?

gscan2pdf-1.5.5.ebuild
Description: Binary data
Re: [gentoo-user] multi-region OCR
On Wednesday, November 30, 2016 05:34:25 PM J. Roeleveld wrote:
> On November 30, 2016 6:03:36 PM GMT+01:00, Michael Mol wrote:
> > On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
> > > On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> > > > Michael Mol:
> > > > ...
> > > > > xsane would have let me do it during the scan process if I'd
> > > > > thought of it then, but the scans are done, drives aren't
> > > > > there any more. Something
> > > > ...
> > > > If xsane solves your need why don't you just print your scans
> > > > so xsane can do its job?
> > >
> > > There has to be a way to do this without killing an entire
> > > forest...
> >
> > And big chunks of ink cartridges. The scans stretched the contrast
> > so I can clearly read the drive labels through the translucent
> > anti-static bags, which means a huge chunk of the image (what's
> > outside the labels) is pure black.
> >
> > Which I could get around by spending fifteen minutes munging things
> > in the Gimp before printing, but at that point, I may as well just
> > transcribe things manually at that point.
> >
> > Looking for something reasonably simple to improve the general
> > workflow. I'd have hoped something would have already been
> > available on Linux; it'd be easy enough to copy the scans to my
> > phone and feed them through Google Goggles for the desired output,
> > but then I'm deliberately filtering company data through an outside
> > entity.
>
> Did you manage to use that link I sent?

I did. tesseract almost worked, even separating the regions cleanly in
its output, but it seems, sadly, that the 300dpi scans were insufficient
to get a good read; lots of clear corruption of the text, so things like
serial numbers, model numbers, version numbers--everything you'd care
about--would be highly suspect.

The next tool that looked like it might work, gscan2pdf, wasn't in
portage, and with the semi-garbled output from tesseract suggesting the
scans were too poor quality, I didn't pursue further.

--
:wq
Re: [gentoo-user] multi-region OCR
On November 30, 2016 6:03:36 PM GMT+01:00, Michael Mol wrote:
> On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
> > On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> > > Michael Mol:
> > > ...
> > > > xsane would have let me do it during the scan process if I'd
> > > > thought of it then, but the scans are done, drives aren't
> > > > there any more. Something
> > > ...
> > > If xsane solves your need why don't you just print your scans
> > > so xsane can do its job?
> >
> > There has to be a way to do this without killing an entire
> > forest...
>
> And big chunks of ink cartridges. The scans stretched the contrast
> so I can clearly read the drive labels through the translucent
> anti-static bags, which means a huge chunk of the image (what's
> outside the labels) is pure black.
>
> Which I could get around by spending fifteen minutes munging things
> in the Gimp before printing, but at that point, I may as well just
> transcribe things manually at that point.
>
> Looking for something reasonably simple to improve the general
> workflow. I'd have hoped something would have already been
> available on Linux; it'd be easy enough to copy the scans to my
> phone and feed them through Google Goggles for the desired output,
> but then I'm deliberately filtering company data through an outside
> entity.

Did you manage to use that link I sent?

--
Joost
Re: [gentoo-user] multi-region OCR
On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
> On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> > Michael Mol:
> > ...
> > > xsane would have let me do it during the scan process if I'd
> > > thought of it then, but the scans are done, drives aren't
> > > there any more. Something
> > ...
> > If xsane solves your need why don't you just print your scans
> > so xsane can do its job?
>
> There has to be a way to do this without killing an entire forest...

And big chunks of ink cartridges. The scans stretched the contrast so I
can clearly read the drive labels through the translucent anti-static
bags, which means a huge chunk of the image (what's outside the labels)
is pure black.

Which I could get around by spending fifteen minutes munging things in
the Gimp before printing, but at that point, I may as well just
transcribe things manually at that point.

Looking for something reasonably simple to improve the general workflow.
I'd have hoped something would have already been available on Linux;
it'd be easy enough to copy the scans to my phone and feed them through
Google Goggles for the desired output, but then I'm deliberately
filtering company data through an outside entity.

--
:wq
Re: [gentoo-user] multi-region OCR
On Tuesday, November 29, 2016 01:33:48 PM Michael Mol wrote:
> So, I've got scans of a half dozen new hard drives, and I've got scans of
> their labels. One image has two drives, the other has four.
>
> Rather than manually transcribing the label contents into my intake ticket,
> I'd like to select a region of each image and OCR it. (Darn, it'd be handy
> if they put all this metadata into a QR code...)
>
> What tools exist to let me do this? Keep in mind, I've got multiple regions
> I need to OCR, and the regions aren't going to be consistent across images.
>
> xsane would have let me do it during the scan process if I'd thought of it
> then, but the scans are done, drives aren't there any more. Something
> reasonably similar would be nice. Okular is reputed to have some OCR
> capability, but I can't find it. Dolphin is supposed to be able to do it if
> you have tesseract installed (I do), but I can't find the service to
> enable. I could use some pointers...

Quick search:
https://help.ubuntu.com/community/OCR

This contains some example scripts for several OCR tools.

--
Joost

PS. I used a similar approach once to fix a PDF from an HR department to
enable searching. They typed a document in MS Word, printed it, then
scanned it into a PDF...
Merging the PDF with the OCR results was quite nice as well.
Re: [gentoo-user] multi-region OCR
On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> Michael Mol:
> ...
> > xsane would have let me do it during the scan process if I'd thought of it
> > then, but the scans are done, drives aren't there any more. Something
> ...
> If xsane solves your need why don't you just print your scans so xsane
> can do its job?

There has to be a way to do this without killing an entire forest...

--
Joost
Re: [gentoo-user] multi-region OCR
Michael Mol:
...
> xsane would have let me do it during the scan process if I'd thought of it
> then, but the scans are done, drives aren't there any more. Something
...

If xsane solves your need why don't you just print your scans so xsane
can do its job?

Regards,
/Karl Hammar

---
Aspö Data
Lilla Aspö 148
S-742 94 Östhammar
Sweden
+46 173 140 57
[gentoo-user] multi-region OCR
So, I've got scans of a half dozen new hard drives, and I've got scans of
their labels. One image has two drives, the other has four.

Rather than manually transcribing the label contents into my intake ticket,
I'd like to select a region of each image and OCR it. (Darn, it'd be handy
if they put all this metadata into a QR code...)

What tools exist to let me do this? Keep in mind, I've got multiple regions
I need to OCR, and the regions aren't going to be consistent across images.

xsane would have let me do it during the scan process if I'd thought of it
then, but the scans are done, drives aren't there any more. Something
reasonably similar would be nice. Okular is reputed to have some OCR
capability, but I can't find it. Dolphin is supposed to be able to do it if
you have tesseract installed (I do), but I can't find the service to
enable. I could use some pointers...

--
:wq
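For reference, the workflow asked about here (several hand-picked regions per image, each OCRed separately) could be scripted roughly as follows. This is a sketch under stated assumptions, not a tool anyone in the thread used: it assumes ImageMagick's legacy convert command and the tesseract command-line tool are installed, the helper names and output file names are invented for illustration, and the region coordinates still have to be read off each image by hand, for example from an image viewer.

```python
import subprocess


def geometry(left, top, width, height):
    """Build an ImageMagick geometry string of the form WxH+X+Y."""
    return f"{width}x{height}+{left}+{top}"


def ocr_commands(image, regions):
    """Yield (crop_cmd, ocr_cmd) argument lists, one pair per region.

    regions is a list of (left, top, width, height) tuples in pixels.
    The output file names are hypothetical.
    """
    for i, (left, top, width, height) in enumerate(regions):
        out = f"{image}.region{i}.png"
        crop = ["convert", image, "-crop",
                geometry(left, top, width, height), "+repage", out]
        # "stdout" as the output base makes tesseract print the text.
        ocr = ["tesseract", out, "stdout"]
        yield crop, ocr


def run(image, regions):
    """Crop each region, OCR it, and return the recognised text per region."""
    texts = []
    for crop, ocr in ocr_commands(image, regions):
        subprocess.run(crop, check=True)
        result = subprocess.run(ocr, check=True,
                                capture_output=True, text=True)
        texts.append(result.stdout)
    return texts


if __name__ == "__main__":
    # Hypothetical scan and label positions; replace with real values.
    for text in run("drives1.png", [(120, 80, 900, 400), (120, 700, 900, 400)]):
        print(text)
```

Keeping the regions as plain tuples per image suits the case described above, where the label positions differ from one scan to the next.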