Re: [gentoo-user] multi-region OCR

2016-12-01 Thread Neil Bothwick
On Thu, 01 Dec 2016 10:51:20 +0100, Helmut Jarausch wrote:

> > There's an ebuild on bgo that I've kept updated to the latest release,
> > I've attached it. However, it uses tesseract as the OCR engine, so I
> > would expect similar results.

> the ebuild you've shared has a dependency on
> perl-gcpan/Linux-Distribution which I don't have
> in my Gentoo tree. Have you got a fix?

Oh yes, I added that after later releases started needing that module.
You need to install app-portage/g-cpan and then use that to add the
Linux-Distribution module to portage.

Or you can just install the module with cpan and remove the dependency
from the ebuild, but that's even more kludgy.


-- 
Neil Bothwick

God said, "div D = rho, div B = 0, curl E = - @B/@t, curl H = J + @D/@t,"
and there was light.




Re: [gentoo-user] multi-region OCR

2016-12-01 Thread Helmut Jarausch

On 11/30/2016 07:37:20 PM, Neil Bothwick wrote:
> On Wed, 30 Nov 2016 13:28:15 -0500, Michael Mol wrote:
>
> > The next tool that looked like it might work, gscan2pdf, wasn't in
> > portage, and with the semi-garbled output from tesseract suggesting the
> > scans were too poor quality, I didn't pursue further.
>
> There's an ebuild on bgo that I've kept updated to the latest release,
> I've attached it. However, it uses tesseract as the OCR engine, so I
> would expect similar results.
>
> --
> Neil Bothwick

Hi Neil,
the ebuild you've shared has a dependency on
perl-gcpan/Linux-Distribution which I don't have
in my Gentoo tree. Have you got a fix?

Thanks for this ebuild,
Helmut




Re: [gentoo-user] multi-region OCR

2016-11-30 Thread Landis Blackwell

Did you train tesseract, perchance? And could I get some sample images?

Landis


On 11/30/2016 12:28 PM, Michael Mol wrote:
> On Wednesday, November 30, 2016 05:34:25 PM J. Roeleveld wrote:
> > On November 30, 2016 6:03:36 PM GMT+01:00, Michael Mol wrote:
> > > On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
> > > > On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> > > > > Michael Mol:
> > > > > ...
> > > > >
> > > > > > xsane would have let me do it during the scan process if I'd
> > > > > > thought of it then, but the scans are done, drives aren't
> > > > > > there any more. Something
> > > > >
> > > > > ...
> > > > >
> > > > > If xsane solves your need why don't you just print your scans
> > > > > so xsane can do its job ?
> > > >
> > > > There has to be a way to do this without killing an entire
> > > > forest...
> > >
> > > And big chunks of ink cartridges. The scans stretched the contrast
> > > so I can clearly read the drive labels through the translucent
> > > anti-static bags, which means a huge chunk of the image (what's
> > > outside the labels) is pure black.
> > >
> > > Which I could get around by spending fifteen minutes munging things
> > > in the Gimp before printing, but at that point, I may as well just
> > > transcribe things manually at that point.
> > >
> > > Looking for something reasonably simple to improve the general
> > > workflow. I'd have hoped something would have already been
> > > available on Linux; it'd be easy enough to copy the scans to my
> > > phone and feed them through Google Goggles for the desired output,
> > > but then I'm deliberately filtering company data through an outside
> > > entity.
> >
> > Did you manage to use that link I sent?
>
> I did. tesseract almost worked, even separating the regions cleanly in its
> output, but it seems, sadly, that the 300dpi scans were insufficient to get a
> good read; lots of clear corruption of the text, so things like serial
> numbers, model numbers, version numbers--everything you'd care about--would be
> highly suspect.
>
> The next tool that looked like it might work, gscan2pdf, wasn't in portage,
> and with the semi-garbled output from tesseract suggesting the scans were too
> poor quality, I didn't pursue further.






Re: [gentoo-user] multi-region OCR

2016-11-30 Thread Francisco Ares
2016-11-30 16:28 GMT-02:00 Michael Mol:

> On Wednesday, November 30, 2016 05:34:25 PM J. Roeleveld wrote:
> > On November 30, 2016 6:03:36 PM GMT+01:00, Michael Mol <mike...@gmail.com> wrote:
> > > On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
> > > > On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> > > > > Michael Mol:
> > > > > ...
> > > > >
> > > > > > xsane would have let me do it during the scan process if I'd
> > > > > > thought of it then, but the scans are done, drives aren't
> > > > > > there any more. Something
> > > > >
> > > > > ...
> > > > >
> > > > > If xsane solves your need why don't you just print your scans
> > > > > so xsane can do its job ?
> > > >
> > > > There has to be a way to do this without killing an entire
> > > > forest...
> > >
> > > And big chunks of ink cartridges. The scans stretched the contrast
> > > so I can clearly read the drive labels through the translucent
> > > anti-static bags, which means a huge chunk of the image (what's
> > > outside the labels) is pure black.
> > >
> > > Which I could get around by spending fifteen minutes munging things
> > > in the Gimp before printing, but at that point, I may as well just
> > > transcribe things manually at that point.
> > >
> > > Looking for something reasonably simple to improve the general
> > > workflow. I'd have hoped something would have already been
> > > available on Linux; it'd be easy enough to copy the scans to my
> > > phone and feed them through Google Goggles for the desired output,
> > > but then I'm deliberately filtering company data through an outside
> > > entity.
> >
> > Did you manage to use that link I sent?
>
> I did. tesseract almost worked, even separating the regions cleanly in its
> output, but it seems, sadly, that the 300dpi scans were insufficient to
> get a good read; lots of clear corruption of the text, so things like
> serial numbers, model numbers, version numbers--everything you'd care
> about--would be highly suspect.
>
> The next tool that looked like it might work, gscan2pdf, wasn't in portage,
> and with the semi-garbled output from tesseract suggesting the scans were
> too poor quality, I didn't pursue further.
>
> --
> :wq


Well, I've had a similar issue. I used GIMP to resize the image to double
its size (both width and height), filtered it a bit (edge enhancement), and
split the image into several smaller ones, one per region of interest.

Of course, there might be an easier way ;-)

Francisco
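
A minimal sketch of the preprocessing described above (upscale to double
size, then sharpen), written as a command builder around ImageMagick instead
of GIMP. It assumes ImageMagick's `convert` is on the PATH; the function
names, the file names, and the `0x1` unsharp setting are illustrative
assumptions, not values from this thread, and whether they help a given scan
has to be checked by eye.

```python
import subprocess


def enhance_command(src, dst, scale_percent=200, sharpen="0x1"):
    """Build an ImageMagick command that upscales and sharpens a scan.

    -resize takes a percentage, -unsharp takes a radius x sigma geometry.
    The defaults here are assumptions to be tuned per scan.
    """
    return [
        "convert", src,
        "-resize", f"{scale_percent}%",
        "-unsharp", sharpen,
        dst,
    ]


def enhance(src, dst, **options):
    """Run the command; raises CalledProcessError if convert fails."""
    subprocess.run(enhance_command(src, dst, **options), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(enhance_command("scan.png", "scan-2x.png")))
```

Keeping the command construction separate from the call that runs it makes
it easy to try different scale and sharpen values before OCRing the result.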


Re: [gentoo-user] multi-region OCR

2016-11-30 Thread Neil Bothwick
On Wed, 30 Nov 2016 13:28:15 -0500, Michael Mol wrote:

> The next tool that looked like it might work, gscan2pdf, wasn't in
> portage, and with the semi-garbled output from tesseract suggesting the
> scans were too poor quality, I didn't pursue further.

There's an ebuild on bgo that I've kept updated to the latest release,
I've attached it. However, it uses tesseract as the OCR engine, so I
would expect similar results.


-- 
Neil Bothwick

Do Roman paramedics refer to IV's as "4's"?


gscan2pdf-1.5.5.ebuild
Description: Binary data




Re: [gentoo-user] multi-region OCR

2016-11-30 Thread Michael Mol
On Wednesday, November 30, 2016 05:34:25 PM J. Roeleveld wrote:
> On November 30, 2016 6:03:36 PM GMT+01:00, Michael Mol wrote:
> > On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
> > > On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> > > > Michael Mol:
> > > > ...
> > > >
> > > > > xsane would have let me do it during the scan process if I'd
> > > > > thought of it then, but the scans are done, drives aren't there
> > > > > any more. Something
> > > >
> > > > ...
> > > >
> > > > If xsane solves your need why don't you just print your scans so
> > > > xsane can do its job ?
> > >
> > > There has to be a way to do this without killing an entire forest...
> >
> > And big chunks of ink cartridges. The scans stretched the contrast so I
> > can clearly read the drive labels through the translucent anti-static
> > bags, which means a huge chunk of the image (what's outside the labels)
> > is pure black.
> >
> > Which I could get around by spending fifteen minutes munging things in
> > the Gimp before printing, but at that point, I may as well just
> > transcribe things manually at that point.
> >
> > Looking for something reasonably simple to improve the general
> > workflow. I'd have hoped something would have already been available on
> > Linux; it'd be easy enough to copy the scans to my phone and feed them
> > through Google Goggles for the desired output, but then I'm deliberately
> > filtering company data through an outside entity.
>
> Did you manage to use that link I sent?

I did. tesseract almost worked, even separating the regions cleanly in its 
output, but it seems, sadly, that the 300dpi scans were insufficient to get a 
good read; lots of clear corruption of the text, so things like serial 
numbers, model numbers, version numbers--everything you'd care about--would be 
highly suspect.

The next tool that looked like it might work, gscan2pdf, wasn't in portage, 
and with the semi-garbled output from tesseract suggesting the scans were too 
poor quality, I didn't pursue further.

-- 
:wq



Re: [gentoo-user] multi-region OCR

2016-11-30 Thread J. Roeleveld
On November 30, 2016 6:03:36 PM GMT+01:00, Michael Mol wrote:
> On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
> > On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> > > Michael Mol:
> > > ...
> > >
> > > > xsane would have let me do it during the scan process if I'd
> > > > thought of it then, but the scans are done, drives aren't there
> > > > any more. Something
> > >
> > > ...
> > >
> > > If xsane solves your need why don't you just print your scans so
> > > xsane can do its job ?
> >
> > There has to be a way to do this without killing an entire forest...
>
> And big chunks of ink cartridges. The scans stretched the contrast so I
> can clearly read the drive labels through the translucent anti-static
> bags, which means a huge chunk of the image (what's outside the labels)
> is pure black.
>
> Which I could get around by spending fifteen minutes munging things in
> the Gimp before printing, but at that point, I may as well just
> transcribe things manually at that point.
>
> Looking for something reasonably simple to improve the general
> workflow. I'd have hoped something would have already been available on
> Linux; it'd be easy enough to copy the scans to my phone and feed them
> through Google Goggles for the desired output, but then I'm deliberately
> filtering company data through an outside entity.

Did you manage to use that link I sent?

--
Joost



Re: [gentoo-user] multi-region OCR

2016-11-30 Thread Michael Mol
On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
> On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> > Michael Mol:
> > ...
> > 
> > > xsane would have let me do it during the scan process if I'd thought of
> > > it then, but the scans are done, drives aren't there any more. Something
> > 
> > ...
> > 
> > If xsane solves your need why don't you just print your scans so xsane
> > can do its job ?
> 
> There has to be a way to do this without killing an entire forest...

And big chunks of ink cartridges. The scans stretched the contrast so I can 
clearly read the drive labels through the translucent anti-static bags, which 
means a huge chunk of the image (what's outside the labels) is pure black.

Which I could get around by spending fifteen minutes munging things in the Gimp 
before printing, but at that point, I may as well just transcribe things 
manually at that point.

Looking for something reasonably simple to improve the general workflow. I'd 
have hoped something would have already been available on Linux; it'd be easy 
enough to copy the scans to my phone and feed them through Google Goggles for 
the desired output, but then I'm deliberately filtering company data through an 
outside entity.

-- 
:wq



Re: [gentoo-user] multi-region OCR

2016-11-30 Thread J. Roeleveld
On Tuesday, November 29, 2016 01:33:48 PM Michael Mol wrote:
> So, I've got scans of a half dozen new hard drives, and I've got scans of
> their labels. One image has two drives, the other has four.
> 
> Rather than manually transcribing the label contents into my intake ticket,
> I'd like to select a region of each image and OCR it. (Darn, it'd be handy
> if they put all this metadata into a QR code...)
> 
> What tools exist to let me do this? Keep in mind, I've got multiple regions
> I need to OCR, and the regions aren't going to be consistent across images.
> 
> xsane would have let me do it during the scan process if I'd thought of it
> then, but the scans are done, drives aren't there any more. Something
> reasonably similar would be nice. Okular is reputed to have some OCR
> capability, but I can't find it. Dolphin is supposed to be able to do it if
> you have tesseract installed (I do), but I can't find the service to
> enable. I could use some pointers...

Quick search:

https://help.ubuntu.com/community/OCR

This contains some example-scripts for several OCR tools.

--
Joost

PS. I used a similar approach once to fix a PDF from an HR-department to enable 
searching. They typed a document in MS Word, printed it, then scanned it into 
a PDF... Merging the PDF with the OCR-results was quite nice as well



Re: [gentoo-user] multi-region OCR

2016-11-30 Thread J. Roeleveld
On Tuesday, November 29, 2016 11:18:36 PM k...@aspodata.se wrote:
> Michael Mol:
> ...
> 
> > xsane would have let me do it during the scan process if I'd thought of it
> > then, but the scans are done, drives aren't there any more. Something
> 
> ...
> 
> If xsane solves your need why don't you just print your scans so xsane
> can do its job ?

There has to be a way to do this without killing an entire forest...

--
Joost



Re: [gentoo-user] multi-region OCR

2016-11-29 Thread karl
Michael Mol:
...
> xsane would have let me do it during the scan process if I'd thought of it 
> then, but the scans are done, drives aren't there any more. Something 
...

If xsane solves your need why don't you just print your scans so xsane 
can do its job ?

Regards,
/Karl Hammar

---
Aspö Data
Lilla Aspö 148
S-742 94 Östhammar
Sweden
+46 173 140 57





[gentoo-user] multi-region OCR

2016-11-29 Thread Michael Mol
So, I've got scans of a half dozen new hard drives, and I've got scans of 
their labels. One image has two drives, the other has four.

Rather than manually transcribing the label contents into my intake ticket, 
I'd like to select a region of each image and OCR it. (Darn, it'd be handy if 
they put all this metadata into a QR code...)

What tools exist to let me do this? Keep in mind, I've got multiple regions I 
need to OCR, and the regions aren't going to be consistent across images. 

xsane would have let me do it during the scan process if I'd thought of it 
then, but the scans are done, drives aren't there any more. Something 
reasonably similar would be nice. Okular is reputed to have some OCR 
capability, but I can't find it. Dolphin is supposed to be able to do it if you 
have tesseract installed (I do), but I can't find the service to enable. I 
could use some pointers...


-- 
:wq
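
One way to handle several regions per image, with different regions for
each image, is to keep a small table of rectangles and OCR each one in turn.
The sketch below is only an illustration under stated assumptions: it expects
ImageMagick's `convert` and the `tesseract` command-line tool on the PATH, and
the file name, labels and pixel coordinates are made up (they would be read
off in an image viewer). Nobody in this thread used this script.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import NamedTuple


class Region(NamedTuple):
    """A labelled rectangle in pixel coordinates (hypothetical values below)."""
    label: str
    x: int
    y: int
    width: int
    height: int

    def geometry(self) -> str:
        # ImageMagick crop geometry: WIDTHxHEIGHT+XOFFSET+YOFFSET
        return f"{self.width}x{self.height}+{self.x}+{self.y}"


# Regions differ per image, so they are listed per file. Made-up numbers.
REGIONS = {
    "scan1.png": [
        Region("drive-a", 120, 340, 900, 600),
        Region("drive-b", 1300, 340, 900, 600),
    ],
}


def crop_command(image, region, out):
    """Command that cuts one region out of a scan into its own file."""
    return ["convert", image, "-crop", region.geometry(), "+repage", out]


def ocr_command(image):
    """Command that makes tesseract print the recognised text to stdout."""
    return ["tesseract", image, "stdout"]


def ocr_regions(regions_by_image):
    """Crop and OCR every region; returns {(image, label): text}."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for image, regions in regions_by_image.items():
            for region in regions:
                cropped = str(Path(tmp) / f"{region.label}.png")
                subprocess.run(crop_command(image, region, cropped), check=True)
                done = subprocess.run(ocr_command(cropped), check=True,
                                      capture_output=True, text=True)
                results[(image, region.label)] = done.stdout
    return results
```

Calling `ocr_regions(REGIONS)` would return the text per label, ready to
paste into a ticket. Serial and model numbers should still be checked
against the scan, since OCR errors in them are easy to miss.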
