Re: [linux-l] Optical Character Recognition (OCR)

lesman Tue, 17 Feb 2009 10:36:10 -0800

lesman wrote:
> lesman wrote:
>> Ernesto Escobar wrote:
>>> Lesman:
> 
>> Google once again
>> Forbidden
>> Your client does not have permission to get URL /p/tesseract-ocr/ from 
>> this server.
>>
>> This seems to be the best option:
>>
>> Tesseract OCR
>> From Wikipedia, the free encyclopedia
>>
>> An OCR engine created by Hewlett Packard Laboratories between 1985
>> and 1995. In 1995 it placed third among the best-performing engines
>> in the evaluation run by UNLV (University of Nevada, Las Vegas).
>>
>> Google uses it to search the text of books that are no longer under
>> copyright, and so offer users a solution.
>>
>> Released in 2005 under an open source license.
>>
>>
>> Limitations
>>
>> Only recognizes English
>>
>>
>> External links
>>
>> SourceForge page [1]
> 
> although in Ubuntu Hardy this shows up:
> 
> tesseract-ocr language files for Spanish text
> A commercial quality OCR engine originally developed at HP between 1985
> and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It
> was open-sourced by HP and UNLV in 2005. This package contains the data
> needed for processing images in Spanish.
> 
> Homepage: http://code.google.com/p/tesseract-ocr/
> 
> All that is left is to test and compare...
> ______________________________________



and finally, a howto:

Optical Character Recognition With Tesseract OCR On Ubuntu 7.04
Submitted by o.meyer on Tue, 2007-08-28 16:56.

Version 1.0
Author: Oliver Meyer <o [dot] meyer [at] projektfarm [dot] de>
Last edited 08/23/2007

This document describes how to set up Tesseract OCR on Ubuntu 7.04. OCR 
means "Optical Character Recognition". The resulting system will be able 
to convert images with embedded text to text files. Tesseract is 
licensed under the Apache License v2.0.

This howto is meant as a practical guide; it does not cover the 
theoretical background, which is treated in many other documents on 
the web.

This document comes without warranty of any kind! This is not the only 
way of setting up such a system. There are many ways of achieving this 
goal, but this is the one I use. I do not guarantee that this will 
work for you!


1 Preparation

Set up a basic Ubuntu 7.04 system and update it.

Get scanned images or scan documents yourself.

If you use a scanner, be sure that it is supported by SANE. A list of 
supported devices is available at http://www.sane-project.org/.


2 Get Imagemagick

The current version of tesseract provided in the Ubuntu repositories 
supports only uncompressed and G3-compressed tiff files.

To ensure that tesseract is able to process your images, you should 
convert them to uncompressed tiff.

Since conversions with Gimp to uncompressed tiff were unusable, I used 
the convert tool, which is supplied by the Imagemagick package.

Install Imagemagick from the Ubuntu repositories with the Synaptic 
Package Manager.


3 Get Tesseract

Install the packages tesseract-ocr and tesseract-ocr-data from the 
Ubuntu repositories with the Synaptic Package Manager.
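
For those who prefer a terminal to Synaptic, steps 2 and 3 can probably 
be done with apt-get as well. This is only a sketch: it assumes the 
package names mentioned above exist in your Ubuntu release, and 
missing_tools is a hypothetical helper name used here for illustration, 
not something shipped by either package.

```shell
# Install from the command line instead of Synaptic (needs root):
#   sudo apt-get update
#   sudo apt-get install imagemagick tesseract-ocr tesseract-ocr-data

# Hypothetical helper: print each named tool that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Prints nothing if both programs are installed and on PATH.
missing_tools convert tesseract
```

If the last line prints a name, that program is not installed (or not 
on PATH) and the later steps will fail.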


4 Prepare Images

To get the best results from tesseract, you have to optimize the images. 
I recommend the use of images with a minimum resolution of about 200dpi.

I used Gimp for the following steps 4.1 - 4.3.


4.1 Cleaning

Remove any non-alphanumeric content from the image to prevent tesseract 
from producing chaotic text blocks.

That can be done easily with the erase-tool within Gimp.


4.2 Threshold

Convert the image to RGB or Grayscale mode.

Within Gimp:

Image - Mode - RGB or Grayscale

Use the threshold function to compensate for uneven lighting and remove 
specks. Move the sliders to set the boundary between bright and dark 
areas, and watch the preview while you do this to see the effect on 
the image.

Within Gimp:

Tools - Color Tools - Threshold


4.3 Black And White

To improve the text recognition, we reduce the colors to black and white 
by switching the image to indexed mode.

Within Gimp:

Image - Mode - Indexed

Be sure to turn off dithering.

Save the image after this step.
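
For many pages, doing this by hand in Gimp gets tedious. Steps 4.2 and 
4.3 can be approximated on the command line with the convert tool from 
Imagemagick. The following is a rough sketch, not the author's method: 
binarize and the CONVERT variable are hypothetical names, and the 50% 
threshold is only a starting value that you would have to tune per 
scan. It does not replace the manual cleaning from step 4.1.

```shell
# Hypothetical helper: grayscale + threshold in one pass.
#   $1 = input image, $2 = output image, $3 = threshold (default 50%)
# Pixels brighter than the threshold become white, the rest black.
# Set CONVERT to point at a different Imagemagick binary if needed.
binarize() {
    "${CONVERT:-convert}" "$1" -colorspace Gray -threshold "${3:-50%}" "$2"
}

# Usage:
#   binarize scan.jpg scan-bw.png        # default 50% threshold
#   binarize scan.jpg scan-bw.png 60%    # turns more of the page black
```

Unlike the Gimp dialog there is no preview, so it is worth opening the 
result in an image viewer before running it over a whole batch.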


5 Convert To Tiff

Now you have to convert the image to uncompressed tiff.

convert %source_file% %destination_file%

e.g.:

convert document.jpg document.tif
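
The plain call above leaves the choice of tiff compression to 
Imagemagick. If tesseract refuses the resulting file, it may help to 
request uncompressed output explicitly with the -compress none option. 
A small sketch under that assumption; tiff_name, to_tiff and the 
CONVERT variable are hypothetical names chosen for this example.

```shell
# Hypothetical helper: derive the .tif name from an image file name
# by replacing the last extension (document.jpg -> document.tif).
tiff_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Convert one image, explicitly asking for no compression.
to_tiff() {
    "${CONVERT:-convert}" "$1" -compress none "$(tiff_name "$1")"
}

# Usage:
#   to_tiff document.jpg    # writes document.tif next to the input
```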


6 Use Tesseract

At this point all preparations are completed, so you can start using 
tesseract.

tesseract %tiff_file% %name_for_resulting_files%

e.g.:

tesseract document.tif result

Tesseract adds the file extensions for the resulting files itself. In 
this example tesseract would create result.txt, result.map and result.raw.
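
To process a whole directory, or to try Spanish text, the call can be 
wrapped in a loop. This is a sketch only: ocr_dir and the TESSERACT 
variable are hypothetical names, and -l spa assumes the Spanish 
language data is installed (on Ubuntu probably the tesseract-ocr-spa 
package, which is an assumption). Without -l, tesseract uses English.

```shell
# Hypothetical helper: OCR every .tif file in a directory.
#   $1 = directory, $2 = language code (default: eng)
# Each result is written next to its image, e.g. page1.tif -> page1.txt
ocr_dir() {
    dir=$1
    lang=${2:-eng}
    for tif in "$dir"/*.tif; do
        [ -e "$tif" ] || continue    # no .tif files: pattern stays literal
        "${TESSERACT:-tesseract}" "$tif" "${tif%.tif}" -l "$lang"
    done
}

# Usage:
#   ocr_dir ./scans spa
```

Running the same pages once with eng and once with spa is an easy way 
to compare how much the language data helps on accented characters.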


Links

     * Tesseract: http://sourceforge.net/projects/tesseract-ocr
     * Sane: http://www.sane-project.org/
     * Ubuntu: http://www.ubuntu.com/


Let me know afterwards what results you get with Spanish text, so I can 
see whether I can offer the complete solution for libraries: Koha + 
tesseract-ocr

Regards
Lesman
_______________________________________________
Unsubscribe
https://listas.softwarelibre.cu/mailman/listinfo/linux-l
Search the archive
http://listas.softwarelibre.cu/buscar/linux-l
