lesman wrote:
> lesman wrote:
>> Ernesto Escobar wrote:
>>> Lesman:
>
>> Google once again:
>> Forbidden
>> Your client does not have permission to get URL /p/tesseract-ocr/ from
>> this server.
>>
>> It looks like the best there is to go on is this:
>>
>> Tesseract OCR
>> From Wikipedia, the free encyclopedia
>>
>> An OCR program developed at Hewlett Packard Laboratories
>> between 1985 and 1995. In 1995 it placed third among the best
>> performers in the evaluation organized by UNLV (University of
>> Nevada, Las Vegas).
>>
>> Google uses it to search the text of books that are no longer under
>> copyright, making them available to users.
>>
>> Released in 2005 under an open source license.
>>
>> Limitations
>>
>> Only recognizes English
>>
>> External links
>>
>> SourceForge page
>
> although in Ubuntu Hardy this shows up:
>
> tesseract-ocr language files for Spanish text
> A commercial quality OCR engine originally developed at HP between 1985
> and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It
> was open-sourced by HP and UNLV in 2005. This package contains the data
> needed for processing images in Spanish.
>
> Homepage: http://code.google.com/p/tesseract-ocr/
>
> All that is left is to test and compare...
and finally a howto:

Optical Character Recognition With Tesseract OCR On Ubuntu 7.04
Submitted by o.meyer on Tue, 2007-08-28 16:56.

Version 1.0
Author: Oliver Meyer <o [dot] meyer [at] projektfarm [dot] de>
Last edited 08/23/2007

This document describes how to set up Tesseract OCR on Ubuntu 7.04. OCR means "Optical Character Recognition". The resulting system will be able to convert images with embedded text to text files. Tesseract is licensed under the Apache License v2.0.

This howto is meant as a practical guide; it does not cover the theoretical background, which is treated in many other documents on the web.

This document comes without warranty of any kind! This is not the only way of setting up such a system. There are many ways of achieving this goal, but this is the way I take. I do not issue any guarantee that this will work for you!

1 Preparation

Set up a basic Ubuntu 7.04 system and update it. Get scanned images or scan documents yourself. If you use a scanner, be sure that it is supported by sane. A list of supported devices is available at http://www.sane-project.org/.

2 Get Imagemagick

The current version of tesseract provided in the Ubuntu repositories supports only uncompressed and G3-compressed tiff files. To ensure that tesseract is able to process your images, you should convert them to uncompressed tiff. Since conversions with Gimp to uncompressed tiff were unusable, I used the convert tool, which is supplied by the Imagemagick package.

Install Imagemagick from the Ubuntu repositories with the Synaptic Package Manager.

3 Get Tesseract

Install the packages tesseract-ocr and tesseract-ocr-data from the Ubuntu repositories with the Synaptic Package Manager.

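The installs in sections 2 and 3 can also be checked from a terminal. What follows is only a rough sketch, not part of the howto: the function name missing_tools is made up for illustration, and the apt-get line assumes the package names the howto gives for Ubuntu 7.04 (other releases may name them differently).

```shell
#!/bin/sh
# Hypothetical helper: print the name of every command in the
# argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# convert is supplied by the imagemagick package, tesseract by tesseract-ocr.
missing=$(missing_tools convert tesseract)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not installed yet:" $missing
    # Assumed package names, taken from the howto text (Ubuntu 7.04).
    echo "Try: sudo apt-get install imagemagick tesseract-ocr tesseract-ocr-data"
else
    echo "convert and tesseract are both available."
fi
```

The script only reports what is missing; it does not install anything by itself.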
4 Prepare Images

To get the best results from tesseract, you have to optimize the images. I recommend using images with a minimum resolution of about 200dpi. I used Gimp for the following steps 4.1 - 4.3.

4.1 Cleaning

Remove any non-alphanumeric content from the image to prevent tesseract from producing chaotic text blocks. That can be done easily with the erase tool within Gimp.

4.2 Threshold

Convert the image to RGB or Grayscale mode.
Within Gimp: Image - Mode - RGB or Grayscale

Use the threshold function to reduce uneven lighting and remove fragments. Move the sliders to define the boundary between bright and dark areas. Watch the preview while you are doing this to see the effects on the image.
Within Gimp: Tools - Color Tools - Threshold

4.3 Black And White

To improve the text recognition, we reduce the colors to black and white by switching the image to indexed mode.
Within Gimp: Image - Mode - Indexed

Be sure to turn off dithering. Save the image after this step.

5 Convert To Tiff

Now you have to convert the image to uncompressed tiff.

    convert %source_file% %destination_file%

e.g.:

    convert document.jpg document.tif

6 Use Tesseract

At this point all preparations are completed, so you can start using tesseract.

    tesseract %tiff_file% %name_for_resulting_files%

e.g.:

    tesseract document.tif result

Tesseract adds the file extensions for the resulting files itself. In this example tesseract would create result.txt, result.map and result.raw.

Links

* Tesseract: http://sourceforge.net/projects/tesseract-ocr
* Sane: http://www.sane-project.org/
* Ubuntu: http://www.ubuntu.com/

Let me know afterwards what results you get with Spanish text, to see whether I can offer a complete solution for libraries: Koha + tesseract-ocr

Regards,
Lesman
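Steps 5 and 6 of the howto can be chained for a whole batch of scans. This is a minimal sketch under a few assumptions, not something from the howto itself: the function names ocr_basename and ocr_one are invented here, the output files are written to the current directory, and the -compress none option is added to ask ImageMagick explicitly for an uncompressed tiff (the howto calls convert without it).

```shell
#!/bin/sh
# Hypothetical helper: derive the base name tesseract should use for
# its output, dropping the directory and the last extension
# (scans/page1.jpg -> page1).
ocr_basename() {
    name=$(basename "$1")
    printf '%s\n' "${name%.*}"
}

# Hypothetical helper: convert one image to tiff with ImageMagick,
# then run tesseract on the result. Output lands in the current directory.
ocr_one() {
    base=$(ocr_basename "$1")
    # -compress none is an addition to the howto's plain convert call.
    convert "$1" -compress none "$base.tif" && tesseract "$base.tif" "$base"
}

# Process every image named on the command line.
for image in "$@"; do
    ocr_one "$image" || echo "failed: $image" >&2
done
```

Saved as, say, ocr-batch.sh, it would be called as sh ocr-batch.sh scans/*.jpg. For Spanish text, tesseract versions that accept the -l option could be given -l spa once the Spanish language data from the Hardy package is installed; that part is untested here.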
