#1013: Check for gibberish in references before accepting them
-------------------------------------------+-------------------------
 Reporter:  adeiana                        |       Owner:  adeiana
     Type:  enhancement                    |      Status:  new
 Priority:  minor                          |   Component:  DocExtract
  Version:                                 |  Resolution:
 Keywords:  garbage pdftotext pdf2text OCR |
-------------------------------------------+-------------------------
Changes (by skaplun):
* keywords: => garbage pdftotext pdf2text OCR
Comment:
Hi Alessio,
this is a nice feature, and it would be even better if it were factored out
and made available to the general textification process. Indeed, we do not
yet have a heuristic for detecting garbage in the output of pdftotext.
If you implement such a heuristic, it would be good to write it in a
generic way and put it e.g. in textutils or in bibdocfile, so that
BibIndex can also avoid indexing garbage.
Cheers!
Sam
--
Ticket URL: <http://invenio-software.org/ticket/1013#comment:1>
Invenio <http://invenio-software.org>