#197: Errors in text extraction without running 'make install-pdfa-helper-files'
------------------------+---------------------------------------------------
Reporter: bthiell | Owner: skaplun
Type: defect | Status: assigned
Priority: major | Milestone: v1.0
Component: WebSubmit | Version:
Resolution: | Keywords:
------------------------+---------------------------------------------------
Comment (by skaplun):
The real problem is:
what would be a generic way to let the admin decide which fulltext files
to extract text from (with or without OCR)? Once that question is
answered, we are done :-)
This decision cannot be based on collections, because these are built by
webcoll based on the output of bibindex runs. Since bibindex is also
responsible for extracting the text to be indexed, it must decide whether
to extract text from a fulltext file before webcoll runs.
A good and simple solution might be to center everything on the doctype,
e.g.:
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_DEFAULT = True
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_VIA_OCR_DEFAULT = False
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE = Main,Slides
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITHOUT_DOCTYPE = Figure,Icon
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE_WITH_OCR = Scan
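
To make the intended semantics concrete, here is a minimal sketch of how
bibindex could resolve these settings for one document. Nothing like this
exists yet: the function name extraction_policy and the precedence order
(OCR list first, then the exclusion list, then the inclusion list, then
the defaults) are my assumptions, not settled behaviour.

```python
# Values copied from the proposal above; the doctype lists are kept as
# comma-separated strings and split on use.
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_DEFAULT = True
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_VIA_OCR_DEFAULT = False
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE = "Main,Slides"
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITHOUT_DOCTYPE = "Figure,Icon"
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE_WITH_OCR = "Scan"


def _as_set(value):
    """Turn 'A, B' into {'A', 'B'}, ignoring empty entries."""
    return {item.strip() for item in value.split(",") if item.strip()}


def extraction_policy(doctype):
    """Return (extract_text, use_ocr) for a doctype (hypothetical helper).

    Assumed precedence: OCR list, then WITHOUT list, then WITH list,
    then the two *_DEFAULT settings.
    """
    if doctype in _as_set(CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE_WITH_OCR):
        return True, True
    if doctype in _as_set(CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITHOUT_DOCTYPE):
        return False, False
    if doctype in _as_set(CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE):
        return True, False
    extract = CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_DEFAULT
    # OCR only makes sense when text is extracted at all.
    use_ocr = extract and CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_VIA_OCR_DEFAULT
    return extract, use_ocr
```

With the values above, a "Scan" doctype would be extracted via OCR, a
"Figure" would be skipped, and any doctype not listed would fall back to
the two defaults.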
Any better proposal? (Of course, the best thing would be to build another
language similar to the search syntax, based on doctype, docname, status,
format, etc. Would it be possible to recycle part of the search engine to
perform such parsing? E.g. "docname:Scan* AND NOT doctype:Figure"?)
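
For illustration only, a rule like the one quoted above could be
evaluated with something as small as the sketch below. It does not reuse
the search engine parser. It is a stand-in that assumes clauses are
joined by " AND ", that each clause may start with "NOT ", and that
patterns use shell-style wildcards. The function name matches and the
attrs dictionary are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(rule, attrs):
    """Evaluate a rule such as 'docname:Scan* AND NOT doctype:Figure'.

    `attrs` maps field names (docname, doctype, ...) to string values.
    Only 'AND' and a leading 'NOT' per clause are supported (assumption).
    """
    for clause in rule.split(" AND "):
        clause = clause.strip()
        negated = clause.startswith("NOT ")
        if negated:
            clause = clause[len("NOT "):].strip()
        field, _, pattern = clause.partition(":")
        hit = fnmatchcase(attrs.get(field, ""), pattern)
        # A plain clause must match; a negated clause must not match.
        if hit == negated:
            return False
    return True
```

For example, matches('docname:Scan* AND NOT doctype:Figure',
{'docname': 'Scan001', 'doctype': 'Main'}) would select the file, while
the same docname with doctype 'Figure' would not.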
--
Ticket URL: <http://invenio-software.org/ticket/197#comment:5>
Invenio <http://invenio-software.org>