#197: Errors in text extraction without running 'make install-pdfa-helper-files'
------------------------+---------------------------------------------------
  Reporter:  bthiell    |       Owner:  skaplun 
      Type:  defect     |      Status:  assigned
  Priority:  major      |   Milestone:  v1.0    
 Component:  WebSubmit  |     Version:          
Resolution:             |    Keywords:          
------------------------+---------------------------------------------------

Comment (by skaplun):

 The real problem is:

 What would be a generic way to let the admin decide which fulltext to
 extract text from (with or without OCR)? If we find an answer to that, we
 are done :-)

 This decision cannot be based on collections, as these are constructed by
 webcoll based on bibindex execution. Since bibindex is also responsible for
 extracting the text to index, it must decide whether to extract text from
 a fulltext before webcoll runs.

 A good and simple solution might be to center everything on the doctype,
 e.g.:
 CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_DEFAULT = True
 CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_VIA_OCR_DEFAULT = False
 CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE = Main,Slides
 CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITHOUT_DOCTYPE = Figure,Icon
 CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE_WITH_OCR = Scan
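
 To make the intended semantics concrete, here is a minimal sketch of how
 bibindex could consult these variables. The function name
 extraction_policy() is hypothetical, and so is the precedence order (OCR
 list first, then the exclusion list, then the inclusion list, then the
 defaults); neither exists in Invenio today. The values are hard-coded so
 that the sketch is self-contained.

```python
# Hypothetical sketch only: the variable names mirror the proposal above,
# but extraction_policy() and its precedence rules are assumptions.
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_DEFAULT = True
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_VIA_OCR_DEFAULT = False
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE = ("Main", "Slides")
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITHOUT_DOCTYPE = ("Figure", "Icon")
CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE_WITH_OCR = ("Scan",)


def extraction_policy(doctype):
    """Return a tuple (extract_text, use_ocr) for the given doctype."""
    # Doctypes explicitly flagged for OCR are always extracted, via OCR.
    if doctype in CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE_WITH_OCR:
        return (True, True)
    # Doctypes explicitly excluded are never extracted.
    if doctype in CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITHOUT_DOCTYPE:
        return (False, False)
    # Doctypes explicitly included are extracted without OCR.
    if doctype in CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_WITH_DOCTYPE:
        return (True, False)
    # Anything else falls back to the two *_DEFAULT switches.
    extract = CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_DEFAULT
    use_ocr = extract and CFG_BIBINDEX_EXTRACT_TEXT_FROM_FULLTEXT_VIA_OCR_DEFAULT
    return (extract, use_ocr)
```

 With these values, extraction_policy("Scan") would request OCR, while
 extraction_policy("Figure") would skip extraction altogether.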

 Any better proposal? (Of course, the best thing would be to build another
 language similar to the search syntax, based on doctype, docname, status,
 format, etc. Would it be possible to recycle part of the search engine for
 performing such parsing? E.g. "docname:Scan* AND NOT doctype:Figure"?)
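
 For comparison, a rule language of that kind does not need much
 machinery. The sketch below is a standalone stand-in, not a reuse of the
 search engine: the function matches() is hypothetical, it only understands
 field:pattern terms combined with AND, OR and NOT (no parentheses), and it
 uses shell-style wildcards from the standard fnmatch module. It is written
 in Python 3 syntax.

```python
from fnmatch import fnmatchcase


def matches(rule, doc):
    """Evaluate a rule like 'docname:Scan* AND NOT doctype:Figure'.

    `doc` is a dict of attributes (doctype, docname, status, format...).
    Grammar (hypothetical): expr := term (OR term)*
                            term := factor (AND factor)*
                            factor := NOT factor | field:pattern
    """
    tokens = rule.split()
    pos = 0

    def parse_factor():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of rule")
        token = tokens[pos]
        pos += 1
        if token == "NOT":
            return not parse_factor()
        field, sep, pattern = token.partition(":")
        if not sep:
            raise ValueError("expected field:pattern, got %r" % token)
        # Missing attributes are treated as empty strings.
        return fnmatchcase(str(doc.get(field, "")), pattern)

    def parse_term():
        nonlocal pos
        result = parse_factor()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            rhs = parse_factor()  # always parse, so pos keeps advancing
            result = result and rhs
        return result

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            rhs = parse_term()
            result = result or rhs
        return result

    value = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected token %r" % tokens[pos])
    return value
```

 For instance, the rule quoted above would accept a file named "Scan001"
 with doctype "Main" and reject the same file if its doctype were "Figure".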

-- 
Ticket URL: <http://invenio-software.org/ticket/197#comment:5>
Invenio <http://invenio-software.org>
