#149: BibClassify: keyword tag cloud and daemon improvements
--------------------------+--------------------
  Reporter:  simko        |      Owner:  rchyla
      Type:  task         |     Status:  closed
  Priority:  major        |  Milestone:  v1.0
 Component:  BibClassify  |    Version:
Resolution:  fixed        |   Keywords:
--------------------------+--------------------
Changes (by Roman Chyla <roman.chyla@…>):

 * status:  in_merge => closed
 * resolution:   => fixed


Comment:

 In [62eaad18a0dbd19fa23ca12549f4f1dcf400c335]:
 {{{
 #!CommitTicketReference repository=""
 revision="62eaad18a0dbd19fa23ca12549f4f1dcf400c335"
 BibClassify: UI improvements and refactoring

 * Refactoring changes:
   * instead of trying to load components, we leave the job to python -
     just make sure correct python path is set
   * write into the etc directory if running in a standalone mode
   * making import relative
   * improving bibclassify_config.py to find config dir and test
     writability
   * also refactoring, moving the acronym extraction call into the core
     function
   * fixing inconsistency between arguments: _output_marc was expecting
     a dictionary of kws, while _output_html and _output_text expected
     a list of kws
   * marc is able to work with different types of keywords (stages
     possible); needs some workflow
   * brought more api calls inside bibclassify engine

 * Output changes:
   * author keywords were not formatted properly for the output, now
     fixed
   * display the count for the found core keywords - helpful for human
     decisions
   * fieldcodes are now printed during output
   * added bibclassify signature to the output
   * simplified and refactored the output for html, txt, marc
   * core keywords are now printed no matter if they are in the limit
     range
   * outputting core keywords, if they are part of the composites
   * web interface now distinguishes between different types of
     keywords
   * DESY keywords (or other fields) are now displayed alongside the
     other keywords
   * moved the local css from bibclassify_config to
     webstyle/css/invenio.css
   * (closes #149)

 * Cache reloading and invalidation fixes:
   * fixed the bug where the cache was always being rebuilt
   * incremented version number
   * invalidate generated docs (var/tmp/bibclassify/bibclassify_*.xml)
     by lazy-deleting the files if already saved in the database
   * checked reuse of the same cache between threads - using
     thread.Lock()
   * (closes #49)

 * Making bibclassify more secure and other small changes:
   * kw generation now goes through bibsched
   * generation might be associated with certain user roles
   * escape added to kw args received from web
   * added docstrings to tests
   * relative path resolution (automatic) to microtests
   * limiting number of keywords output
   * use_task_low_level_submission to upload keywords
   * removed mysqldb.escape_string
   * ui messages improvements

 * Workflow improvements and various prettifications:
   * before extraction is scheduled, we now check many more conditions
     before allowing the run; appropriate messages are generated
   * improved css, moved css to invenio
   * make config use 6531_ syntax for main and other marc fields
   * when exporting, use data from the database rather than from the
     generated files (if they still exist)
   * when no weights are available, make tagcloud use minimal size,
     rather than maximum size fonts
   * fixed bug for searching inside DB for taxonomy name
   * improved local_config options, to override settings locally

 * Bugs fixed:
   * Fixing bug where KeywordToken.output() spits out label instead of
     prefLabel; it was a problem in instance.spires initialization
   * option extract-acronyms was not honoured
   * fixed bug where only single keywords were considered as parts of
     a composite kw, although a composite kw can be made of other
     composite kws (now, if bibc reports that there are missing kws,
     it means the concepts are defined neither as single nor as
     composite kws, i.e. an error in the taxonomy)
   * fixed bibclassify_engine.py when generating marcxml
   * prettified interface
   * moving css to <style>
   * prevent whitespace breaks for kws
   * fixed bug: exchanged skw with ckw
   * added __hash__ method to the KeywordToken which is important to
     ensure kw objects are identified by their concept string
   * text_extractor was never using pdftotext in one call
   * bibclassify_ontology_reader.py
     * single extracted keywords were truncated before being sent to
       composite kw extraction; this resulted in fewer comp kws found
     * also added more checks for cache existence, writability,
       readability etc
     * fixed an invisible slow-down caused by the rdflib graph when
       the store object is used and evaluated as a boolean (which is in
       fact not a fast operation)
   * bibclassify_tests.py
     * a lot of new tests checking for cache (re)build-up

 * Various:
   * make bibclassify_acronym_analyzer.py accept lower-case letters
     as acronyms; this is a somewhat controversial change, because an
     acronym is now any group of letters inside brackets, e.g. (dS),
     (DS), (Ds), that is preceded by an expansion matching the "d-s"
     pattern, i.e. dynamic syntax (ds)
   * updated bibclassify_tests.py to use pdfs from the demo collection
   * inherit fieldcode information from components
   * added info about the number of matches (for inherited core
     keywords)
   * fixed a bug where composite keywords were not found because single
     kws were filtered/truncated before passing them to the comp-kw
     search
   * moved the regex compilation (20% speed increase on bulk
     processing)
   * fixed two old regex patterns that introduced a lot of noise into
     the normalized text
   * (re)added --only-core-tags option
   * changed config of the patterns to match multi-kws that span
     several lines
 }}}
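
The `__hash__` item in the "Bugs fixed" list can be illustrated with a
minimal sketch. This is not the actual BibClassify code: the class below is
a simplified, hypothetical stand-in for `KeywordToken`, and the attribute
names `concept` and `label` are assumptions made for illustration. It only
shows why hashing on the concept string lets two token objects for the same
concept collapse onto one dictionary key.

```python
class KeywordToken:
    """Hypothetical, simplified stand-in for BibClassify's KeywordToken."""

    def __init__(self, concept, label=None):
        # Assumption: `concept` is the string that identifies the concept.
        self.concept = concept
        self.label = label or concept

    def __eq__(self, other):
        # Tokens are equal when they refer to the same concept string.
        if not isinstance(other, KeywordToken):
            return NotImplemented
        return self.concept == other.concept

    def __hash__(self):
        # Hash on the same field used by __eq__, so equal tokens
        # always end up on the same dict key / set member.
        return hash(self.concept)


# Count matches per concept; the first two tokens share one key.
matches = {}
for token in (
    KeywordToken("Higgs particle"),
    KeywordToken("Higgs particle", label="Higgs"),
    KeywordToken("supersymmetry"),
):
    matches[token] = matches.get(token, 0) + 1
```

Without a `__hash__` consistent with `__eq__`, the two "Higgs particle"
tokens would be stored as separate keys and their match counts would not
be merged.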

-- 
Ticket URL: </ticket/149#comment:3>
Invenio <http://invenio-software.org>
