This is an automated notification sent by LCG Savannah.
It relates to:
                task #6474, project CDS Invenio

==============================================================================
 LATEST MODIFICATIONS of task #6474:
==============================================================================

Update of task #6474 (project cdsware):

                Category:                    None => WebSubmit              


==============================================================================
 OVERVIEW of task #6474:
==============================================================================

URL:
  <http://savannah.cern.ch/task/?6474>

                 Summary: Centralizing word extraction
                 Project: CDS Invenio
            Submitted by: skaplun
            Submitted on: 2008-02-25 08:31
         Should Start On: 2008-02-25 00:00
   Should be Finished on: 2008-02-25 00:00
                Category: WebSubmit
                Priority: 3 - Low
                  Status: Done
                 Privacy: Public
        Percent Complete: 100%
             Assigned to: skaplun
             Open/Closed: Closed
         Discussion Lock: Any
                  Effort: 0.00

    _______________________________________________________


BibIndex, BibClassify, RefExtract, and BibRank (with word ranking) all need to
convert a document into a stream of words via PDF tools. It is worth doing this
only once and caching the extracted text, in compressed (zipped) form, right
next to the different revisions of the document, e.g.:

some_document.pdf;1
some_document.ps.gz;1
.text_in_some_document;1

A centralized API for this is needed.
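
A minimal sketch of the caching idea described above, not the actual Invenio
implementation. The function names (text_cache_path, get_cached_text) and the
pluggable extractor callable are assumptions made for illustration; only the
".text_in_<name>;<revision>" naming comes from the task description. The
document name is cut at its first dot, which is a simplification.

```python
import gzip
import os
import tempfile


def text_cache_path(doc_path):
    """Map 'dir/some_document.pdf;1' to 'dir/.text_in_some_document;1'."""
    directory, filename = os.path.split(doc_path)
    name, _, revision = filename.partition(";")
    basename = name.split(".", 1)[0]  # drop '.pdf', '.ps.gz', ...
    suffix = ";" + revision if revision else ""
    return os.path.join(directory, ".text_in_" + basename + suffix)


def get_cached_text(doc_path, extractor):
    """Return the text of doc_path, running extractor only on a cache miss.

    extractor is any callable taking a path and returning a string
    (hypothetical stand-in for a pdf/ps/doc conversion tool).
    """
    cache = text_cache_path(doc_path)
    if not os.path.exists(cache):
        text = extractor(doc_path)
        # Store the extracted text gzip-compressed next to the document.
        with gzip.open(cache, "wt", encoding="utf-8") as handle:
            handle.write(text)
    with gzip.open(cache, "rt", encoding="utf-8") as handle:
        return handle.read()


# Usage: the second call is served from the cache, not the extractor.
calls = []


def fake_extractor(path):
    calls.append(path)
    return "words extracted from " + os.path.basename(path)


with tempfile.TemporaryDirectory() as tmp:
    doc = os.path.join(tmp, "some_document.pdf;1")
    open(doc, "wb").close()
    first = get_cached_text(doc, fake_extractor)
    second = get_cached_text(doc, fake_extractor)
    cache_name = os.path.basename(text_cache_path(doc))
```

Because the cache path depends only on the document name and revision, the
pdf and ps.gz formats of the same revision share a single cached text file.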

    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: 2008-11-19 17:54              By: Samuele Kaplun <skaplun>
API completed, and text extraction is embedded in bibindex -w fulltext.
BibClassify, RefExtract, etc. can now simply use the experimental API:

x = BibDoc(docid)
if not x.has_text():      # no cached text for this document yet
    x.extract_text()      # extract once and cache the result
text = x.get_text()       # read the cached text

This abstracts away from whatever formats the bibdoc contains (pdf, ps.gz,
doc...) and returns a stream of words coming from the document (possibly
involving OCR).
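
To illustrate how several consumers could share one extraction through the
has_text / extract_text / get_text calls above, here is a small sketch. The
FakeBibDoc class and the fulltext_for helper are hypothetical stand-ins
written for this example; they are not the real BibDoc class, whose
internals are not shown in this task.

```python
class FakeBibDoc:
    """Hypothetical stand-in exposing the three methods named above."""

    def __init__(self, docid, raw_content):
        self.docid = docid
        self._raw_content = raw_content
        self._text = None
        self.extractions = 0  # counts how often extraction actually ran

    def has_text(self):
        return self._text is not None

    def extract_text(self):
        # A real implementation would run pdf/ps/doc converters (or OCR);
        # here we only normalize whitespace into a stream of words.
        self.extractions += 1
        self._text = " ".join(self._raw_content.split())

    def get_text(self):
        if self._text is None:
            raise ValueError("text not extracted yet")
        return self._text


def fulltext_for(doc):
    """Extract on first use only, then return the cached text."""
    if not doc.has_text():
        doc.extract_text()
    return doc.get_text()


# Two consumers (say, an indexer and a classifier) ask for the same text.
doc = FakeBibDoc(1, "Centralizing   word\nextraction")
text_for_indexer = fulltext_for(doc)
text_for_classifier = fulltext_for(doc)
```

Wrapping the check in one helper keeps callers from extracting twice by
accident.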


-------------------------------------------------------
Date: 2008-11-05 16:19              By: Samuele Kaplun <skaplun>
Part of the new conversion-tool library.





    _______________________________________________________

Carbon-Copy List:

CC Address                          | Comment
------------------------------------+-----------------------------
1576                                | -UPD-
2195                                | -SUB-




