This is an automated notification sent by LCG Savannah.
It relates to:
task #6474, project CDS Invenio
==============================================================================
LATEST MODIFICATIONS of task #6474:
==============================================================================
Update of task #6474 (project cdsware):
Category: None => WebSubmit
==============================================================================
OVERVIEW of task #6474:
==============================================================================
URL:
<http://savannah.cern.ch/task/?6474>
Summary: Centralizing word extraction
Project: CDS Invenio
Submitted by: skaplun
Submitted on: 2008-02-25 08:31
Should Start On: 2008-02-25 00:00
Should be Finished on: 2008-02-25 00:00
Category: WebSubmit
Priority: 3 - Low
Status: Done
Privacy: Public
Percent Complete: 100%
Assigned to: skaplun
Open/Closed: Closed
Discussion Lock: Any
Effort: 0.00
_______________________________________________________
BibIndex, BibClassify, RefExtract, and BibRank (with word ranking) all need to
convert a document into a stream of words via PDF tools. It is worth doing this
only once and caching the extracted text in zipped form right next to the
different revisions of the document:
    some_document.pdf;1
    some_document.ps.gz;1
    .text_in_some_document;1
A centralized API for this is needed.
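
As a rough illustration of the caching idea above (not the actual Invenio
code), the sketch below derives a hidden cache file name from a versioned
document name and stores the extracted text gzip-compressed beside it. The
function names text_cache_path and get_cached_text, the extract callback, the
choice of gzip, and the rule of cutting the file name at its first dot are all
assumptions made for this example.

```python
import gzip
import os


def text_cache_path(document_path):
    """Map "dir/some_document.pdf;1" to "dir/.text_in_some_document;1".

    Hypothetical naming rule: everything after the first dot of the file
    name is treated as the format extension (so "ps.gz" is dropped whole).
    """
    directory, filename = os.path.split(document_path)
    name, _, version = filename.partition(";")
    basename = name.split(".", 1)[0]
    return os.path.join(directory, ".text_in_%s;%s" % (basename, version))


def get_cached_text(document_path, extract):
    """Return the text of a document, running extract() at most once.

    extract is any callable taking the document path and returning its
    text; it stands in for the PDF tools (or OCR) mentioned above.
    """
    cache_path = text_cache_path(document_path)
    if not os.path.exists(cache_path):
        # First request for this revision: extract and store compressed.
        with gzip.open(cache_path, "wt", encoding="utf-8") as cache:
            cache.write(extract(document_path))
    with gzip.open(cache_path, "rt", encoding="utf-8") as cache:
        return cache.read()
```

With this layout every format of the same revision (pdf, ps.gz) shares one
cache file, so whichever tool asks first pays the extraction cost and the
others only read the compressed text.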
_______________________________________________________
Follow-up Comments:
-------------------------------------------------------
Date: 2008-11-19 17:54 By: Samuele Kaplun <skaplun>
API completed, and text extraction embedded in bibindex -w fulltext.
BibClassify, RefExtract, etc. can now simply call the experimental API:
    x = BibDoc(docid)
    if not x.has_text():
        x.extract_text()
    text = x.get_text()
This abstracts away whatever formats the bibdoc contains (pdf, ps.gz, doc...)
and returns a stream of words extracted from the document (possibly involving
OCR).
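
A caller such as BibClassify or RefExtract might wrap the three calls in one
helper. The sketch below is only an illustration: get_document_text is a
hypothetical name, and FakeBibDoc is a stand-in written for this example so
that it runs without Invenio. It mimics only the has_text(), extract_text()
and get_text() methods named above and says nothing about how the real BibDoc
works internally.

```python
def get_document_text(bibdoc):
    """Return the text of a bibdoc, extracting it only if not yet cached."""
    if not bibdoc.has_text():
        bibdoc.extract_text()
    return bibdoc.get_text()


class FakeBibDoc(object):
    """Hypothetical stand-in for BibDoc, for demonstration only."""

    def __init__(self, raw_text):
        self._raw_text = raw_text
        self._cached_text = None
        self.extractions = 0  # counts how often extraction actually ran

    def has_text(self):
        return self._cached_text is not None

    def extract_text(self):
        # The real extraction would run the conversion tools here.
        self.extractions += 1
        self._cached_text = self._raw_text

    def get_text(self):
        return self._cached_text


doc = FakeBibDoc("centralizing word extraction")
first = get_document_text(doc)   # triggers the extraction
second = get_document_text(doc)  # served from the cached text
```

The point of the helper is that callers never need to know which formats the
bibdoc holds or whether the text was already extracted.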
-------------------------------------------------------
Date: 2008-11-05 16:19 By: Samuele Kaplun <skaplun>
Part of the new conversion-tool library.
_______________________________________________________
Carbon-Copy List:
CC Address | Comment
------------------------------------+-----------------------------
1576 | -UPD-
2195 | -SUB-
==============================================================================