This is an automated notification sent by LCG Savannah.
It relates to:
                task #9008, project CDS Invenio

==============================================================================
 OVERVIEW of task #9008:
==============================================================================

URL:
  <http://savannah.cern.ch/task/?9008>

                 Summary: fulltext indexing should support MIME and/or magic
                 Project: CDS Invenio
            Submitted by: simko
            Submitted on: 2009-02-11 16:35
         Should Start On: 2009-02-11 00:00
   Should be Finished on: 2009-02-11 00:00
                Category: BibIndex
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
        Percent Complete: 0%
             Assigned to: skaplun
             Open/Closed: Open
         Discussion Lock: Any
                  Effort: 0.00

    _______________________________________________________


When fulltext-indexing remote files, the indexer currently relies only
on the file name extension to decide what file type the remote source
is and which to-text converter to call.

This approach is fine for DOC/setlink, but is not enough for
remote URLs of the following kind:
* <http://indico.cern.ch/materialDisplay.py?contribId=116sessionId=0&materialId=slides&confId=46024>
* <http://arxiv.org/pdf/0902.1743>

The fulltext indexer should first analyse the file extension. If it is not
among the known file formats, the indexer should guess the file type from
the MIME type response header sent by the remote site when downloading the
file, or from the filename proposed by the remote server, or by using the
magic library to discover the file type of the downloaded file. Then the
proper format-to-text converter program can be called.

(The latter works only for direct URLs, not for indirect URLs leading to
splash pages.)
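
The fallback order described above could be sketched roughly as follows.
This is only an illustration, not existing BibIndex code: the function
name guess_extension, the KNOWN_EXTENSIONS set, the MIME_TO_EXT table and
the MAGIC_SIGNATURES list are all hypothetical, the hand-written byte
signatures merely stand in for the real magic library, and the sketch
uses the Python 3 standard library.

```python
import os
from email.message import Message
from urllib.parse import urlparse

# Hypothetical: extensions for which a to-text converter is assumed to exist.
KNOWN_EXTENSIONS = {".pdf", ".ps", ".doc", ".ppt", ".txt"}

# Hypothetical MIME type -> extension table (deliberately tiny).
MIME_TO_EXT = {
    "application/pdf": ".pdf",
    "application/postscript": ".ps",
    "application/msword": ".doc",
    "application/vnd.ms-powerpoint": ".ppt",
    "text/plain": ".txt",
}

# Stand-in for the magic library: leading bytes -> extension.
MAGIC_SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"%!PS", ".ps"),
]


def _known(ext):
    """Return the lower-cased extension if a converter is known, else None."""
    ext = (ext or "").lower()
    return ext if ext in KNOWN_EXTENSIONS else None


def guess_extension(url, headers=None, head_bytes=b""):
    """Guess a file extension for a remote fulltext file.

    url        -- the remote URL
    headers    -- dict of HTTP response headers (may be empty)
    head_bytes -- the first bytes of the downloaded file (may be empty)
    """
    # 1. File name extension found in the URL path.
    ext = _known(os.path.splitext(urlparse(url).path)[1])
    if ext:
        return ext

    # Load the headers into a Message to get case-insensitive lookup
    # and parsing of header parameters.
    msg = Message()
    for name, value in (headers or {}).items():
        msg[name] = value

    # 2. MIME type from the Content-Type response header.
    if "Content-Type" in msg:
        ext = _known(MIME_TO_EXT.get(msg.get_content_type()))
        if ext:
            return ext

    # 3. File name proposed by the server (Content-Disposition).
    filename = msg.get_filename()
    if filename:
        ext = _known(os.path.splitext(filename)[1])
        if ext:
            return ext

    # 4. Signature of the downloaded content.
    for signature, sig_ext in MAGIC_SIGNATURES:
        if head_bytes.startswith(signature):
            return sig_ext

    return None
```

For the arXiv URL above, the path yields no known extension, so a
Content-Type of application/pdf would decide the result. For the Indico
URL, a Content-Disposition header proposing a file name such as
slides.ppt (an assumed server response) would be used instead.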



    _______________________________________________________

Carbon-Copy List:

CC Address                          | Comment
------------------------------------+-----------------------------
1576                                | -SUB-




==============================================================================

This item URL is:
  <http://savannah.cern.ch/task/?9008>

_______________________________________________
  Message sent via/by LCG Savannah
  http://savannah.cern.ch/

