At the end of the 1990s, I used MS Word forms and macros to let authors enter metadata together with their articles. Even the references were structured.

It seemed like a good idea (normalizing upfront).

It ended very badly because:
* Word for Macintosh was not compatible where forms and macros were concerned;
* WordPerfect was still popular and was presented as compatible (which was not true for forms and macros). Worst of all, one of the reviewers opened most of the articles in WordPerfect and saved them after adding comments...
* Asian versions of Word introduced characters that Western versions did not recognize;
* About a quarter of the authors did not understand the form.
Those (technical?) problems produced a terrible mess, which took a very long time to correct and delayed the publication of the paper.

Efficient cataloguers (possibly with the help of a submission form like the DSpace one, plus a better cataloguing form than the current one) will always be better than a machine at taming the authors' "diversity"!

Have a nice day!

Christophe Dupriez

François Parmentier wrote:
During my PhD, this was still a research subject (automatic extraction of data from the physical structure of a document).
Have a look at http://www.loria.fr/equipes/read/
I don't know whether there have been free or proprietary systems since then.

When the layout of your documents is regular, a rather simple process may be enough, but if it varies too much, it is a much more complicated task!
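
For the regular-layout case, a rather simple process could look like the Python sketch below. It assumes a purely hypothetical convention: the first non-blank line is the title, the second is the author list, and the abstract sits under a line reading "Abstract" until the next known heading. The function name `extract_metadata` and the stop headings are illustrative assumptions, not part of any publisher standard or of DSpace.

```python
import re

# Headings assumed (hypothetically) to end the abstract block.
STOP_HEADING = re.compile(r"(keywords\b|1\.?\s+introduction\b|references\b)")


def extract_metadata(text):
    """Extract title, authors and abstract from text with an assumed regular layout."""
    lines = [ln.strip() for ln in text.splitlines()]
    nonblank = [ln for ln in lines if ln]

    # Assumption: the title is the first non-blank line.
    title = nonblank[0] if nonblank else None

    # Assumption: authors are on the second non-blank line,
    # separated by commas and/or the word "and".
    authors = []
    if len(nonblank) > 1:
        parts = re.split(r"\s*(?:,|\band\b)\s*", nonblank[1])
        authors = [p for p in parts if p]

    # Assumption: the abstract follows a line that reads exactly "Abstract".
    abstract_lines = []
    in_abstract = False
    for ln in lines:
        low = ln.lower()
        if not in_abstract:
            in_abstract = (low == "abstract")
            continue
        if STOP_HEADING.match(low):
            break
        abstract_lines.append(ln)
    abstract = " ".join(" ".join(abstract_lines).split()) or None

    return {"title": title, "authors": authors, "abstract": abstract}


sample = """A Study of Widgets

Ada Lovelace, Alan Turing and Grace Hopper

Abstract
We study widgets
in some detail.

Keywords: widgets
"""

meta = extract_metadata(sample)
```

As soon as a document departs from the assumed layout (no "Abstract" heading, authors spread over several lines), the sketch returns missing or wrong fields, which is exactly where the task becomes much more complicated.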
--
François PARMENTIER / INIST-CNRS

On Sun, Dec 14, 2008 at 12:52 AM, Andrew Marlow <marlow.and...@googlemail.com> wrote:

    This may seem like a crazy or naive question, but is there any
    standard laid down by publishers or societies that authors must
    adhere to so that the extraction of metadata from articles can be
    easily automated? Having just performed a text extraction on a
    non-searchable PDF I see that there is no easy way to get any
    metadata out. But if a society had conventions for the layout of
    the article, specifying location and format of title, authors,
    abstract, bibliography etc, then it might be possible. I have seen
    a very regular visual layout in the PDFs from some places. Using
    OCR techniques it might be possible to locate blocks of interest.
    It might also be possible from a text extraction but that might be
    harder since all visual layout information is gone (at least it
    was with the tool I used). I wonder if this is being considered by
    anyone. I am very new to this area so please excuse me if this
    seems like a silly question.
    --
    Regards,

    Andrew M.

    
------------------------------------------------------------------------------
    SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las
    Vegas, Nevada.
    The future of the web can't happen without you.  Join us at MIX09
    to help
    pave the way to the Next Web now. Learn more and register at
    http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
    _______________________________________________
    DSpace-tech mailing list
    DSpace-tech@lists.sourceforge.net
    <mailto:DSpace-tech@lists.sourceforge.net>
    https://lists.sourceforge.net/lists/listinfo/dspace-tech


Christophe Dupriez
Computer scientist, DESTIN inc. SSEB
Information processing systems development
rue des Palais 44, boîte 1, B-1030 Bruxelles, Belgique
Tel: +32/2/216.66.15
Fax: +32/2/242.97.25
Cell: +32/475.77.62.11
Email: christophe.dupr...@destin.be
http://www.destin.be
