During my PhD, this (automatic extraction of data from the physical
structure of a document) was still a research subject.
Have a look at http://www.loria.fr/equipes/read/
I don't know whether any free or proprietary systems have appeared since then.

When the layout of your documents is regular, a rather simple process
may be enough, but if the layout varies too much, the task becomes much
more complicated!
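
To illustrate what a "rather simple process" could look like, here is a
minimal sketch in Python. It is not taken from any existing system. The
layout it assumes is purely hypothetical: the title is the first non-empty
line, the authors are on the next non-empty line (separated by commas or
"and"), and an "Abstract" heading is followed by the abstract, which ends
at the first blank line. The function name extract_metadata is mine too.

```python
import re


def extract_metadata(text):
    """Rule-based extraction for ONE assumed (hypothetical) layout.

    Returns a dict with title, authors and abstract, or None when the
    text is too short to match the assumed layout.
    """
    lines = [line.strip() for line in text.splitlines()]
    non_empty = [i for i, line in enumerate(lines) if line]
    if len(non_empty) < 2:
        return None  # not enough lines for a title and an author line

    # Assumption: the title is the first non-empty line.
    title = lines[non_empty[0]]

    # Assumption: authors are on the second non-empty line,
    # separated by commas or by the word "and".
    author_line = lines[non_empty[1]]
    authors = [a.strip() for a in re.split(r",|\band\b", author_line) if a.strip()]

    # Assumption: a line reading "Abstract" introduces the abstract,
    # which runs until the first blank line after its text starts.
    abstract = None
    for i, line in enumerate(lines):
        if line.lower() == "abstract":
            body = []
            for following in lines[i + 1:]:
                if not following:
                    if body:
                        break  # blank line after the abstract text: stop
                    continue  # skip blank lines right after the heading
                body.append(following)
            abstract = " ".join(body)
            break

    return {"title": title, "authors": authors, "abstract": abstract}
```

As soon as a document departs from these assumptions (a two-line title,
authors with affiliations, no "Abstract" heading), the rules return wrong
or empty fields, which is exactly why varying layouts are so much harder.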
--
François PARMENTIER / INIST-CNRS

On Sun, Dec 14, 2008 at 12:52 AM, Andrew Marlow <marlow.and...@googlemail.com> wrote:

> This may seem like a crazy or naive question, but is there any standard
> laid down by publishers or societies that authors must adhere to so that the
> extraction of metadata from articles can be easily automated? Having just
> performed a text extraction on a non-searchable PDF I see that there is no
> easy way to get any metadata out. But if a society had conventions for the
> layout of the article, specifying location and format of title, authors,
> abstract, bibliography etc, then it might be possible. I have seen a very
> regular visual layout in the PDFs from some places. Using OCR techniques it
> might be possible to locate blocks of interest. It might also be possible
> from a text extraction but that might be harder since all visual layout
> information is gone (at least it was with the tool I used). I wonder if this
> is being considered by anyone. I am very new to this area so please excuse
> me if this seems like a silly question.
> --
> Regards,
>
> Andrew M.
>
>
> ------------------------------------------------------------------------------
> SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
> The future of the web can't happen without you.  Join us at MIX09 to help
> pave the way to the Next Web now. Learn more and register at
>
> http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>
>