Re: [Dspace-tech] standards to facilitate metadata extraction during text extraction

Andrew Marlow Mon, 15 Dec 2008 11:37:53 -0800

On Mon, Dec 15, 2008 at 9:36 AM, Robin Taylor <robin.tay...@ed.ac.uk> wrote:


> I don't think it's a daft question at all, but then I am known to ask some
> very daft ones myself :)
>
> I think the problem is that we wrap the data up in formats that make
> extraction difficult and then need to go to great lengths to try and extract
> that data. I don't know of any widely used, reliable methods as yet. Better
> to move towards formats that make extraction easy. Microsoft docx documents
> look like a step in the right direction to me.


No, no, no, please let us not use formats invented by Microsoft. We need
open formats not closed-secret-proprietary ones. And if Microsoft claim it
is open we must not believe them. Just look at their track record. I realise
that PDFs are not completely open either but they are bound to be more open
than anything Microsoft produce. And I was talking about PDFs.

But I do not want the discussion to focus on file formats. As I said
originally,

> But if a society had
> conventions for the layout of the article, specifying
> location and format of title, authors, abstract, bibliography
> etc, then it might be possible
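
To make the idea concrete, here is a minimal sketch in Python of what
extraction could look like if such a convention existed. The convention
itself is purely an assumption made up for illustration (title on the
first non-blank line, authors on the second, then a line reading
"Abstract" followed by the abstract text). No society I know of mandates
this, and the name extract_metadata is hypothetical.

```python
import re


def extract_metadata(text):
    """Extract title, authors and abstract from plain text.

    Assumes a HYPOTHETICAL layout convention (not a real standard):
      - first non-blank line: the title
      - second non-blank line: authors, separated by commas or 'and'
      - a line reading 'Abstract', followed by the abstract, which
        ends at the next blank line
    """
    lines = [ln.strip() for ln in text.splitlines()]
    non_blank = [ln for ln in lines if ln]
    if len(non_blank) < 2:
        raise ValueError("text does not follow the assumed layout")

    title = non_blank[0]
    # Split the author line on commas and on the whole word 'and'.
    parts = re.split(r",|\band\b", non_blank[1])
    authors = [p.strip() for p in parts if p.strip()]

    abstract_lines = []
    in_abstract = False
    for ln in lines:
        if in_abstract:
            if ln:
                abstract_lines.append(ln)
            elif abstract_lines:
                break  # blank line after some abstract text ends it
        elif ln.lower() == "abstract":
            in_abstract = True

    return {
        "title": title,
        "authors": authors,
        "abstract": " ".join(abstract_lines),
    }


if __name__ == "__main__":
    sample = (
        "A Study of Things\n"
        "A. Author, B. Writer and C. Scribe\n"
        "\n"
        "Abstract\n"
        "We study things.\n"
        "They are interesting.\n"
        "\n"
        "1. Introduction\n"
    )
    print(extract_metadata(sample))
```

The point is only that a fixed convention reduces extraction to a few
lines of code. Without one, every publisher's layout needs its own
heuristics, and a text extraction that loses the visual layout makes
even those unreliable.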


> -----Original Message-----
> > From: Andrew Marlow [mailto:marlow.and...@googlemail.com]
> > Sent: 13 December 2008 23:53
> > To: dspace-tech@lists.sourceforge.net
> > Subject: [Dspace-tech] standards to facilitate metadata
> > extraction during text extraction
> >
> > This may seem like a crazy or naive question, but is there
> > any standard laid down by publishers or societies that
> > authors must adhere to so that the extraction of metadata
> > from articles can be easily automated? Having just performed
> > a text extraction on a non-searchable PDF I see that there is
> > no easy way to get any metadata out. But if a society had
> > conventions for the layout of the article, specifying
> > location and format of title, authors, abstract, bibliography
> > etc, then it might be possible. I have seen a very regular
> > visual layout in the PDFs from some places. Using OCR
> > techniques it might be possible to locate blocks of interest.
> > It might also be possible from a text extraction but that
> > might be harder since all visual layout information is gone
> > (at least it was with the tool I used). I wonder if this is
> > being considered by anyone.
>
-- 
Regards,

Andrew M.
