Re: [Dspace-tech] standards to facilitate metadata extraction during text extraction

2008-12-15 Thread Robin Taylor
I don't think it's a daft question at all, but then I am known to ask some very 
daft ones myself :)  

I think the problem is that we wrap the data up in formats that make extraction 
difficult and then need to go to great lengths to try and extract that data. I 
don't know of any widely used, reliable methods as yet. Better to move towards 
formats that make extraction easy. Microsoft docx documents look like a step 
in the right direction to me. It's a normal Word document but is stored as XML 
and hence is readable programmatically. In addition the author can add their own 
tags, so there is no reason why they should not tag the abstract, references, 
etc. In theory it should be easy to then extract that information.  
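
As a rough illustration of the point above: a .docx file is a ZIP archive whose main body lives in word/document.xml, so paragraphs carrying a given style can be pulled out with nothing beyond the Python standard library. This is only a sketch. The helper name paragraphs_with_style and the style name "Abstract" are assumptions made for the example; an author or publisher template would have to define and apply such a style for this to find anything.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs_with_style(docx_file, style_id):
    """Return the text of every paragraph whose paragraph style is style_id.

    docx_file is a path or a binary file-like object. style_id (for example
    "Abstract") is a hypothetical style that the template must define.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    matches = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        if style is not None and style.get(f"{{{W}}}val") == style_id:
            # A paragraph's text is split across runs; join the w:t pieces.
            matches.append("".join(t.text or "" for t in para.iter(f"{{{W}}}t")))
    return matches
```

Calling paragraphs_with_style("article.docx", "Abstract") on a document that uses such a style would return the abstract paragraphs as a list of strings. It returns an empty list when the author never applied the style, which is the practical weakness of relying on voluntary tagging.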

I'm sure there are good reasons why we all favour PDFs but I think the 
principle still applies.

Cheers, Robin.




Robin Taylor
Main Library
University of Edinburgh
Tel. 0131 6515208  

 -----Original Message-----
 From: Andrew Marlow [mailto:marlow.and...@googlemail.com] 
 Sent: 13 December 2008 23:53
 To: dspace-tech@lists.sourceforge.net
 Subject: [Dspace-tech] standards to facilitate metadata 
 extraction during text extraction
 
 This may seem like a crazy or naive question, but is there 
 any standard laid down by publishers or societies that 
 authors must adhere to so that the extraction of metadata 
 from articles can be easily automated? Having just performed 
 a text extraction on a non-searchable PDF I see that there is 
 no easy way to get any metadata out. But if a society had 
 conventions for the layout of the article, specifying 
 location and format of title, authors, abstract, bibliography 
 etc, then it might be possible. I have seen a very regular 
 visual layout in the PDFs from some places. Using OCR 
 techniques it might be possible to locate blocks of interest. 
 It might also be possible from a text extraction but that 
 might be harder since all visual layout information is gone 
 (at least it was with the tool I used). I wonder if this is 
 being considered by anyone. I am very new to this area so 
 please excuse me if this seems like a silly question.
 --
 Regards,
 
 Andrew M.
 



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


--
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you.  Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech


Re: [Dspace-tech] standards to facilitate metadata extraction during text extraction

2008-12-15 Thread Andrew Marlow
On Mon, Dec 15, 2008 at 9:36 AM, Robin Taylor robin.tay...@ed.ac.uk wrote:

 I don't think it's a daft question at all, but then I am known to ask some
 very daft ones myself :)

 I think the problem is that we wrap the data up in formats that make
 extraction difficult and then need to go to great lengths to try and extract
 that data. I don't know of any widely used, reliable methods as yet. Better
 to move towards formats that make extraction easy. Microsoft docx documents
 look like a step in the right direction to me.


No, no, no, please let us not use formats invented by Microsoft. We need
open formats not closed-secret-proprietary ones. And if Microsoft claim it
is open we must not believe them. Just look at their track record. I realise
that PDFs are not completely open either but they are bound to be more open
than anything Microsoft produce. And I was talking about PDFs.

But I do not want the discussion to focus on file formats. As I said
originally,

 But if a society had
 conventions for the layout of the article, specifying
 location and format of title, authors, abstract, bibliography
 etc, then it might be possible
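
To make the idea concrete, here is a minimal sketch of what such a convention would buy us once the text has been extracted. Everything about the convention is an assumption invented for the example: that the first non-blank line is the title, the second is the author list, and that the headings "Abstract", "Introduction" and "References" each sit alone on a line. The function name split_by_convention is likewise hypothetical.

```python
import re

# Hypothetical convention: each of these headings sits alone on its own line.
HEADINGS = ("abstract", "introduction", "references")
HEADING_RE = re.compile(r"^(%s)\s*:?$" % "|".join(HEADINGS), re.IGNORECASE)


def split_by_convention(text):
    """Split extracted article text into blocks keyed by heading.

    Assumes the first non-blank line is the title and the second is the
    author list; any other lines before the first heading are ignored.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    blocks = {
        "title": lines[0] if lines else "",
        "authors": lines[1] if len(lines) > 1 else "",
    }
    sections = {}
    current = None
    for line in lines[2:]:
        match = HEADING_RE.match(line)
        if match:
            current = match.group(1).lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    # Join each section's lines back into one newline-separated string.
    blocks.update({name: "\n".join(body) for name, body in sections.items()})
    return blocks
```

The sketch only works because the layout rules are fixed in advance. Without an agreed convention, every journal would need its own set of headings and its own rule for finding the title and authors.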


-- 
Regards,

Andrew M.


Re: [Dspace-tech] standards to facilitate metadata extraction during text extraction

2008-12-15 Thread Mark H. Wood
On Mon, Dec 15, 2008 at 09:36:11AM +, Robin Taylor wrote:
 I think the problem is that we wrap the data up in formats that make
 extraction difficult and then need to go to great lengths to try and
 extract that data. I don't know of any widely used, reliable methods as
 yet. Better to move towards formats that make extraction easy.

Most common formats other than plain text have some sort of tagging
feature.  In some cases, few know about them so they aren't much
used.  That could be fixed easily.

 Microsoft docx documents looks like a step in the right direction to
 me. It's a normal Word document but is stored as xml and hence is
 readable programmatically.

The older Office formats are readable programmatically too.  More
readable, actually, since OOXML is very new, still only partially
documented, and not implemented anywhere, even at Microsoft.  There's
a store for document attributes inside the traditional Office format's
bag.  There's a nice Java library (POI) that can extract them.

But then that only works for MS Office documents.  Not for OpenOffice
or Symphony.  Not for Acrobat.  We have tens of thousands of PDFs.  We
have audio and video streams waiting in the wings.

And we still need a system for assigning meanings to the tags.

 In addition the author can add their own
 tags, so there is no reason why they should not tag the abstract,
 references, etc. In theory it should be easy to then extract that
 information.

See the subject line.  If everybody makes up his own tags then there
is no standard, and software cannot make use of the tags without being
told, for each individual provider's profile, what to look for and
what they mean.  Bibliographic software like EndNote shows us what we
wind up with: hundreds of format modules to be maintained.  We can do
that but I'd rather have something systematic.  (BTW EndNote or one of
its brethren might be able to serve the original request.)
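
A small sketch of the per-provider bookkeeping described above, to show why it scales badly. The provider names and their tag names are invented for illustration; the target names are DSpace-style Dublin Core fields chosen as one possible shared vocabulary, not a proposed standard.

```python
# Hypothetical per-provider profiles: each maps a provider's own tag names
# onto a shared vocabulary (DSpace-style Dublin Core field names here).
PROFILES = {
    "society_a": {
        "ArticleTitle": "dc.title",
        "Authors": "dc.contributor.author",
        "Summary": "dc.description.abstract",
    },
    "publisher_b": {
        "ti": "dc.title",
        "au": "dc.contributor.author",
        "ab": "dc.description.abstract",
    },
}


def normalise(provider, tagged_fields):
    """Translate one provider's tagged fields into the shared vocabulary.

    Returns (known, unknown): fields the profile could map, and fields it
    could not, so that unmapped tags are reported rather than silently lost.
    """
    profile = PROFILES[provider]
    known, unknown = {}, {}
    for tag, value in tagged_fields.items():
        if tag in profile:
            known[profile[tag]] = value
        else:
            unknown[tag] = value
    return known, unknown
```

Every new provider means another entry in PROFILES, and every change on the provider's side means an edit to it. With an agreed standard the table would shrink to a single profile.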

If there is no standard now, then maybe it's up to the document
repository community (that's us) to lay the groundwork for some
standardization and champion the idea until it's accepted.

-- 
Mark H. Wood, Lead System Programmer   mw...@iupui.edu
Friends don't let friends publish revisable-form documents.

