Re: [CODE4LIB] Best practices for improving metadata extractability from journal articles?

Han, Yan - (yhan) Tue, 26 Mar 2019 13:52:13 -0700

XMP is the metadata standard required by the PDF standards. In the past, you 
could encode metadata in a PDF with either the Document Info dictionary or XMP. 
Document Info was deprecated in the recently released PDF 2.0. 
XMP is based on the RDF model and is powerful enough to express any metadata 
you need.
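
To make the RDF point a bit more concrete, here is a minimal sketch in Python 
(standard library only) of an XMP packet carrying article metadata. The title, 
author, DOI, and ISSN values are made up, build_xmp is a hypothetical helper 
name, and using the PRISM namespace for the DOI and ISSN is an assumption of 
this sketch, not something the PDF standards require. It only builds the packet 
text; embedding it in the PDF is still a job for your PDF SDK.

```python
import xml.etree.ElementTree as ET

# Namespaces used in the packet. The PRISM one is an assumption of this sketch.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(prefix, local):
    """Return the ElementTree qualified name for prefix:local."""
    return "{%s}%s" % (NS[prefix], local)


def build_xmp(title, authors, doi, issn):
    """Build an XMP packet (as a string) describing one article."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative: rdf:Alt with an x-default entry.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:creator is an ordered list of names: rdf:Seq.
    seq = ET.SubElement(ET.SubElement(desc, q("dc", "creator")), q("rdf", "Seq"))
    for name in authors:
        ET.SubElement(seq, q("rdf", "li")).text = name

    ET.SubElement(desc, q("dc", "identifier")).text = "doi:" + doi
    ET.SubElement(desc, q("prism", "doi")).text = doi
    ET.SubElement(desc, q("prism", "issn")).text = issn

    body = ET.tostring(xmpmeta, encoding="unicode")
    # Standard xpacket wrapper around the serialized metadata.
    header = '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    return header + "\n" + body + "\n" + '<?xpacket end="w"?>'


if __name__ == "__main__":
    # All values below are placeholders, not real identifiers.
    print(build_xmp("An Example Article", ["A. Author", "B. Author"],
                    "10.1234/example.5678", "0378-5955"))
```

Whether a given reference manager actually reads these fields is something you 
would have to test with real files.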


You might consider making your articles conform to the PDF/A standard (probably 
PDF/A-2b). Other enhancements can be done at a later stage (e.g. PDF/UA 
compliance).

I suggest using a common PDF SDK, such as Adobe, iText, or Foxit.

Best,
Yan

On 3/26/19, 10:34 AM, "Code for Libraries on behalf of Custer, Mark" 
<[email protected] on behalf of [email protected]> wrote:

    Dear Jason,
    
    I only have experience with creating PDFs using Apache FOP, but you can 
embed metadata in a PDF file.  One approach is to use Adobe's XMP 
(https://www.adobe.com/products/xmp.html) standard, which is also an ISO 
standard.  
    
    Have you tried adding XMP to your PDFs to see what sort of support is 
available from Zotero, Mendeley, etc?  I looked at one example PDF from 
brit.org but it doesn't appear to include any embedded metadata, so that might be 
a good place to start.  Also, another great thing about a tool like Apache FOP 
is that you can utilize it to help ensure that the resulting PDF meets 
accessibility standards and/or guidelines 
(https://xmlgraphics.apache.org/fop/2.3/accessibility.html), such as PDF/UA.  
    
    In any event, I'd love to hear more about what approach you take once you 
find out what works best.
    
    All my best,
    
    Mark
    
    
    
    -----Original Message-----
    From: Code for Libraries [mailto:[email protected]] On Behalf Of 
Jason Best
    Sent: Tuesday, 26 March, 2019 12:54 PM
    To: [email protected]
    Subject: [CODE4LIB] Best practices for improving metadata extractability 
from journal articles?
    
    Hello,
    I’m working with our journal to improve the quality of the metadata that 
can be extracted from PDFs of individual journal articles by reference 
management software like Zotero, Mendeley, EndNote, etc. The only description 
I’ve found of the metadata extraction process is from Zotero 
(https://www.zotero.org/support/retrieve_pdf_metadata)
 which "sends the first few pages of a PDF to the web service, which uses a 
variety of extraction algorithms and known metadata from CrossRef, paired with 
DOI and ISBN lookups, to build a parent item for the PDF". What I haven’t found 
yet is a description of how to format the text of a PDF to ensure that the 
article metadata can be reliably extracted by reference managers. Most of these 
journal articles were published before we were issuing DOIs (or even before 
DOIs existed) so I’ll be adding a cover page to all the PDFs with title, 
authors, issue, pages, DOI (issued retroactively), ISSN, etc. I’d like to 
format these pages in a way that ensures optimal extraction of metadata 
emphasizing of course the DOI and ISSN. In my experience, Mendeley can 
sometimes extract the article metadata fairly well even without a DOI lookup, so 
I’d like to aim for a format that is easily parsable in this way and does not 
rely 100% on a DOI lookup. Does anyone have any experience or suggestions on how 
to craft such a page to work well across different reference managers?
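
To illustrate what "easily parsable" could mean for such a page, here is a 
minimal Python sketch that pulls a DOI and an ISSN out of plain cover-page 
text. It is only an assumption about how a simple parser might behave, not a 
description of what Zotero, Mendeley, or EndNote actually do. The function 
names, the sample text, and the DOI are hypothetical; the DOI pattern is a 
commonly used loose match, and the ISSN check uses the standard mod-11 check 
digit.

```python
import re

# Loose DOI pattern: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")
# ISSN pattern: four digits, hyphen, three digits, then a digit or X.
ISSN_RE = re.compile(r"\b(\d{4})-(\d{3}[\dXx])\b")


def issn_is_valid(issn):
    """Check the ISSN mod-11 check digit (weights 8 down to 2)."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not chars[:7].isdigit():
        return False
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7] == expected


def extract_identifiers(text):
    """Return the first DOI and the first valid ISSN found in text."""
    doi_match = DOI_RE.search(text)
    # Strip sentence punctuation that the loose pattern may have swallowed.
    doi = doi_match.group(0).rstrip(".,;") if doi_match else None
    issns = ["%s-%s" % (a, b.upper()) for a, b in ISSN_RE.findall(text)]
    valid = [i for i in issns if issn_is_valid(i)]
    return {"doi": doi, "issn": valid[0] if valid else None}


if __name__ == "__main__":
    # Hypothetical cover-page text with each identifier on its own line.
    cover = "An Example Article\nDOI: 10.1234/example.5678\nISSN: 0378-5955\n"
    print(extract_identifiers(cover))
```

Putting each identifier on its own labeled line, as in the sample text, keeps 
patterns like these from running into neighboring words or punctuation.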
    
    Regards,
    Jason
    
    Jason Best
    Director of Biodiversity Informatics
    Botanical Research Institute of Texas
    1700 University Drive
    Fort Worth, Texas 76107
    
    817-332-4441 ext. 230
    
    http://www.brit.org
