Re: [Bibdesk-users] Adding biblio entries by adding PDFs "lose" diacritics

Grant Jacobs Sat, 02 May 2009 00:11:08 -0700

Christiaan & Greg,

Sorry I'm slow getting back to you on this.

Extracting a DOI from a PDF is never foolproof, and moreover some PDFs (scanned PDFs) don't contain text, only images.


I realise: I wasn't complaining, but summarising for background.

Other software that tries to extract DOIs suffers similar problems. When I investigated that software (some time ago) I found that the issue was mostly the parsing / discovery of the DOIs. The way the journals present the DOIs in the text isn't very consistent and you get the impression that they never thought that they might be scanned for computationally; perhaps this is something the DOI people need to attend to? Ideally the DOIs would be in a header field, under a standard tag, not embedded in the text for parsers to try to dig out.
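To illustrate the parsing problem, here is a minimal sketch of scanning already-extracted PDF text for DOI-like strings. The pattern and the name find_dois are my own assumptions for illustration, not what BibDesk or any other tool actually does, and the trailing-punctuation cleanup is exactly the kind of guesswork that makes this unreliable.

```python
import re

# Assumed pattern: the common "10.<registrant>/<suffix>" shape of a DOI.
# Real DOIs allow more characters than this, so treat hits as candidates only.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return candidate DOIs found in text, in order, without duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Journals often follow the DOI with punctuation; strip it.
        # Caveat: this also strips a closing bracket that is part of the DOI.
        doi = match.group(0).rstrip(".,;:)]}")
        if doi not in found:
            found.append(doi)
    return found


if __name__ == "__main__":
    sample = "Published online. doi:10.1000/xyz123. All rights reserved."
    print(find_dois(sample))
```

A scanned PDF with no text layer yields an empty string, so this returns an empty list; that case needs OCR before any parsing can happen.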

BibDesk gets the bibliography information from NCBI using their
documented query methods, but this apparently drops diacritics.

OK.

It seems that they also provide an XML format that retains the diacritics, but we don't know yet what syntax they use. Documentation about that is extremely sparse and unreadable. So perhaps in the future we will be able to use that instead and import diacritics.

It might be easier to ask NCBI if they could provide an option in the existing methods that returns the information with the diacritics intact. There is no harm in asking and they do sometimes respond to this sort of thing. I will send a note to them, but I think it is more likely to result in action if they get something from someone representing the project and who is familiar with the scheme they provide. (Email [email protected])

I know what you mean about the NCBI documentation. (Been there myself for other things they release.)

Seeing I can't use BibDesk for this, is there any other solution I might use? Command-line solutions are fine. (Well, provided they are documented!) Perhaps NCBI has "example" code using their XML-based alternative?
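For reference, a rough sketch of what such a command-line route might look like, assuming the XML alternative is the E-utilities EFetch service called with db=pubmed and retmode=xml, and assuming the records it returns carry the names in UTF-8 with diacritics intact (I have not confirmed that for every record). The function names and the choice of fields are mine, purely for illustration.

```python
import sys
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint: NCBI E-utilities EFetch.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_pubmed_xml(pmid):
    """Download the PubMed XML record for one PubMed ID (needs network access)."""
    query = urllib.parse.urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})
    with urllib.request.urlopen(EFETCH_URL + "?" + query) as response:
        return response.read()


def parse_citation(xml_bytes):
    """Pull the title and author names out of a PubMed XML record."""
    root = ET.fromstring(xml_bytes)
    article = root.find(".//Article")
    if article is None:
        raise ValueError("no Article element found in the record")
    title_element = article.find("ArticleTitle")
    # itertext() keeps text inside nested markup such as <i> or <sub>.
    title = "".join(title_element.itertext()) if title_element is not None else ""
    authors = []
    for author in article.findall("AuthorList/Author"):
        last = author.findtext("LastName", default="")
        fore = author.findtext("ForeName", default="")
        authors.append((fore + " " + last).strip())
    return {"title": title, "authors": authors}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py <PubMed ID>
    print(parse_citation(fetch_pubmed_xml(sys.argv[1])))
```

This starts from a PubMed ID rather than a DOI, so a separate lookup step would still be needed to get from the DOI in the PDF to the PubMed record, and the names would still have to be mapped to their LaTeX equivalents afterwards.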



Thanks,

Grant

Christiaan

On 30 Apr 2009, at 9:44 AM, Grant Jacobs wrote:

 Synopsis: I am looking for a way of obtaining the title, name, etc.
 from PDFs that retains the original diacritics in the names, titles,
 etc.


 I hope there is a simple solution to this that I have overlooked.


 Background:

 You can create new bibliographic entries in BibDesk by dragging PDF
 files of articles (scientific papers in my case) to the main window.
 I presume what happens is that BibDesk extracts the DOI from the file
 and uses this to obtain the information (authors, title, abstract,
 etc.) from the internet. This is an excellent feature, even though it
 isn't foolproof: it sometimes seems to simply fail despite there
 being a DOI in the article.


 Problem:

 However there is a catch! Despite BibDesk being able to handle
 diacritics (the accents or cedilla added to letters in some languages
 to indicate pronunciation differences), these are "dropped" somewhere
 along the way and the resulting bibliographic entries lack them.


 A little testing:

 This seems to apply to all articles. I've tried different journals,
 and it's always the same, no diacritics.

 The articles at PubMed or the original sources the DOIs point to have
 the diacritics in the authors' names, etc., even though the
 downloaded information obtained from the DOI has stripped them out.


 Queries:

 Is it that once the DOI information is obtained, the characters are
 "reduced" to their "plain" ASCII equivalents?

 Is there some option or something that I need to set to enable this
 to stop happening so that I might receive the names with their
 diacritics? (Or, rather, the internally corrected form; I understand
 that internally they are mapped into LaTeX equivalents.)

 Grant

_______________________________________________
Bibdesk-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bibdesk-users
