On 2 May 2009, at 9:10 AM, Grant Jacobs wrote:
Christiaan & Greg,
Sorry I'm slow getting back to you on this.
Extracting a DOI from a PDF is never foolproof, and moreover some
PDFs (scanned PDFs) don't contain text, only images.
I realise: I wasn't complaining, but summarising for background.
Other software that tries to extract DOIs suffers similar problems.
When I investigated that software (some time ago) I found that the
issue was mostly the parsing / discovery of the DOIs. The way the
journals present the DOIs in the text isn't very consistent and you
get the impression that they never thought that they might be
scanned-for computationally; perhaps this is something the DOI
people need to attend to? Ideally the DOIs would be in a header
field, under a standard tag, not embedded in the text for parsers to
try to dig out.
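
To illustrate the parsing problem described above, here is a minimal sketch of scanning extracted text for DOI-like strings. The regular expression, the list of trailing punctuation to trim, and the sample text (including both DOIs) are assumptions made up for illustration; this is not how BibDesk or any other tool actually does it, and real DOIs can legitimately contain characters this sketch trims.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, a slash, then a
# suffix of characters that are not whitespace, quotes or angle brackets.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

# Punctuation that often trails a DOI in running text. A real DOI may end
# in one of these, so trimming is a heuristic, not a rule.
TRAILING = ".,;:)]}'"


def find_dois(text):
    """Return candidate DOIs in order of appearance, without duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        doi = match.group(0).rstrip(TRAILING)
        if doi not in found:
            found.append(doi)
    return found


# Invented sample showing three presentations a journal might use.
sample = (
    "Nucleic Acids Res. doi:10.1093/nar/gkn123. "
    "Published online (DOI 10.1000/xyz-456); see also "
    "http://dx.doi.org/10.1093/nar/gkn123"
)
print(find_dois(sample))
```

Even this small example needs a trimming heuristic and de-duplication, which is the kind of inconsistency a standard header field would avoid.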
BibDesk gets the bibliography information from NCBI using their
documented query methods, but this apparently drops diacritics.
OK.
It seems that they also provide an XML format that retains the
diacritics, but we don't know yet what syntax they use. Documentation
about that is extremely sparse and unreadable. So perhaps in the
future we will be able to use that instead and import diacritics.
It might be easier to ask NCBI if they could provide an option in
the existing methods that returns the information with the
diacritics intact. There is no harm in asking and they do sometimes
respond to this sort of thing. I will send a note to them, but I
think it is more likely to result in action if they get something
from someone representing the project and who is familiar with the
scheme they provide. (Email [email protected])
I think it has more to do with the Medline format than the return
option. So I think you can be pretty sure they won't add such an
option. They will almost certainly say you should use the XML format.
I know what you mean about the NCBI documentation. (Been there myself for
other things they release.)
Seeing as I can't use BibDesk for this, is there any other solution I
might use? Command-line solutions are fine. (Well, provided they are
documented!) Perhaps NCBI has "example" code using their XML-based
alternative?
Thanks,
Grant
I can't see sample code for the XML format on their site. Googling
around, I saw a few command-line scripts for getting PubMed/Medline
XML, mostly Python; you might try some of those.
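
As a rough idea of what such a script might look like, here is a minimal sketch that builds an E-utilities efetch request for PubMed XML and pulls the title and author names out of the result. The function names are hypothetical, the element names (Article, ArticleTitle, AuthorList, Author, LastName, ForeName) are my assumption about the PubMed XML layout and should be checked against a real record, and the sample record is invented so the parsing can be tried without network access.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmid):
    """Build an efetch URL asking for one PubMed record as XML."""
    query = urllib.parse.urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})
    return EFETCH + "?" + query


def fetch_pubmed_xml(pmid):
    """Download the record (needs network access); returns bytes."""
    with urllib.request.urlopen(efetch_url(pmid)) as response:
        return response.read()


def parse_article(xml_data):
    """Extract title and author names, keeping any diacritics as-is."""
    root = ET.fromstring(xml_data)
    article = root.find(".//Article")
    if article is None:
        raise ValueError("no Article element found")
    title_node = article.find("ArticleTitle")
    title = "".join(title_node.itertext()) if title_node is not None else ""
    authors = []
    for author in article.findall("AuthorList/Author"):
        last = author.findtext("LastName", default="")
        fore = author.findtext("ForeName", default="")
        authors.append((fore + " " + last).strip())
    return {"title": title, "authors": authors}


# Invented record, shaped like the assumed PubMed XML layout.
SAMPLE = """<PubmedArticleSet><PubmedArticle><MedlineCitation><Article>
<ArticleTitle>An invented title.</ArticleTitle>
<AuthorList>
<Author><LastName>Müller</LastName><ForeName>José</ForeName></Author>
</AuthorList>
</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""

print(parse_article(SAMPLE))
```

For a live record you would call parse_article(fetch_pubmed_xml(pmid)) with a real PubMed ID; passing the raw bytes lets the XML parser apply the encoding the document declares.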
For BibDesk the best thing would be to use the XML format. But someone
has to implement it. And as I never use it, I'm personally not
interested in spending time on it.
Christiaan
On 30 Apr 2009, at 9:44 AM, Grant Jacobs wrote:
> Synopsis: I am looking for a way of obtaining the title, name, etc.
> from PDFs that retains the original diacritics in the names, titles,
> etc.
>
>
> I hope there is a simple solution to this that I have overlooked.
>
>
> Background:
>
> You can create new bibliographic entries in BibDesk by dragging PDF
> files of articles (scientific papers in my case) to the main window.
> I presume what happens is that BibDesk extracts the DOI from the file
> and uses this to obtain the information (authors, title, abstract,
> etc.) from the internet. This is an excellent feature, even though it
> isn't foolproof: it sometimes seems to simply fail despite there
> being a DOI in the article.
>
>
> Problem:
>
> However there is a catch! Despite BibDesk being able to handle
> diacritics (the accents or cedilla added to letters in some languages
> to indicate pronunciation differences), these are "dropped" somewhere
> along the way and the resulting bibliographic entries lack them.
>
>
> A little testing:
>
> This seems to apply to all articles. I've tried different journals,
> and it's always the same, no diacritics.
>
> The articles at PubMed or the original sources the DOIs point to have
> the diacritics in the authors' names, etc., even though the
> downloaded information obtained from the DOI has stripped them out.
>
>
> Queries:
>
> Is it that once the DOI information is obtained, the characters are
> "reduced" to their "plain" ASCII equivalents?
>
> Is there some option or something that I need to set to stop this
> happening, so that I might receive the names with their
> diacritics? (Or, rather, the internally corrected form; I understand
> that internally they are mapped into LaTeX equivalents.)
>
>
> Grant
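
For reference, the two outcomes contrasted in the queries above (letters "reduced" to plain ASCII versus letters mapped to LaTeX equivalents) can be sketched as follows. This is only an illustration using Unicode decomposition: the accent table is a small hand-picked subset, the function names are hypothetical, and it is not BibDesk's actual conversion.

```python
import unicodedata

# Partial, illustrative table: combining mark -> LaTeX accent command.
LATEX_ACCENTS = {
    "\u0300": "`",   # grave
    "\u0301": "'",   # acute
    "\u0302": "^",   # circumflex
    "\u0303": "~",   # tilde
    "\u0308": '"',   # diaeresis / umlaut
    "\u0327": "c",   # cedilla
}


def strip_diacritics(text):
    """Drop combining marks, leaving only the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_latex(text):
    """Replace accented letters covered by the table with LaTeX commands."""
    out = []
    # Compose first so each accented letter is a single character.
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        if len(marks) == 1 and marks in LATEX_ACCENTS:
            out.append("{\\%s{%s}}" % (LATEX_ACCENTS[marks], base))
        else:
            out.append(ch)  # unaccented, or not covered by the table
    return "".join(out)


name = "François Müller"
print(strip_diacritics(name))   # the lossy "plain" form
print(to_latex(name))           # the LaTeX-encoded form
```

The first function loses information that cannot be recovered afterwards, whereas the second keeps it in a form BibTeX can typeset, which is why the point at which the diacritics are dropped matters.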
--
Bibdesk-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bibdesk-users