Re: [Bibdesk-develop] Feature: Dropped PDFs parsed for dois, then BibItem generated from PubMed

Michael McCracken Wed, 14 Jan 2009 06:45:28 -0800

On Tue, Jan 13, 2009 at 5:54 PM, Maxwell, Adam R <adam.maxw...@pnl.gov> wrote:
> On 01/13/09 13:38, "Gregory Jefferis" <jeffe...@gmail.com> wrote:
>
>> I have written some code to parse dropped pdfs for DOIs and generate a new
>> BibItem if they can be found on PubMed.  This uses PDFKit to generate a text
>> representation of the first two pages of the pdf followed by a regex search
>> for DOIs.  The DOI regexes have been tested by me for a while in a
>> standalone program and are quite forgiving.
>
> This is really just to make the rest of us envious of biologists for having
> such a great database service...


I note that DOIs seem to be cropping up on most recent IEEE and ACM
PDFs, and you can resolve them at dx.doi.org just fine...

Interestingly, though, the IEEE chooses meaningful DOIs like
"10.1109/HOTI.2008.11" (HOTI = Hot Interconnects),
whereas the ACM DL's are awful and devoid of meaning: 10.1145/1457246.1457247

So yes, please, let's find a way to make this work with other sources!
DOI parsing could be a killer feature.
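
To make the idea concrete, here is a rough sketch of pulling DOIs out of
already-extracted PDF text and turning them into dx.doi.org links. This is
not Greg's code: the pattern and the names find_dois and resolver_url are my
own assumptions, and the regex is deliberately loose rather than a complete
DOI grammar.

```python
import re

# Hypothetical, deliberately loose pattern: "10.", a 4-9 digit registrant
# code, "/", then a run of characters up to whitespace, a quote or an
# angle bracket.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return the DOIs found in text, in order of appearance, no duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Drop trailing sentence punctuation. A DOI can legally end in ")",
        # so this is a heuristic, not a rule.
        doi = match.group(0).rstrip(".,;)")
        if doi not in found:
            found.append(doi)
    return found


def resolver_url(doi):
    """Build a dx.doi.org link for a DOI string."""
    return "http://dx.doi.org/" + doi
```

Feeding it the text of the first page or two should be enough for most
recent IEEE and ACM papers, though that is a guess on my part.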

>> A second feature allows an optional external script to be called before full
>> text doi parsing that can rapidly check a PDF's attributes for a doi or
>> check if the pdf name conforms to certain patterns typical of Elsevier or
>> Nature Publishing Group journals.  This speeds up addition of some PDFs. I
>> guess this would make something hackable by knowledgeable end users.  I have
>> such a script that I have been using for a while.
>
> If there's a way to do this with a script hook, that would be preferable;
> otherwise, I think a real plugin might be a better approach from a security
> standpoint...
>
>> 0) Does this seem a reasonable addition?  Have I missed any similar
>> functionality?
>
> The only thing similar is using filename as PMID, and it looks like you
> discovered the category creating a BibItem by PMID.
>
>> 1) What's the best way to share this for testing?
>
> If it gets checked in, it'll be in the next nightly build.  I've posted test
> builds for stuff like this on my personal web space, though, just to get
> user feedback (sometimes backfires).
>
>> 2) Would anyone be prepared to review my code before (or after) I commit to
>> trunk - I confess I'm a little sketchy on Obj C memory management, so there
>> is always the possibility of a missed release.  It's ~100 lines of code.
>
> Minor nits:
>
> + (id)itemByParsingPdf:(NSString *)pdfPath UsingExternalScript:(NSString *)scriptPath;
>
> Downcase the first letter of usingExternalScript in the method signature.
>
> You're leaking a PDFDocument and NSTask (based on my quick read).  FWIW,
> clang static analysis is a great way to catch memory management bugs.
>
>> 3) Would anyone see a way to extend this kind of functionality to other
>> bibliographic sources besides PubMed? (Not that I was planning to implement
>> this as well, but rather so that anything I do could be left modular enough
>> for others to extend)
>
> A plugin that returns a parseable data type (RIS/BibTeX/Medline) is all I
> can think of.  It would be similar to your task approach, but PubMed and
> others would all be on the same footing.  Maybe the plugin could return a
> dictionary or some proxy object that could be used to create a BibItem.
>
> Mike, any thoughts on this from a PyObjC perspective?  Acorn seems to do
> pretty well with scripting language plugins.

No specific thoughts.
There are two ways to do this: a PyObjC plugin could be a Cocoa bundle
just like any other, so BibDesk wouldn't even have to know it wasn't
loading ObjC.
Or, you write one such bundle that just loads raw Python files
(modules) that use ObjC and call those files plugins.
This is what VoodooPad does - there's one full PyObjC Cocoa bundle
that loads the 'python plugins'.
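
For the second approach, the loading side could look roughly like the sketch
below. It is plain Python with no PyObjC in it, and it is not how VoodooPad
actually does it. The load_plugins name and the convention that each plugin
file exposes a parse(text) function returning a dictionary are assumptions
made up for illustration.

```python
import importlib.util
from pathlib import Path


def load_plugins(plugin_dir):
    """Import every .py file in plugin_dir and return {file stem: module}.

    Only modules that define a callable named parse (a hypothetical
    entry point) are kept.
    """
    plugins = {}
    for path in sorted(Path(plugin_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, str(path))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the plugin file's top-level code
        if callable(getattr(module, "parse", None)):
            plugins[path.stem] = module
    return plugins
```

A plugin would then just be a file dropped into that directory, and the host
would call plugins[name].parse(text) and build an item from whatever
dictionary comes back. Note that importing a file executes it, so the
security concern Adam raised applies here too.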

It occurs to me that plugins written in a scripting language change
the balance of the plugin argument.
If you're writing an ObjC plugin, I still think you might as well be
just building the full project.
But if you're writing (say) a new Web group parser in Python, you
don't need Xcode at all, so maybe things like that should be plugins.

-mike


>> 4) Is there a way that this functionality could be called when a PDF is
>> viewed in BibDesk's built in browser (eg by browsing a PDF at a journal's
>> website)?
>
> In the web group?  You could check for the PDF mime type and possibly do
> something clever.  It might block for a while, depending on your PDF.
>
> --
> Adam
>
>
>
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by:
> SourcForge Community
> SourceForge wants to tell your story.
> http://p.sf.net/sfu/sf-spreadtheword
> _______________________________________________
> Bibdesk-develop mailing list
> Bibdesk-develop@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bibdesk-develop
>



-- 
Michael McCracken
UCSD CSE PhD Candidate
research: http://www.cse.ucsd.edu/~mmccrack/
misc: http://michael-mccracken.net/wp/
