Re: [CODE4LIB] Reference string parsing software available: ParsCit v080402

MJ Suhonos Mon, 17 Nov 2008 02:50:34 -0800

Hi Jonathan,

PS: And indeed, mapping to OpenURL 1.0 is _exactly_ what I need to do. Sounds like I should look into L8X?

There is a demo/testing site at http://www.lemon8.org ; you might want to try playing around there with some citations to get a feel for how it works without having to download or install anything.

It would be convenient if there were a way to choose which parsers to use with L8X, via an API or configuration if I install the software locally. I'm not sure I'll need to pass the citation to _all_ of them. I am going to be doing this in realtime while the user is waiting, so speed matters. But just ParsCit alone isn't doing the job, perhaps ParsCit+regex plus maybe one more would be good enough.

Absolutely -- setting a list of default parsers to use, and the ability to turn them on/off on-the-fly (ie. while editing any particular citation) is something that's been on the to-do list for a while. I'm hoping to have it done in the next week or two.

I should add that having just added ParsCit, I've actually found that it doesn't do nearly as good a job as some of the other parsers, but that may just be on the citation formats that I happen to work with. Part of the way L8X is designed is to assign a simple statistical score to estimate how accurately each parser performs; one feature I've been planning is to simply allow a threshold to ignore results from parsers which have done a poor job on that particular citation.
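
To make the threshold idea concrete, here is a minimal sketch in Python. It is not L8X code (L8X is written in PHP): the `ParserResult` structure, the 0-to-1 score scale, the 0.5 default cutoff, the parser names, and the sample scores are all assumptions invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ParserResult:
    """Output of one citation parser (hypothetical structure)."""
    parser: str      # parser name (made up for this example)
    score: float     # estimated accuracy; a 0..1 scale is assumed here
    fields: dict = field(default_factory=dict)


def filter_by_threshold(results, threshold=0.5):
    """Keep only the results whose score meets the threshold."""
    return [r for r in results if r.score >= threshold]


# Illustrative values only; these are not measured scores.
results = [
    ParserResult("parser_a", 0.35, {"volume": "12"}),
    ParserResult("parser_b", 0.80, {"volume": "12", "issue": "3"}),
    ParserResult("parser_c", 0.60, {"spage": "45"}),
]
kept = filter_by_threshold(results)
print([r.parser for r in kept])
```

Because the filter runs per citation, a parser that scores poorly on one citation format can still contribute on another.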

There is some additional functionality to take a parsed citation and look it up in a number of online indexes, and attempt to fetch "correct" information, both to supplement, say, an incomplete citation, and provide an additional level of quality improvement, but that's a somewhat more complex topic that I'm hoping to make the subject of a submission to the Code4Lib journal. :-)

MJ

MJ Suhonos <[EMAIL PROTECTED]> 11/14/08 3:18 PM >>>

Hi all,

John, the supplemented approach you describe is how we go about it in
our Lemon8-XML (L8X) software (http://pkp.sfu.ca/lemon8). L8X handles
parsing by passing the original unparsed string to a number of
different parsers in turn (FreeCite, each of the 3 Paracite parsers,
and a home-grown regex parser), doing a little cleaning and
normalization, and then handing the results to the user to select the
correct values for each element.

Most of the time, it actually does a pretty good job of detecting the
right elements -- in fact, numeric stuff like volume, issue, pages,
etc. tend to be more accurate than names and titles, mostly because of
the larger variance in the latter.  Our experience has been that
relying on a single approach (machine-learning vs. format-rule-based
vs. regular-expression) is less reliable than getting partial matches
from various approaches, and then assembling them.  In this case, the
whole is in fact greater than the sum of the parts.
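
As a rough sketch of what "assembling partial matches" could look like, the Python below merges the field dictionaries returned by several parsers. Note the substitution: L8X leaves the final choice to the user, whereas this sketch uses a simple majority vote per field instead. The function name `merge_results` and the sample values are made up; the field names follow the OpenURL 1.0 journal keys (`atitle`, `volume`, `issue`, `spage`).

```python
from collections import Counter


def merge_results(parsed):
    """Pick, for each field, the value that the most parsers agree on.

    `parsed` is a list of dicts, one per parser, mapping field name to
    value. Ties go to the value that was seen first.
    """
    votes = {}
    for result in parsed:
        for name, value in result.items():
            # Light normalization: collapse runs of whitespace.
            cleaned = " ".join(str(value).split())
            if cleaned:
                votes.setdefault(name, Counter())[cleaned] += 1
    return {name: counter.most_common(1)[0][0]
            for name, counter in votes.items()}


# Made-up output from three parsers for the same citation string.
parsed = [
    {"atitle": "Citation parsing", "volume": "12", "spage": "45"},
    {"atitle": "Citation  parsing", "volume": "12"},
    {"volume": "21", "issue": "3"},
]
print(merge_results(parsed))
```

No single parser here returns all four fields, but the merged record has them, and the outlier volume "21" is outvoted.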

I haven't added the ParsCit web service explicitly since a SOAP-based
interface is a bit more cumbersome in PHP than FreeCite's POST-type
interface, but I'll make a point of doing so now.  Incrementally
adding services that all map to the same citation elements (we use the
OpenURL 1.0 fields, with a few aberrations) means it's very easy to
increase the accuracy by simply adding another parsing plugin/service.

You'd have to pull out the relevant classes from L8X to get a
standalone parser, but since this is one of the more appealing aspects
of the software for many people, we're looking at making a simple API
in L8X to just do the citation parsing, possibly without the UI to
take it from semi-automated to completely automatic.

MJ

On 14-Nov-08, at 12:07 AM, Jonathan Rochkind wrote:

Thanks Min, this is a great project, that I keep trying to find time
to investigate more. Don't apologize for keeping us updated, please
continue to!

Do you know if any of the improvements have improved detection of
volume/issue/page# information? For what I want to use it for,
reasonably accurate parsing of volume/issue/page# is needed, and so
far whenever I've looked at demos, this seems to be something that
all of these machine-learning-type approaches do pretty awfully at.
(I wonder if you are not including this in your training much,
because it isn't necessary for your purposes to have
volume/issue/page#?)

I also have wondered if it would make sense to take a machine-
learning-type approach to begin with, but then supplement it with
formal-rule-based parsing to attempt to get vol/issue/page#
according to common patterns?
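
A minimal sketch of that kind of supplement, in Python: take whatever the first-pass parser returned and fill only the empty volume/issue/page fields from a regular expression. The pattern covers just one common layout ("12(3): 45-67" or "12(3), 45-67"), and the function name `supplement`, the field names, and the sample citation are assumptions for illustration, not part of any of the tools discussed.

```python
import re

# One common layout: volume, optional "(issue)", then ":" or ",",
# then a page range with a hyphen or an en dash.
VOL_ISSUE_PAGES = re.compile(
    r"(?P<volume>\d+)\s*"
    r"(?:\((?P<issue>[^)]+)\))?\s*"
    r"[:,]\s*"
    r"(?P<spage>\d+)\s*[-\u2013]\s*(?P<epage>\d+)"
)


def supplement(fields, citation):
    """Fill in volume/issue/pages that the first-pass parser left empty.

    Values already present in `fields` are never overwritten.
    """
    merged = dict(fields)
    match = VOL_ISSUE_PAGES.search(citation)
    if match:
        for name, value in match.groupdict().items():
            if value and not merged.get(name):
                merged[name] = value
    return merged


# Made-up citation and first-pass output.
citation = "Smith, J. (2001). A title. Journal of Things, 12(3), 45-67."
first_pass = {"atitle": "A title", "volume": ""}
print(supplement(first_pass, citation))
```

A real rule set would need many more patterns (for example "vol. 12, no. 3, pp. 45-67"), tried in order of specificity.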

I don't have too much time to try to work on this myself, but if anyone
who is working on these various citation parsing efforts could
improve volume/issue/page# to a reasonable level, it would make the
libraries useful for a much greater range of applications.

Jonathan

Min-Yen Kan <[EMAIL PROTECTED]> 11/13/08 8:30 PM >>>

Dear all:

(Sorry to resurrect an old thread...)

We've seen the release of several new freely available reference
string parsers in recent months.
The ParsCit team has also been updating the ParsCit package, and is
happy to announce a new version that improves on classification
accuracy, and adds training data in Italian, German and French and for
a different discipline of humanities. We've updated the classification
model to reflect these changes, which should be as easy to use as the
original ParsCit.

You can either download a copy of ParsCit for your own use, or use it
through a web services interface. We welcome your feedback and hope
that if you use ParsCit or any other freely available reference string
parsing tool, you can contribute annotated data to help make these
models more robust.

ParsCit is available from: http://wing.comp.nus.edu.sg/parsCit/
Current Distribution: http://wing.comp.nus.edu.sg/parsCit/parscit-080917.zip

and is a joint collaboration between Pennsylvania State University
(the folks who brought you CiteSeerX) as well as the National
University of Singapore.

Cheers,

Min

P.S. Integration with other freely available parsing systems is
hopefully in the works too. If you have something to contribute, we'll
be happy to commit some bandwidth into getting it integrated with
ParsCit.
