+1.  Right now, we can incorporate other projects by simply running the
same script on other XML dumps.  We'll likely want to set up a job that
tracks the creation of new historical dumps so that we can produce new,
updated ID dumps ASAP.

If we drop the requirement of knowing when a citation was first added to an
article, we could use the externallinks tables.  That would allow us to
generate these datasets much faster.  I'd only like to pursue this option
if we find that processing the dumps on a monthly basis becomes difficult.
Right now, it doesn't look like that will be the case.
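For context, the dump-processing approach boils down to scanning each
revision's wikitext for identifier patterns and recording the earliest
revision in which each one appears.  A minimal sketch of the extraction
step (the regexes and function names here are illustrative, not the
actual script's):

```python
import re

# Illustrative patterns only; the real extractor handles more citation
# template variants than these.
PMID_RE = re.compile(r"\bpmid\s*[=:]?\s*(\d{1,8})\b", re.IGNORECASE)
PMCID_RE = re.compile(r"\bpmc\s*[=:]?\s*(?:PMC)?(\d{1,8})\b", re.IGNORECASE)

def extract_ids(wikitext):
    """Return (PMIDs, PMCIDs) cited in one revision's wikitext."""
    pmids = set(PMID_RE.findall(wikitext))
    pmcids = set(PMCID_RE.findall(wikitext))
    return pmids, pmcids

text = "{{cite journal | pmid = 12345678 | pmc = 3539452 }}"
print(extract_ids(text))  # ({'12345678'}, {'3539452'})
```

Running this over revisions in chronological order and keeping the first
revision where each ID shows up is what makes the full-history dumps
necessary; the externallinks table only reflects the current state.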

The realtime reporting project sounds interesting.  Is there a project page
or some code we could review?

-Aaron

On Tue, Feb 3, 2015 at 9:28 AM, Dario Taraborelli <
[email protected]> wrote:

> Hi Nemo
>
> >> The dataset currently includes the first known occurrence of a PMID or
> PMCID citation in an English Wikipedia article and the associated revision
> metadata, based on the most recent complete content dump of English
> Wikipedia.
> >
> > Do you accept patches for inclusion of other wikis? The easiest way to
> include all Wikimedia projects is probably to use the external links table;
> we can see how big a difference there is.
>
> we definitely welcome patches and pull requests [1]. This is our current
> priority list (subject to other priorities unrelated to this project):
>
> 1. add other identifiers (DOIs are next)
> 2. include other languages / projects
> 3. generate recurring reports (e.g. once a month)
>
> Aaron, does that sound about right? Also note that other people on this
> list (Max, Daniel) are working on real-time reporting of DOI citations in
> collaboration with CrossRef.
>
> D
>
> [1]
> https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia
_______________________________________________
OpenAccess mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/openaccess