Yea, that sounds great. I appreciate the help.

On Wed, Nov 14, 2012 at 2:34 AM, helix84 <[email protected]> wrote:

> On Tue, Nov 13, 2012 at 10:13 PM, Nick <[email protected]> wrote:
> > Has anyone gone about the process of collecting open-access articles
> (from
> > your university's faculty) from online repositories or databases and then
> > bringing all that data to your institutional repository? I was looking
> > around on Google and could find very little information about such
> projects.
> > Can anyone provide a broad and simple overview of the process and scripts
> > that they used?
>
> Hi Nick,
>
> we've been periodically importing data from Scopus and WoS. They allow
> you to export the results of a search query - in our case it's
> different forms of the institution name, but it could also be a list
> of authors, or anything, really. I then created some scripts which
> convert the exported CSVs into a single CSV in a format that DSpace
> can ingest using the Batch Metadata Editor. Importantly, the UT and
> Scopus ID identifiers are included, and my script checks whether the
> records are already in the repository (a lookup against the full
> export of the repository in CSV format) and eliminates those from the
> current round of import.
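To illustrate the de-duplication step described above (this is a sketch, not helix84's actual script; the column names `SCOPUS_ID`, `dc.identifier.scopus`, and `dc.identifier.ut` are hypothetical and would need to match your exports and metadata schema):

```python
import csv

def known_identifiers(dspace_export_path):
    """Collect the Scopus IDs and WoS UT numbers already present in the
    repository, from a full DSpace metadata export in CSV format."""
    ids = set()
    with open(dspace_export_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Hypothetical metadata field names; adjust to your schema.
            for field in ("dc.identifier.scopus", "dc.identifier.ut"):
                value = row.get(field, "").strip()
                if value:
                    ids.add(value)
    return ids

def new_records(export_path, id_column, known):
    """Yield rows from a Scopus or WoS CSV export whose identifier is
    not already in the repository export."""
    with open(export_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get(id_column, "").strip() not in known:
                yield row
```

The same two functions cover both databases: call `new_records` once with the Scopus export and its ID column, and once with the WoS export and the UT column.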
>
> Then you _have to_ do some manual work. Namely, finding duplicates by
> name between the two databases (and between each of them and the
> records already in the repository), where it's not always an exact
> match. I've been thinking of making this easier by finding candidates
> using the Levenshtein distance between titles. The next piece of
> manual editing is fixing author names, which usually lose most
> non-ASCII characters. You also fix other metadata such as volume,
> issue, starting and ending page, conference date format, etc., and
> add links to the full text online and the DOI. Finally, you have to
> ask the authors for any preprints or postprints and upload the
> bitstreams to DSpace manually (after the import).
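The Levenshtein idea mentioned above could be sketched like this (a plain dynamic-programming edit distance plus a naive pairwise scan; the `max_distance` threshold is an assumption you would tune for your data):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + (ca != cb),    # substitution
            ))
        previous = current
    return previous[-1]

def duplicate_candidates(titles_a, titles_b, max_distance=5):
    """Pair up titles from the two databases whose edit distance is
    small; these pairs are candidates for manual review, not
    automatic merges."""
    pairs = []
    for ta in titles_a:
        for tb in titles_b:
            if levenshtein(ta.lower(), tb.lower()) <= max_distance:
                pairs.append((ta, tb))
    return pairs
```

For a few thousand records per round the O(n*m) scan is fine; anything bigger would want blocking (e.g. comparing only titles with the same first word) before the distance computation.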
>
> I also enrich the data from other sources, like the SNIP and SJR
> citation metrics.
>
> My scripts are quite customized, so you'd have to edit them a little
> to suit your needs, but I can share them with you if you want. They're
> written in Python and work only with CSV files (no connection to the
> repository is necessary) - the Scopus export, the WoS export, and the
> last complete export from DSpace - and they output the DSpace import
> CSV. I can say it works to our satisfaction, and almost everything
> that can be automated is (except the lookup and export from the
> websites, which could be done with something like Mechanize). Does
> that sound like something you're interested in?
>
>
> Regards,
> ~~helix84
>
> Compulsory reading: DSpace Mailing List Etiquette
> https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
>
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
