There's a Very Large Catalog that includes the holdings of many libraries. Links from that VLC into our OPAC (aka Walter) assume that our MARC records include the relevant VLC#s (or sometimes an ISBN), which they don't always do, often because many of our VLC numbers appear to be obsolete. So, when Walter can't find anything for a given VLC#, he screen scrapes the VLC to find the title and author (if there is one) so that he can then send you into the OPAC with a keyword search that might work.
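A minimal sketch of that kind of fallback, assuming a hypothetical record-page layout and OPAC search URL (the markup, class names, and `opac.example.edu` pattern are invented; none of them come from Walter itself):

```python
import re
import urllib.parse

def scrape_title_author(html):
    """Pull title and (optional) author out of a fetched VLC record page.

    The <h1 class="title"> / <span class="author"> markup is a hypothetical
    stand-in for whatever the real record page uses.
    """
    title = re.search(r'<h1 class="title">([^<]+)</h1>', html)
    author = re.search(r'<span class="author">([^<]+)</span>', html)
    return (title.group(1).strip() if title else None,
            author.group(1).strip() if author else None)

def keyword_search_url(title, author, opac="https://opac.example.edu/search"):
    """Build the keyword search to fall back on when the VLC# lookup fails."""
    terms = " ".join(t for t in (title, author) if t)
    return opac + "?type=keyword&q=" + urllib.parse.quote_plus(terms)

# In production you would first fetch the VLC record page, e.g. with
# urllib.request.urlopen(vlc_url).read().decode() -- stubbed here with a string.
page = '<h1 class="title">Moby-Dick</h1><span class="author">Melville, Herman</span>'
t, a = scrape_title_author(page)
print(keyword_search_url(t, a))
```

The point of the two-step shape: the scrape only has to recover enough free text to seed a keyword search, so it degrades gracefully when the record has no author at all.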
I've been keeping a log of titles that have required this intervention and am happy to report that it works well. When it doesn't, we now have the information we need to follow up and remove our holdings from the Very Large Catalog.

Guy

*Guy Dobson*
Systems Librarian | Library
Drew University | 36 Madison Ave | Madison, NJ 07940
(973) 408-3207 | drew.edu

On Tue, Nov 28, 2017 at 3:13 PM, Kenny Ketner <kenny.ket...@gmail.com> wrote:

> Brad et al,
>
> We use wget scripts to back up our Internet Archive pages, which, oddly
> enough, are the instructions given by the Internet Archive itself. :/
>
> Kenny Ketner
> Information Products Lead
> Montana State Library
> 406-444-2870
> kket...@mt.gov
> kennyketner.com
>
> On Tue, Nov 28, 2017 at 12:31 PM, Brett <brett.l.willi...@gmail.com> wrote:
>
> > Yes, I did ask, and ask, and ask, and waited for 2 months. There was
> > something political going on internally with that group that was well
> > beyond my pay grade.
> >
> > I did explain the potential problems to my boss and she was providing
> > cover.
> >
> > I did it in batches, as Google Sheets limits the amount of IMPORTXML that
> > you can do in a 24-hour span, so I wasn't hammering anyone's web server
> > into oblivion.
> >
> > It's funny: I actually had to do a fair amount to get the old V1 LibGuides
> > link checker to stop hammering my ILS into going offline back in 2010-2011.
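The batching tactic Brett describes above (space the queries out so neither the quota nor the target server suffers) can be sketched like this; the batch size, pause length, and `lookup` callable are illustrative numbers and names, not anything Google or any ILS documents:

```python
import time

def batches(items, size):
    """Yield successive chunks of `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_throttled(call_numbers, lookup, batch_size=50, pause=60):
    """Run `lookup` over call numbers in small batches, pausing between
    batches so no one's web server gets hammered into oblivion.

    batch_size/pause are made-up defaults; tune them to the service's
    actual rate limits.
    """
    results = {}
    for chunk in batches(call_numbers, batch_size):
        for cn in chunk:
            results[cn] = lookup(cn)
        time.sleep(pause)  # be polite between batches
    return results
```

Injecting `lookup` as a parameter also makes the throttling logic testable without any network traffic.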
> >
> > On Tue, Nov 28, 2017 at 2:18 PM, Bill Dueber <b...@dueber.com> wrote:
> >
> > > Brett, did you ask the folks at the Large University Library if they
> > > could set something up for you? I don't have a good sense of how other
> > > institutions deal with things like this.
> > >
> > > In any case, I know I'd much rather talk about setting up an API or a
> > > nightly dump or something rather than have my analytics (and bandwidth!)
> > > blown by a screen scraper. I might say "no," but at least it would be an
> > > informed "no" :-)
> > >
> > > On Tue, Nov 28, 2017 at 2:08 PM, Brett <brett.l.willi...@gmail.com> wrote:
> > >
> > > > I leveraged the IMPORTXML() and XPath features in Google Sheets to pull
> > > > information from a large university website to help create a set of
> > > > weeding lists for a branch campus. They needed extra details about what
> > > > was in off-site storage and what was held at the central campus library.
> > > >
> > > > This was very much like Jason's FIFO API: the central reporting group
> > > > had sent me a spreadsheet with horrible data that I would have had to
> > > > sort out almost completely manually, but the call numbers were pristine.
> > > > I used the call numbers as a key to query the catalog with limits for
> > > > each campus I needed to check, and then it dumped all of the necessary
> > > > content (holdings, dates, etc.) into the spreadsheet.
> > > >
> > > > I've also used Feed43 as a way to modify certain RSS feeds and scrape
> > > > websites to display only the content I want.
> > > >
> > > > Brett Williams
> > > >
> > > > On Tue, Nov 28, 2017 at 1:24 PM, Brad Coffield <
> > > > bcoffield.libr...@gmail.com> wrote:
> > > >
> > > > > I think there's likely a lot of possibilities out there and was hoping
> > > > > to hear examples of web scraping for libraries.
> > > > > Your example might just inspire me or another reader to do something
> > > > > similar. At the very least, the ideas will be interesting!
> > > > >
> > > > > Brad
> > > > >
> > > > > --
> > > > > Brad Coffield, MLIS
> > > > > Assistant Information and Web Services Librarian
> > > > > Saint Francis University
> > > > > 814-472-3315
> > > > > bcoffi...@francis.edu
> > >
> > > --
> > > Bill Dueber
> > > Library Systems Programmer
> > > University of Michigan Library
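Several of the messages above come down to the same move: key on one clean field (the call number) and pull structured bits out of result pages with XPath. In Google Sheets that's a formula of the form `=IMPORTXML(url, xpath)`. A Python sketch of the same idea, where the catalog URL pattern, the `td` markup, and the XPath are all invented for illustration:

```python
import urllib.parse
import xml.etree.ElementTree as ET

def catalog_url(call_number, campus):
    """Build a per-campus catalog query keyed on call number
    (the URL pattern here is hypothetical)."""
    return ("https://catalog.example.edu/search?campus=" + campus +
            "&cn=" + urllib.parse.quote_plus(call_number))

def extract_holdings(page_xml):
    """Python analogue of =IMPORTXML(url, "//td[@class='holdings']"):
    pull every holdings cell out of a result page. ElementTree only
    handles well-formed markup; real-world HTML may need a more
    forgiving parser.
    """
    root = ET.fromstring(page_xml)
    return [td.text.strip() for td in root.findall(".//td[@class='holdings']")]

# Stand-in for a fetched results page (normally you'd fetch catalog_url(...)).
sample = """<table>
  <tr><td class="cn">QA76.9 .D3</td><td class="holdings">Main: 2 copies</td></tr>
  <tr><td class="cn">QA76.9 .D3</td><td class="holdings">Storage: 1 copy</td></tr>
</table>"""
print(extract_holdings(sample))
```

As Bill's reply notes, an API or a nightly dump is kinder to everyone than looping this over thousands of call numbers; the sketch is for the cases where scraping is genuinely the only option.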