[CODE4LIB] Greenstone: tweaking Lucene indexing
Hello, Sorry for any cross-posting annoyance. I have a request for a Greenstone collection I'm working on, to add context snippets to search results; for example a search for yak culture might return this in the list of results: ... addressing the fine points of <strong>yak culture</strong>, the zoosociologists took into account ... Sounds like a pretty basic feature, say our sponsors, and I agree. (Ah, it's also an old Trac ticket at http://trac.greenstone.org/ticket/444) I see that GS out-of-the-box is set *not* to store the fulltext in the index, which seems to be a prerequisite for this kind of thing, as in http://bit.ly/ljNkL . Has anyone modified the Lucene indexing wrapper locally to do this? Given that we don't have any Java coders on staff, I've started porting the Lucene wrapper to PHP for use with a custombuilder.pl and Zend_Search_Lucene. I already have a PHP frontend, so adjusting that to display the results shouldn't be a problem; OTOH because the frontend is PHP, I'm restricted to using buildtype lucene, or something else with good PHP support. Many thanks, -- Yitzchak Schaffer Systems Manager Touro College Libraries 33 West 23rd Street New York, NY 10010 Tel (212) 463-0400 x5230 Fax (212) 627-3197 Email yitzchak.schaf...@gmx.com
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
The Lucene Highlighter doesn't require that the text you want highlighted be stored. In fact, you can pass in any arbitrary text to the Highlighter. See the various getBestFragments from the Highlighter class: http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/highlight/Highlighter.html Erik On Sep 29, 2009, at 7:01 AM, Yitzchak Schaffer wrote: [trimmed]
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
Erik Hatcher wrote: The Lucene Highlighter doesn't require that the text you want highlighted be stored. In fact, you can pass in any arbitrary text to the Highlighter. Thanks Erik, What I'm looking for is to return the context of the search result, not just the ID of the containing document - e.g. when all I input is yak culture, I get back the context from the document as a search result, without having to retrieve the doc itself: ... addressing the fine points of <strong>yak culture</strong>, the zoosociologists took into account ... GS out of the box does not appear to support this, as it does not store the fulltext in the index. So yes, I can highlight stuff, but as it stands, I don't have the text to work with. IANA Lucene guru, so correct me if I misunderstand. -- Yitzchak Schaffer Systems Manager Touro College Libraries 33 West 23rd Street New York, NY 10010 Tel (212) 463-0400 x5230 Fax (212) 627-3197 Email yitzchak.schaf...@gmx.com
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
On Sep 29, 2009, at 7:33 AM, Yitzchak Schaffer wrote: [trimmed] GS out of the box does not appear to support this, as it does not store the fulltext in the index. So yes, I can highlight stuff, but as it stands, I don't have the text to work with. IANA Lucene guru, so correct me if I misunderstand.
I'm a bit confused then. You mentioned that somehow Zend Lucene was going to help, but if you don't have the text to highlight anywhere then the Highlighter isn't going to be of any use. Again, you don't need the full text in the Lucene index, but you do need to get it from somewhere in order to be able to highlight it. Erik
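To make the point above concrete: a context snippet can be built from any text you have in hand, wherever it came from (the index, a file, a database). The following is a minimal sketch in Python, not the Lucene Highlighter API and not Greenstone code; the function name `context_snippet` and the `width` parameter are made up for illustration, and it only handles an exact phrase match rather than an analyzed Lucene query.

```python
import html
import re


def context_snippet(text, query, width=40):
    """Return an excerpt of `text` around the first match of `query`,
    with the match wrapped in <strong> tags, or None if nothing matches.

    Hypothetical helper: plain case-insensitive phrase matching only.
    """
    match = re.search(re.escape(query), text, flags=re.IGNORECASE)
    if match is None:
        return None
    # Take up to `width` characters of context on each side of the hit.
    start = max(match.start() - width, 0)
    end = min(match.end() + width, len(text))
    # Escape the document text so only our own <strong> tags are markup.
    before = html.escape(text[start:match.start()])
    hit = html.escape(match.group(0))
    after = html.escape(text[match.end():end])
    # Mark truncation with ellipses, as in the example search result.
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(text) else ""
    return f"{prefix}{before}<strong>{hit}</strong>{after}{suffix}"
```

Whether the text passed in comes from a stored Lucene field or from the document store is the design question discussed in this thread; the snippet step itself is the same either way.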
[CODE4LIB] Bookmarking web links - authoritativeness or focused searching
I've been thinking about the role of libraries as promoters of authoritative works - helping to select and sort the plethora of information out there. And I heard another presentation about social media this morning. So I thought I'd bring up for discussion here some of the ideas I've been mulling over. Last week I sent this message to the Suggestions and Ideas forum at delicious. http://support.delicious.com/forum/comments.php?DiscussionID=3237&page=1#Item_0 The basic idea is to develop a delicious network of librarians. Or a network of faculty members. Then have one login whose network included those users, and share that login so that lots of people could share that network. Delicious responded that we could have a wiki where people posted their delicious names so that others could add them to their personal networks, but that doesn't scale up very well. Or another project I've toyed with, involving focused searching: I started with Robert Teeter's index to Great Books lists. http://www.interleaves.org/~rteeter/grtalphaa.html. I've almost completed pulling them into a MySQL database so that I could sort the titles by the number of Great Books lists that mention each title. Then I thought about how one could do focused searching of the web, collecting pages with a title containing (best and books) or (great and books), and screen scraping title lists (you'd have to have some heuristic method of identifying the data, of course, and I'm aware what problems might arise there). But my test searches of that idea showed that one runs into a lot of commercial ephemeral lists and spurious lists. Now, you could rely on crowd-sourcing to filter out the consensus by ranking by the number of sites/cites. But I thought you might want to differentiate between the sources - .edus, libraries, etc.
So that led me to speculate about a search engine that ranked just by links from .edus, library sites, and a librarian-vetted list of .orgs, scholarly publishers, etc. I think you can limit by .edu in the linked-from in Google - I haven't tried that much. If anyone here has experience at using that technique, I'd like to hear about it. But I'm thinking now about the possibility of a search engine limited to sites cooperatively vetted by librarians, that would incorporate ranking by # links. Something more responsive than cataloging websites in our catalogs. Is anyone else thinking about these ideas? Or do you know of projects that approach this goal of leveraging librarians' vetting of authoritative sources? Cindy Harper, Systems Librarian Colgate University Libraries char...@colgate.edu 315-228-7363
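The two ranking ideas in this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `rank_titles` and `vetted_link_count` are hypothetical, the input is assumed to be already-scraped data (a mapping of list name to titles, and a list of linking URLs), and none of it reflects the actual MySQL schema or any real search engine API.

```python
from collections import Counter
from urllib.parse import urlsplit


def rank_titles(lists):
    """Rank titles by how many distinct lists mention them, most first.

    `lists` maps a list name to the titles on that list (assumed input).
    """
    counts = Counter()
    for titles in lists.values():
        # A set ensures a title repeated within one list counts once.
        counts.update(set(titles))
    # Descending count, then alphabetical so ties have a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def vetted_link_count(linking_urls, vetted_hosts=(), vetted_suffixes=(".edu",)):
    """Count inbound links whose host is on a vetted list or ends with a
    vetted suffix such as ".edu". Both lists are assumptions to be
    supplied by whoever does the vetting."""
    count = 0
    for url in linking_urls:
        host = (urlsplit(url).hostname or "").lower()
        if host in vetted_hosts or host.endswith(tuple(vetted_suffixes)):
            count += 1
    return count
```

Calling `rank_titles` on the scraped lists gives the "number of lists that mention each title" ordering, and `vetted_link_count` gives a per-page score that ignores links from unvetted hosts. Matching titles across lists (variant spellings, editions) is the hard part and is not addressed here.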
Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching
www.diigo.com is a social bookmarking site like delicious and it has added features like creating groups around specific themes and the ability to annotate the Web pages you bookmark for future reference. You might want to explore this feature and see if it is appropriate for what you envision. Kent Kent Gerber, MSLIS Digital Library Manager Bethel University St. Paul, MN phone: 651.638.6937 email: kent-ger...@bethel.edu -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Cindy Harper Sent: Tuesday, September 29, 2009 9:54 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching [trimmed]
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
Erik Hatcher wrote: I'm a bit confused then. You mentioned that somehow Zend Lucene was going to help, but if you don't have the text to highlight anywhere then the Highlighter isn't going to be of any use. Again, you don't need the full text in the Lucene index, but you do need to get it from somewhere in order to be able to highlight it. Erik, I started to port the native Greenstone Java Lucene wrapper to PHP, so I could then modify it to add this feature, as I don't know Java. This would mean using Zend Lucene for the actual indexing implementation. My question is whether anyone's already done it, in Java or otherwise. Thanks for the clarification, -- Yitzchak Schaffer Systems Manager Touro College Libraries 33 West 23rd Street New York, NY 10010 Tel (212) 463-0400 x5230 Fax (212) 627-3197 Email yitzchak.schaf...@gmx.com
Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching
It's not social bookmarking, but as far as "But I'm thinking now about the possibility of a search engine limited to sites cooperatively vetted by librarians, that would incorporate ranking by # links. Something more responsive than cataloging websites in our catalogs." goes, well, that's almost exactly what lii.org is. http://lii.org I happen to think that authority is dead dead dead as a method of measuring information worth, but that's just me. :-) Jason On Tue, Sep 29, 2009 at 10:53 AM, Cindy Harper char...@colgate.edu wrote: [trimmed] -- Follow me on Twitter! http://www.twitter.com/griffey
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
Yitzchak, are you interested in actually searching the fulltext? Or just highlighting the terms? If you're only interested in highlighting it, it might be a whole lot easier to implement this in javascript through something like jQuery: http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html That way you're not juggling mostly redundant Lucene indexes and trying to keep them synced. How are you getting your search results? Does Greenstone have some sort of search API that returns the highlighted results? Would it make a difference if you could add a field to the Lucene document (meaning would you have access to it through your PHP API to Greenstone)? If so, you could probably do this pretty easily via one of the JVM scripting languages (Groovy, JRuby, Jython, Quercus -- PHP in the JVM) so you just have the single Lucene index instead of multiple. Another approach might be to serve the Lucene index via Solr [1] or Lucene-WS (http://lucene-ws.net/) which would allow you to skip Greenstone altogether for searching. Basically, I would try to avoid going the Zend_Lucene route if at all possible. -Ross. 1. http://www.google.com/search?q=solr+on+an+existing+lucene+index&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a On Tue, Sep 29, 2009 at 11:32 AM, Yitzchak Schaffer yitzchak.schaf...@gmx.com wrote: [trimmed]
Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching
Cindy Harper wrote: I've been thinking about the role of libraries as promoter of authoritative works - helping to select and sort the plethora of information out there. And I heard another presentation about social media this morning. So I though I'd bring up for discussion here some of the ideas I've been mulling over. [...] Is anyone else thinking about these ideas? or do you know of projects that approach this goal of leveraging librarian's vetting of authoritative sources? The big problem with social media sites is that they tend towards privatising our data. Any solution needs to be both FOSS and Open Data to overcome that. Some of the veterans here will probably remember the ODP (dmoz.org) and VLib.org catalogues. Can we build on them instead of inventing another wheel? Thanks, -- MJ Ray (slef) LMS developer and webmaster at software.coop www.software.coop http://mjr.towers.org.uk IMO only: see http://mjr.towers.org.uk/email.html
Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching
On Tue, 29 Sep 2009, Cindy Harper wrote: I've been thinking about the role of libraries as promoter of authoritative works - helping to select and sort the plethora of information out there. And I heard another presentation about social media this morning. So I though I'd bring up for discussion here some of the ideas I've been mulling over. [trimmed] Is anyone else thinking about these ideas? or do you know of projects that approach this goal of leveraging librarian's vetting of authoritative sources? I don't know of any projects that specifically do what you've mentioned, but for the last few years, we've been mulling over how to store various lists and catalogs so that we could present interesting intersections of them. In my case, I deal with scientific catalogs, so it's stuff like "When was RHESSI observing the same area as TRACE?" or "When was there an X-class flare within 2 hours of a CME?" or even lack of intersections: "When were there type-II radio bursts without a CME or flare within 6 hours?" For the science catalogs, we specifically don't want to just make some sort of single ranking from each list, and it's not really easy to merge the catalogs into some form of union catalog as they're cataloging different concepts. ... and I think that there's use in library searches to keep the catalogs different, particularly when you're bringing up authority (which then gets to reputation, etc.). I'm not sure how many other people out there would try to search for "Hugo award winning novels that weren't on the New York Times best seller list", so it might not be as useful for general patron use ... unless you could give it your *own* catalog ("AFI top 100 movies ... that I don't already own") - Joe Hourcle Solar Data Analysis Center Goddard Space Flight Center
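The "intersection" and "lack of intersection" queries described above can be sketched as follows, assuming each catalog has already been reduced to a plain list of event times. This is an illustrative Python sketch only: the function names are hypothetical, it is not how the Solar Data Analysis Center stores or queries its catalogs, and the brute-force comparison would need an index or sorted merge for catalogs of any real size.

```python
from datetime import datetime, timedelta


def events_near(primary, secondary, window):
    """Primary event times with at least one secondary event within
    `window` before or after (e.g. flares within 2 hours of a CME)."""
    return [t for t in primary if any(abs(t - s) <= window for s in secondary)]


def events_isolated(primary, secondary, window):
    """Primary event times with no secondary event within `window`
    (the "lack of intersection" case)."""
    return [t for t in primary if all(abs(t - s) > window for s in secondary)]


def only_in(catalog, *others):
    """Items in `catalog` that appear in none of the other catalogs,
    e.g. award winners that were never on a given best-seller list."""
    excluded = set().union(*others)
    return [item for item in catalog if item not in excluded]
```

Keeping the catalogs as separate inputs, rather than merging them into one ranked list, matches the point made in the message: each query names which catalogs it intersects or excludes.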
Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching
I feel like a couple years ago a librarian(s?) created a Google Custom Search Engine that did exactly what you describe as focused searching, but I can't find a link any more. You can search the CSEs by scrolling down on this page (and there are a couple of links to directories, too): http://www.lib.berkeley.edu/find/types/websites.html Also, Mike Eisenberg over at the University of Washington was working on that kind of problem with some other groups...A quick search reveals that it's now called Reference Extract and it's being done in conjunction with Syracuse University (and OCLC is somehow involved). http://chronicle.com/blogPost/Librarians-Want-to-Out-Google/4365 ~Amy From: Cindy Harper [char...@colgate.edu] Sent: Tuesday, September 29, 2009 10:53 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching [trimmed]
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
Ross Singer wrote: Yitzchak, are you interested in actually searching the fulltext? Or just highlighting the terms? Sorry this wasn't clearer. Let me re-summarize, and report on a new development:
- Greenstone allows for Lucene as one of the indexing plugins
- I took advantage of this for use in our PHP frontend, EmeraldView (http://emeraldview.tourolib.org/)
- Greenstone includes a Java wrapper class for Lucene which indexes documents as the collection is built
- This wrapper class indexes but does not store the document full text; thus a search only returns document IDs of hits. This means that, in order to place search terms in context, we have to load the actual documents. I want the search API itself to return the surrounding text.
New info: I was in fact able to hack the Java to include the full text in the index. Just a matter of adding a line of code and an if statement, once I'd been immersed in the code long enough. Trying to port it to PHP (i.e. rewrite it) was instrumental in figuring out why in the world the Greenstone indexing code is structured the way it is. -- Yitzchak Schaffer Systems Manager Touro College Libraries 33 West 23rd Street New York, NY 10010 Tel (212) 463-0400 x5230 Fax (212) 627-3197 Email yitzchak.schaf...@gmx.com
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
Ross Singer wrote: Yitzchak, are you interested in actually searching the fulltext? Or just highlighting the terms? Just in case my earlier response didn't make it crystal clear: we're trying to search the fulltext, and put the search string in context within the document which includes it. -- Yitzchak Schaffer Systems Manager Touro College Libraries 33 West 23rd Street New York, NY 10010 Tel (212) 463-0400 x5230 Fax (212) 627-3197 Email yitzchak.schaf...@gmx.com
Re: [CODE4LIB] web archiving - was: Implementing OpenURL for simple web resources
At Fri, 18 Sep 2009 10:40:08 -0400, Ed Summers wrote: Hi Erik, all […] I haven't been following this thread completely, but you've taken it in an interesting direction. I think you've succinctly described the issue with using URLs as references in an academic context: that the integrity of the URL is a function of time. As John Kunze has said: Just because the URI was the last to see a resource alive doesn't mean it killed them :-) I'm sure you've seen this, but Internet Archive have a nice URL pattern for referencing a resource representation in time: http://web.archive.org/web/{year}{month}{day}{hour}{minute}{seconds}/{url} So for example you can reference Google's homepage on December 2, 1998 at 23:04:10 with this URL: http://web.archive.org/web/19981202230410/http://www.google.com/ As Mike's email points out this is only good as long as Internet Archive is up and running the way we expect it to. Having any one organization shoulder this burden isn't particularly scalable, or realistic IMHO. But luckily the open and distributed nature of the web allows other organizations to do the same thing--like the great work you all are doing at the California Digital Library [1] and similar efforts like WebCite [2]. It would be kinda nice if these web archiving solutions sported similar URI patterns to enable discovery. For example it looks like: http://webarchives.cdlib.org/sw1jd4pq4k/http://books.nap.edu/html/id_questions/appB.html references a frame that surrounds an actual representation in time: http://webarchives.cdlib.org/wayback.public/NYUL_ag_3/20090320202246/http://books.nap.edu/html/id_questions/appB.html Which is quite similar to Internet Archive's URI pattern -- not surprising given the common use of Wayback [3]. But there are some differences. It might be nice to promote some URI patterns for web archiving services, so that we could theoretically create applications that federated search for a known resource at a given time. 
I guess in part OpenURL was designed to fill this space, but it might instead be a bit more natural to define a URI pattern that approximated what Wayback does, and come up with some way of sharing archive locations. I'm not sure if that last bit made any sense, or if some attempt at this has been made already. Maybe something to talk about at iPRES? I had hoped that the Zotero/InternetArchive collaboration would lead to some more integration between scholarly use of the web and archiving [3]. I guess there's still time? //Ed [1] http://webarchives.cdlib.org/ [2] http://www.webcitation.org/ [3] http://inkdroid.org/journal/2007/12/17/permalinks-reloaded/ Hi Ed, code4libbers - Sorry for the late reply, but I have been on vacation. Thanks for the insightful comments. They are very much in line with things I have been thinking and you have got me thinking along some other lines as well. Our system is based on crawls, so in your example sw1jd4pq4k is a crawl id. We discussed using the .../20090101.../http://.. scheme directly as in wayback, but decided to use crawl-based URLs as our primary mechanism of entry, given the constraints of our system. (By the way, the ...wayback.public... URL should not be relied on for permanence!) We would, however, like to support the use of wayback style URLs as well. There is some interest in the web archiving community of increasing interoperability between web archive systems, so that we can, for instance, direct a user to web.archive.org if we do not have a URL in our system, and vice versa. 
In terms of getting authors to cite archived material rather than live web material, there are many approaches to this that I can think of, for example:
a) Encouraging authors to link to archive.org or other web archives rather than the live web;
b) Creating services to allow authors to take snapshots of websites, like webcite, if necessary;
c) Rewriting links in our system to point to archives, so that, for instance, the reference (taken from first google search for “mla website citation”, and, of course, broken): Lynch, Tim. DSN Trials and Tribble-ations Review. Psi Phi: Bradley's Science Fiction Club. 1996. Bradley University. 8 Oct. 1997 http://www.bradley.edu/campusorg/psiphi/DS9/ep/503r.html. would be rewritten to the working URL, based on the URL provided and the access time (8 Oct. 1997): http://web.archive.org/web/1997100800/http://www.bradley.edu/campusorg/psiphi/DS9/ep/503r.html
d) Publicizing web archiving so that users know that they can use tools like the web archive to find those broken links.
e) Providing browser plugins so that users who follow 404ed links can be given the alternative of proceeding to an archived web site.
best, Erik Hetzner ;; Erik Hetzner, California Digital Library ;; gnupg key id: 1024D/01DB07E3
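The link rewriting described in (c) follows directly from the Wayback URL pattern quoted earlier in this message, http://web.archive.org/web/{year}{month}{day}{hour}{minute}{seconds}/{url}. A minimal Python sketch, assuming the citation's access date has already been parsed into a `datetime`; the function name `wayback_url` and the `base` parameter are made up for illustration, and other archives (such as the crawl-based CDL URLs) would need their own scheme.

```python
from datetime import datetime


def wayback_url(url, accessed, base="http://web.archive.org/web"):
    """Build a Wayback-style URL for `url` as of the `accessed` time,
    using the 14-digit YYYYMMDDhhmmss timestamp from the pattern above."""
    timestamp = accessed.strftime("%Y%m%d%H%M%S")
    return f"{base}/{timestamp}/{url}"


# The citation example: accessed 8 Oct. 1997, time of day unknown,
# so the sketch falls back to midnight.
cited = wayback_url(
    "http://www.bradley.edu/campusorg/psiphi/DS9/ep/503r.html",
    datetime(1997, 10, 8),
)
```

The sketch only constructs the URL; whether the archive actually holds a capture near that date is a separate lookup.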
Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching
AbleGrape.com is a good example of a focused search engine that aims to index only authoritative sources within a particular discipline -- in this case it's wine, enology, and viticulture. It currently crawls about 40,000 vetted websites. It's a great search engine for the subject area it serves, and it probably helped that the creator was a VP at Inktomi. Keith On Tue, Sep 29, 2009 at 10:53 AM, Cindy Harper char...@colgate.edu wrote: So that led me to speculate about a search engine that ranked just by links from .edus, library sites, and a librarian-vetted list of .orgs, scholarly publishers, etc. I think you can limit by .edu in the linked-from in Google - I haven't tried that much. If anyone here has experience at using that technique, I'd like to hear about it. But I'm thinking now about the possibility of a search engine limited to sites cooperatively vetted by librarians, that would incorporate ranking by # links. Something more responsive than cataloging websites in our catalogs. Is anyone else thinking about these ideas? Or do you know of projects that approach this goal of leveraging librarians' vetting of authoritative sources?