[CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Yitzchak Schaffer

Hello,

Sorry for any cross-posting annoyance.  I have a request for a 
Greenstone collection I'm working on, to add context snippets to search 
results; for example a search for yak culture might return this in the 
list of results:


... addressing the fine points of <strong>yak culture</strong>, the 
zoosociologists took into account ...


Sounds like a pretty basic feature, say our sponsors, and I agree.  (Ah, 
it's also an old Trac ticket at http://trac.greenstone.org/ticket/444)


I see that GS out-of-the-box is set *not* to store the fulltext in the 
index, which seems to be a prerequisite for this kind of thing, as in 
http://bit.ly/ljNkL .  Has anyone modified the Lucene indexing wrapper 
locally to do this?


Given that we don't have any Java coders on staff, I've started porting 
the Lucene wrapper to PHP for use with a custombuilder.pl and 
Zend_Search_Lucene.  I already have a PHP frontend, so adjusting that to 
display the results shouldn't be a problem; OTOH because the frontend is 
PHP, I'm restricted to using buildtype lucene, or something else with 
good PHP support.


Many thanks,

--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
33 West 23rd Street
New York, NY 10010
Tel (212) 463-0400 x5230
Fax (212) 627-3197
Email yitzchak.schaf...@gmx.com


Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Erik Hatcher
The Lucene Highlighter doesn't require that the text you want  
highlighted be stored.  In fact, you can pass in any arbitrary text to  
the Highlighter.


See the various getBestFragments from the Highlighter class:
  http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/highlight/Highlighter.html 
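
The point that the text can come from anywhere can be illustrated with a minimal sketch. This is plain Python, not the Lucene Highlighter API; the function name `best_fragment`, the `<strong>` markup, and the default 30-character context width are assumptions made for the example.

```python
import re


def best_fragment(text, phrase, context=30):
    """Return a snippet of `text` around the first match of `phrase`.

    The text can come from a file, a database, or anywhere else;
    nothing here depends on an index. Returns None if there is no match.
    """
    match = re.search(re.escape(phrase), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - context)
    end = min(len(text), match.end() + context)
    before = text[start:match.start()]
    after = text[match.end():end]
    # Mark truncation with an ellipsis only where text was actually cut.
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(text) else ""
    return f"{prefix}{before}<strong>{match.group(0)}</strong>{after}{suffix}"
```

The sketch does not escape HTML in the source text, which a real frontend would need to do before display.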



Erik




Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Yitzchak Schaffer

Erik Hatcher wrote:
The Lucene Highlighter doesn't require that the text you want 
highlighted be stored.  In fact, you can pass in any arbitrary text to 
the Highlighter.


Thanks Erik,

What I'm looking for is to return the context of the search result, not 
just the ID of the containing document - e.g. when all I input is yak 
culture, I get back the context from the document as a search result, 
without having to retrieve the doc itself:


... addressing the fine points of <strong>yak culture</strong>, the 
zoosociologists took into account ...


GS out of the box does not appear to support this, as it does not store 
the fulltext in the index.  So yes, I can highlight stuff, but as it 
stands, I don't have the text to work with.  IANA Lucene guru, so 
correct me if I misunderstand.


--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
33 West 23rd Street
New York, NY 10010
Tel (212) 463-0400 x5230
Fax (212) 627-3197
Email yitzchak.schaf...@gmx.com


Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Erik Hatcher

On Sep 29, 2009, at 7:33 AM, Yitzchak Schaffer wrote:


Erik Hatcher wrote:
The Lucene Highlighter doesn't require that the text you want  
highlighted be stored.  In fact, you can pass in any arbitrary text  
to the Highlighter.


Thanks Erik,

What I'm looking for is to return the context of the search result,  
not just the ID of the containing document - e.g. when all I input  
is yak culture, I get back the context from the document as a  
search result, without having to retrieve the doc itself:


... addressing the fine points of <strong>yak culture</strong>, the  
zoosociologists took into account ...


GS out of the box does not appear to support this, as it does not  
store the fulltext in the index.  So yes, I can highlight stuff, but  
as it stands, I don't have the text to work with.  IANA Lucene guru,  
so correct me if I misunderstand.


I'm a bit confused then.  You mentioned that somehow Zend Lucene was  
going to help, but if you don't have the text to highlight anywhere  
then the Highlighter isn't going to be of any use.  Again, you don't  
need the full text in the Lucene index, but you do need to get it from  
somewhere in order to be able to highlight it.


Erik


[CODE4LIB] Bookmarking web links - authoritativeness or focused searching

2009-09-29 Thread Cindy Harper
I've been thinking about the role of libraries as promoter of authoritative
works - helping to select and sort the plethora of information out there.
And I heard another presentation about social media this morning.  So I
thought I'd bring up for discussion here some of the ideas I've been mulling
over.

Last week I sent this message to the Suggestions and Ideas forum at
delicious.
http://support.delicious.com/forum/comments.php?DiscussionID=3237&page=1#Item_0
The basic idea is to develop a delicious network of librarians. Or a network
of faculty members.  Then have one login whose network included those users,
and share that login so that lots of people could share that network.
Delicious responded that we could have a wiki where people posted their
delicious names so that others could add them to their personal networks,
but that doesn't scale up very well.

Or another project I've toyed with, involving focused searching:  I started
with Robert Teeter's index to Great Books lists.
http://www.interleaves.org/~rteeter/grtalphaa.html
I've almost completed pulling them into a MySQL database so that I could
sort the titles by the number of Great Books lists that mention each title.
Then I thought about how one could do focused searching of the web,
collecting pages with a title containing (best and books) or (great and
books), and screen scraping title lists (you'd have to have some heuristic
method of identifying the data, of course, and I'm aware what problems might
arise there).  But my test searches in that idea showed that one runs into a
lot of commercial ephemeral lists and spurious lists.  Now, you could rely
on crowd-sourcing to filter out the consensus by ranking by the number of
sites/cites.  But I thought you might want to differentiate between the
sources - .edus, libraries, etc.
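
A rough sketch of the counting step, in Python rather than MySQL. The helper name `rank_titles` is hypothetical, and the input is assumed to be a mapping from list name to the titles on that list.

```python
from collections import Counter


def rank_titles(lists):
    """Rank titles by how many lists mention them.

    `lists` maps a list name to the titles it contains. A title counts
    once per list, even if a list repeats it.
    """
    counts = Counter()
    for titles in lists.values():
        counts.update(set(titles))
    # Most-mentioned first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Deduplicating with `set` keeps one sloppy list from inflating a title's count.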

So that led me to speculate about a search engine that ranked just by links
from .edus, library sites, and a librarian-vetted list of .orgs,
scholarly publishers, etc.  I think you can limit by .edu in the linked-from
in Google - I haven't tried that much.  If anyone here has experience at
using that technique, I'd like to hear about it.  But I'm thinking now about
the possibility of a search engine limited to sites cooperatively vetted by
librarians, that would incorporate ranking by # links.  Something more
responsive than cataloging websites in our catalogs.
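
One way the ranking could work, as a minimal Python sketch. It assumes the link graph has already been collected as (source URL, target URL) pairs; the function names and the `vetted_hosts` set are made up for illustration, and counting distinct linking hosts is just one possible reading of "ranking by # links".

```python
from collections import defaultdict
from urllib.parse import urlparse


def is_vetted(host, vetted_hosts):
    """True for .edu hosts and for hosts on the librarian-vetted list."""
    return host.endswith(".edu") or host in vetted_hosts


def rank_by_vetted_links(links, vetted_hosts):
    """Rank target URLs by the number of distinct vetted hosts linking to them.

    `links` is an iterable of (source_url, target_url) pairs.
    """
    linking_hosts = defaultdict(set)
    for source, target in links:
        host = (urlparse(source).hostname or "").lower()
        if is_vetted(host, vetted_hosts):
            linking_hosts[target].add(host)
    ranked = [(target, len(hosts)) for target, hosts in linking_hosts.items()]
    # Most-linked first; ties broken by URL for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

Counting hosts rather than raw links means a single site linking many times still counts once.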

Is anyone else thinking about these ideas?  Or do you know of projects that
approach this goal of leveraging librarians' vetting of authoritative
sources?




Cindy Harper, Systems Librarian
Colgate University Libraries
char...@colgate.edu
315-228-7363


Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching

2009-09-29 Thread Kent Gerber
www.diigo.com is a social bookmarking site like delicious and it has added 
features like creating groups around specific themes and the ability to 
annotate the Web pages you bookmark for future reference.  You might want to 
explore this feature and see if it is appropriate for what you envision. 

Kent

Kent Gerber, MSLIS
Digital Library Manager
Bethel University
St. Paul, MN
phone: 651.638.6937
email: kent-ger...@bethel.edu






Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Yitzchak Schaffer

Erik Hatcher wrote:
I'm a bit confused then.  You mentioned that somehow Zend Lucene was 
going to help, but if you don't have the text to highlight anywhere then 
the Highlighter isn't going to be of any use.  Again, you don't need the 
full text in the Lucene index, but you do need to get it from somewhere 
in order to be able to highlight it.


Erik,

I started to port the native Greenstone Java Lucene wrapper to PHP, so I 
could then modify it to add this feature, as I don't know Java.  This 
would mean using Zend Lucene for the actual indexing implementation.  My 
question is whether anyone's already done it, in Java or otherwise.


Thanks for the clarification,

--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
33 West 23rd Street
New York, NY 10010
Tel (212) 463-0400 x5230
Fax (212) 627-3197
Email yitzchak.schaf...@gmx.com


Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching

2009-09-29 Thread Jason Griffey
It's not social bookmarking, but as far as "But I'm thinking now about
the possibility of a search engine limited to sites cooperatively vetted by
librarians, that would incorporate ranking by # links.  Something more
responsive than cataloging websites in our catalogs.", well, that's almost
exactly what lii.org is.

http://lii.org

I happen to think that authority is dead dead dead as a method of measuring
information worth, but that's just me. :-)

Jason




-- 
Follow me on Twitter! http://www.twitter.com/griffey


Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Ross Singer
Yitzchak, are you interested in actually searching the fulltext?  Or just
highlighting the terms?

If you're only interested in highlighting it, it might be a whole lot easier
to implement this in javascript through something like jQuery:

http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html

That way you're not juggling mostly redundant Lucene indexes and trying to
keep them synced.

How are you getting your search results?  Does Greenstone have some sort of
search API that returns the highlighted results?  Would it make a difference
if you could add a field to the Lucene document (meaning would you have
access to it through your PHP API to Greenstone)?  If so, you could probably
do this pretty easily via one of the JVM scripting languages (Groovy, JRuby,
Jython, Quercus -- PHP in the JVM) so you just have the single Lucene index
instead of multiple.

Another approach might be to serve the Lucene index via Solr [1] or
Lucene-WS (http://lucene-ws.net/) which would allow you to skip Greenstone
altogether for searching.

Basically, I would try to avoid going the Zend_Lucene route if at all
possible.

-Ross.

1.
http://www.google.com/search?q=solr+on+an+existing+lucene+index&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a



Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching

2009-09-29 Thread MJ Ray
Cindy Harper wrote:
 I've been thinking about the role of libraries as promoter of authoritative
 works - helping to select and sort the plethora of information out there.
 And I heard another presentation about social media this morning.  So I
 thought I'd bring up for discussion here some of the ideas I've been mulling
 over. [...]
 Is anyone else thinking about these ideas?  Or do you know of projects that
 approach this goal of leveraging librarians' vetting of authoritative
 sources?

The big problem with social media sites is that they tend towards
privatising our data.  Any solution needs to be both FOSS and
Open Data to overcome that.

Some of the veterans here will probably remember the ODP (dmoz.org)
and VLib.org catalogues.  Can we build on them instead of inventing
another wheel?

Thanks,
-- 
MJ Ray (slef)  LMS developer and webmaster at | software
www.software.coop http://mjr.towers.org.uk|   co
IMO only: see http://mjr.towers.org.uk/email.html |   op


Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching

2009-09-29 Thread Joe Hourcle

On Tue, 29 Sep 2009, Cindy Harper wrote:


I've been thinking about the role of libraries as promoter of authoritative
works - helping to select and sort the plethora of information out there.
And I heard another presentation about social media this morning.  So I
thought I'd bring up for discussion here some of the ideas I've been mulling
over.


[trimmed]


Is anyone else thinking about these ideas?  Or do you know of projects that
approach this goal of leveraging librarians' vetting of authoritative
sources?


I don't know of any projects that specifically do what you've mentioned, 
but for the last few years, we've been mulling over how to store various 
lists and catalogs so that we could present interesting intersections of 
them.


In my case, I deal with scientific catalogs, so it's stuff like "When was 
RHESSI observing the same area as TRACE?" or "When was there an X-class 
flare within 2 hours of a CME?" or even lack of intersections: "When were 
there type-II radio bursts without a CME or flare within 6 hours?"
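
As a rough illustration of that kind of query, here is a small Python sketch. It assumes each catalog can be reduced to a list of event times; the function names are invented for the example, and the brute-force comparison is only meant for small lists.

```python
from datetime import datetime, timedelta


def events_within(primary, secondary, window):
    """Return primary events with at least one secondary event within `window`."""
    return [
        p for p in primary
        if any(abs(p - s) <= window for s in secondary)
    ]


def events_without(primary, secondary, window):
    """Return primary events with no secondary event within `window`."""
    matched = events_within(primary, secondary, window)
    return [p for p in primary if p not in matched]
```

For example, `events_within(flares, cmes, timedelta(hours=2))` would answer the flare/CME question, and `events_without` covers the "lack of intersections" case. Real catalogs would need interval data and a smarter join than this pairwise scan.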


For the science catalogs, we specifically don't want to just make some 
sort of single ranking from each list, and it's not really easy to merge 
the catalogs into some form of union catalog as they're cataloging 
different concepts.


... and I think that there's use in library searches to keep the catalogs 
different, particularly when you're bringing up authority (which then gets 
to reputation, etc.).


I'm not sure how many other people out there would try to search for "Hugo 
award winning novels that weren't on the New York Times best seller list", 
so it might not be as useful for general patron use ... unless you could 
give it your *own* catalog ("AFI top 100 movies ... that I don't already 
own").



-
Joe Hourcle
Solar Data Analysis Center
Goddard Space Flight Center


Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching

2009-09-29 Thread Donahue, Amy (NIH/NLM) [C]
I feel like a couple years ago a librarian(s?) created a Google Custom Search 
Engine that did exactly what you describe as focused searching, but I can't 
find a link any more.  You can search the CSEs by scrolling down on this page 
(and there are a couple of links to directories, too): 
http://www.lib.berkeley.edu/find/types/websites.html

Also, Mike Eisenberg over at the University of Washington was working on that 
kind of problem with some other groups...A quick search reveals that it's now 
called Reference Extract and it's being done in conjunction with Syracuse 
University (and OCLC is somehow involved).  
http://chronicle.com/blogPost/Librarians-Want-to-Out-Google/4365

~Amy



Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Yitzchak Schaffer

Ross Singer wrote:

Yitzchak, are you interested in actually searching the fulltext?  Or just
highlighting the terms?


Sorry this wasn't clearer.  Let me re-summarize, and report on a new 
development:


- Greenstone allows for Lucene as one of the indexing plugins

- I took advantage of this for use in our PHP frontend, EmeraldView 
(http://emeraldview.tourolib.org/)


- Greenstone includes a Java wrapper class for Lucene which indexes 
documents as the collection is built


- This wrapper class indexes but does not store the document full text; 
thus a search only returns document IDs of hits.  This means that, in 
order to place search terms in context, we have to load the actual 
documents.  I want the search API itself to return the surrounding text.


New info:

I was in fact able to hack the Java to include the full text in the 
index.  Just a matter of adding a line of code and an if statement, 
once I'd been immersed in the code long enough.  Trying to port it to 
PHP (i.e. rewrite it) was instrumental in figuring out why in the world 
the Greenstone indexing code is structured the way it is.
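
The general shape of that kind of change, an indexer that keeps the full text only when asked to, can be sketched as a toy inverted index in Python. This is not Greenstone's wrapper or the Lucene API; the class `TinyIndex` and its `store_text` flag are invented for illustration.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index that optionally keeps each document's full text."""

    def __init__(self, store_text=False):
        self.store_text = store_text
        self.postings = defaultdict(set)   # term -> set of doc IDs
        self.stored = {}                   # doc ID -> full text, if stored

    def add(self, doc_id, text):
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(doc_id)
        # Optional step: keep the text so a search can return it directly.
        if self.store_text:
            self.stored[doc_id] = text

    def search(self, term):
        """Return (doc_id, text_or_None) pairs for documents containing `term`."""
        hits = sorted(self.postings.get(term.lower(), ()))
        return [(doc_id, self.stored.get(doc_id)) for doc_id in hits]
```

With `store_text=False` a search yields only document IDs, so the caller has to load the documents to show context; with `store_text=True` the text comes back with the hit, at the cost of a larger index.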


--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
33 West 23rd Street
New York, NY 10010
Tel (212) 463-0400 x5230
Fax (212) 627-3197
Email yitzchak.schaf...@gmx.com


Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Yitzchak Schaffer

Ross Singer wrote:

Yitzchak, are you interested in actually searching the fulltext?  Or just
highlighting the terms?


Just in case my earlier response didn't make it crystal clear: we're 
trying to search the fulltext, and put the search string in context 
within the document which includes it.


--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
33 West 23rd Street
New York, NY 10010
Tel (212) 463-0400 x5230
Fax (212) 627-3197
Email yitzchak.schaf...@gmx.com


Re: [CODE4LIB] web archiving - was: Implementing OpenURL for simple web resources

2009-09-29 Thread Erik Hetzner
At Fri, 18 Sep 2009 10:40:08 -0400,
Ed Summers wrote:
 
 Hi Erik, all

 […]

 I haven't been following this thread completely, but you've taken it
 in an interesting direction. I think you've succinctly described the
 issue with using URLs as references in an academic context: that the
 integrity of the URL is a function of time. As John Kunze has said:
 "Just because the URI was the last to see a resource alive doesn't
 mean it killed them" :-)
 
 I'm sure you've seen this, but Internet Archive have a nice URL
 pattern for referencing a resource representation in time:
 
   http://web.archive.org/web/{year}{month}{day}{hour}{minute}{seconds}/{url}
 
 So for example you can reference Google's homepage on December 2, 1998
 at 23:04:10 with this URL:
 
   http://web.archive.org/web/19981202230410/http://www.google.com/
 
 As Mike's email points out this is only good as long as Internet
 Archive is up and running the way we expect it to. Having any one
 organization shoulder this burden isn't particularly scalable, or
 realistic IMHO. But luckily the open and distributed nature of the
 web allows other organizations to do the same thing--like the great
 work you all are doing at the California Digital Library [1] and
 similar efforts like WebCite [2]. It would be kinda nice if these
 web archiving solutions sported similar URI patterns to enable
 discovery. For example it looks like:
 
   
 http://webarchives.cdlib.org/sw1jd4pq4k/http://books.nap.edu/html/id_questions/appB.html
 
 references a frame that surrounds an actual representation in time:
 
   
 http://webarchives.cdlib.org/wayback.public/NYUL_ag_3/20090320202246/http://books.nap.edu/html/id_questions/appB.html
 
 Which is quite similar to Internet Archive's URI pattern -- not
 surprising given the common use of Wayback [3]. But there are some
 differences. It might be nice to promote some URI patterns for web
 archiving services, so that we could theoretically create
 applications that federated search for a known resource at a given
 time. I guess in part OpenURL was designed to fill this space, but
 it might instead be a bit more natural to define a URI pattern that
 approximated what Wayback does, and come up with some way of sharing
 archive locations. I'm not sure if that last bit made any sense, or
 if some attempt at this has been made already. Maybe something to
 talk about at iPRES?
 
 I had hoped that the Zotero/InternetArchive collaboration would lead
 to some more integration between scholarly use of the web and
 archiving [3]. I guess there's still time?
 
 //Ed
 
 [1] http://webarchives.cdlib.org/
 [2] http://www.webcitation.org/
 [3] http://inkdroid.org/journal/2007/12/17/permalinks-reloaded/

Hi Ed, code4libbers -

Sorry for the late reply, but I have been on vacation.

Thanks for the insightful comments. They are very much in line with
things I have been thinking and you have got me thinking along some
other lines as well.

Our system is based on crawls, so in your example sw1jd4pq4k is a
crawl id. We discussed using the .../20090101.../http://.. scheme
directly as in wayback, but decided to use crawl-based URLs as our
primary mechanism of entry, given the constraints of our system.

(By the way, the ...wayback.public... URL should not be relied on
for permanence!)

We would, however, like to support the use of wayback style URLs as
well. There is some interest in the web archiving community of
increasing interoperability between web archive systems, so that we
can, for instance, direct a user to web.archive.org if we do not have
a URL in our system, and vice versa.

In terms of getting authors to cite archived material rather than live
web material, there are many approaches to this that I can think of,
for example:

a) Encouraging authors to link to archive.org or other web archives
rather than the live web;

b) Creating services to allow authors to take snapshots of websites,
like webcite, if necessary;

c) Rewriting links in our system to point to archives, so that, for
instance, the reference (taken from first google search for “mla
website citation”, and, of course, broken):

Lynch, Tim. "DSN Trials and Tribble-ations Review." Psi Phi: Bradley's
Science Fiction Club. 1996. Bradley University. 8 Oct. 1997
<http://www.bradley.edu/campusorg/psiphi/DS9/ep/503r.html>.

would be rewritten to the working URL, based on the URL provided and
the access time (8 Oct. 1997):

http://web.archive.org/web/1997100800/http://www.bradley.edu/campusorg/psiphi/DS9/ep/503r.html

d) Publicizing web archiving so that users know that they can use tools
like the web archive to find those broken links.

e) Providing browser plugins so that users who follow 404ed links can
be given the alternative of proceeding to an archived web site.
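
The rewriting in (c) can be sketched in a few lines of Python, using the URL pattern from Ed's message quoted above. The function name `wayback_url` and its `base` parameter are choices made for the example; the sketch only builds the URL and does not check whether the archive holds a capture near that time.

```python
from datetime import datetime


def wayback_url(url, when, base="http://web.archive.org/web"):
    """Build a Wayback-style URL for `url` as of the datetime `when`."""
    # Timestamp layout: {year}{month}{day}{hour}{minute}{seconds}
    return f"{base}/{when:%Y%m%d%H%M%S}/{url}"
```

A citation's access date (8 Oct. 1997 above) has no time of day, so a caller would pass midnight, e.g. `datetime(1997, 10, 8)`.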

best,
Erik Hetzner
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3




Re: [CODE4LIB] Bookmarking web links - authoritativeness or focused searching

2009-09-29 Thread Keith Jenkins
AbleGrape.com is a good example of a focused search engine that aims
to index only authoritative sources within a particular discipline --
in this case it's wine, enology, and viticulture.  It currently crawls
about 40,000 vetted websites.

It's a great search engine for the subject area it serves, and it
probably helped that the creator was a VP at Inktomi.

Keith


On Tue, Sep 29, 2009 at 10:53 AM, Cindy Harper char...@colgate.edu wrote:
 So that led me to speculate about a search engine that ranked just by links
 from .edus, library sites, and a librarian-vetted list of .orgs,
 scholarly publishers, etc.  I think you can limit by .edu in the linked-from
 in Google - I haven't tried that much.  If anyone here has experience at
 using that technique, I'd like to hear about it.  But I'm thinking now about
 the possibility of a search engine limited to sites cooperatively vetted by
 librarians, that would incorporate ranking by # links.  Something more
 responsive than cataloging websites in our catalogs.

 Is anyone else thinking about these ideas?  Or do you know of projects that
 approach this goal of leveraging librarians' vetting of authoritative
 sources?