Re: [ol-tech] discovering and linking to search inside functions for hosted text

Jonathan Rochkind Mon, 27 Sep 2010 19:34:00 -0700

Excellent, thanks, very helpful. I'll put this on my list to add "search 
inside" discovery from IA to my software.


For my purposes, the current search inside is quite sufficient; I don't 
actually need a search across the whole corpus, although I'd probably use it 
if I did. 

I did notice something while playing around with the /stream/id URL on various 
items. A HEAD request on a /stream/id that does exist (whether it redirects or 
not) is quite fast, under 500ms.  But a request for a /stream/id where the ID 
isn't valid at all is very slow: 2-3 seconds to return a 500 error.  

This probably isn't a problem for me: I usually get a legit ID by other means, 
so I don't expect to ever end up requesting an invalid one. 

But in general, it would be nice if a request for an invalid ID returned a 404 
instead of a 500, and did so quickly.  That way we could start using plain URLs 
as "REST" style APIs. 
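
To make the status-code idea concrete, here is a minimal Python sketch of the 
check discussed in this thread: send a HEAD request to /stream/<id> without 
following redirects, then read the answer off the status code.  The function 
names, the label strings and the timeout are my own illustration and not part 
of any IA/OL API, and the sketch assumes /stream is served over plain HTTP as 
in the URLs quoted below.

```python
import http.client
from urllib.parse import quote


def classify_stream_status(status):
    """Map the HTTP status of /stream/<id> to a search-inside verdict.

    The labels are hypothetical; only the status-code logic comes from the thread.
    """
    if 200 <= status < 300:
        return "available"      # normal response: the BookReader is served
    if 300 <= status < 400:
        return "unavailable"    # redirect back to /details: no /stream
    if status in (404, 500):
        return "unknown-id"     # 500 is what comes back today, 404 is the wish
    return "error"              # anything else: do not guess


def check_search_inside(item_id, host="www.archive.org", timeout=5.0):
    """HEAD /stream/<item_id> and classify the status.

    http.client never follows redirects, so a 3xx stays visible here.
    Assumption: plain HTTP; use HTTPSConnection if the host requires HTTPS.
    """
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", "/stream/" + quote(item_id, safe=""))
        return classify_stream_status(conn.getresponse().status)
    finally:
        conn.close()
```

Calling check_search_inside("thesetwain00bennrich") would issue one HEAD 
request; the timeout is there because of the 2-3 second responses on invalid 
IDs.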

Jonathan
________________________________________
From: [email protected] [[email protected]] On Behalf Of 
Michael Ang [[email protected]]
Sent: Monday, September 27, 2010 7:22 PM
To: [email protected]
Subject: Re: [ol-tech] discovering and linking to search inside functions for   
hosted text

  On 9/27/10 3:40 PM, Jonathan Rochkind wrote:
> Can you give me an example of a page where:
>
> http://www.archive.org/stream/[id]
>
> does NOT give you a book-reader?  The only examples I've been able to find 
> are where requesting that simply redirects you back to /details -- where, I 
> guess there is no "/stream" available.  But sometimes there's a "/stream" 
> available, but no actual bookreader?  If you could give me an example of 
> that, it would help me figure out the optimal "scraping" approach.
If the item is a text item (I assume you're only using text items, so
this should be true) and you get a normal response on /stream/[id] that
currently means you're getting the BookReader.  So for now you don't
need to parse the HTML.  I'm not aware of any anticipated changes to
that behaviour.
> I'd still really really rather use some kind of API than a scraping approach 
> like that.  I mean, it's even an "api" if you said "request /stream/id -- if 
> you get a redirect, no search inside is available, if you don't, it is".   
> But it would be even better to get it in the same API response as other OL/IA 
> queries, so I didn't need to make another HTTP request just for this.  But 
> making another HTTP request where I can just check the http status is a lot 
> better than having to sniff the page for including a specific js file, that 
> is both more expensive and seems awfully fragile.
>
Checking for the redirect as you describe should work.
> I'd suggest again that you might want to consider making discoverability of 
> this kind of thing by third party apps a priority -- I think exposing this 
> kind of thing in third party apps like mine can really increase exposure and 
> traffic to your materials.
Duly noted.  We're in the process of getting the full-text search to
actually work on openlibrary.org and inside the BookReader.  Good to
have some feedback now on integration points for 3rd-parties.

   - mang
> ________________________________________
> From: [email protected] [[email protected]] On Behalf Of 
> Michael Ang [[email protected]]
> Sent: Monday, September 27, 2010 6:10 PM
> To: [email protected]
> Subject: Re: [ol-tech] discovering and linking to search inside functions for 
>   hosted text
>
>    There are two email lists you might be interested in relating
> specifically to the BookReader:
>
> Announcements, including new releases:
> http://mail.archive.org/cgi-bin/mailman/listinfo/bookreader-announce
>
> General development:
> http://mail.archive.org/cgi-bin/mailman/listinfo/bookreader-devel
>
> On 9/27/10 3:08 PM, Michael Ang wrote:
>>     On 9/27/10 1:44 PM, Jonathan Rochkind wrote:
>>> I think I asked this question like two years ago, and the answer was
>>> "No, not yet, but we'd like that."  So I'm pinging again.
>>>
>>> Some Internet Archive/OL full text exists in a 'page turner' interface
>>> that also has 'search inside' functionality. For instance:
>>> http://www.archive.org/stream/thesetwain00bennrich#page/n5/mode/2up
>>>
>>> Using IA/OL APIs, I am already identifying internet archive ID's of
>>> interest, like say "thesetwain00bennrich".  Using that identifier, is
>>> there any way using IA/OL APIs for me to:
>>>
>>> 1) Discover if a book is available in that page-turner format (not
>>> everything is).
>> Unfortunately the logic to determine if a book can be displayed is a
>> little complicated and we don't have a proper API that exposes the result.
>>
>> In the meantime this is a little cheesy but you could fetch
>> http://www.archive.org/stream/{itemid} and look for the string
>> "BookReader.js" in the returned HTML.  That should indicate that the
>> BookReader is being served.
>>
>> That should work for all the books which we've scanned.  For user
>> uploaded text items it's a little more complicated since there is
>> usually an additional 'sub-prefix' that is also required.  Right now
>> there isn't a great way to find out the sub-prefix... we make that
>> determination by looking at the item files.xml for the files that the
>> BookReader needs (sorry).
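
The string check described above could be sketched in Python roughly as 
follows.  This is only an illustration of the interim workaround for scanned 
books: the function names are made up, the marker string is the one mentioned 
in the message, and nothing here handles the 'sub-prefix' case for user 
uploaded items.

```python
import urllib.request
from urllib.parse import quote

# Marker string named in the thread; its presence suggests the BookReader is served.
MARKER = "BookReader.js"


def html_mentions_bookreader(html):
    """Return True if the page source contains the BookReader marker."""
    return MARKER in html


def fetch_and_sniff(item_id, base="http://www.archive.org/stream/", timeout=10.0):
    """Fetch /stream/<item_id> and look for the marker in the returned HTML.

    urlopen follows redirects, so an item that bounces to /details is simply
    sniffed on the details page.  An invalid ID that yields a 500 raises
    urllib.error.HTTPError, which the caller has to handle.
    """
    url = base + quote(item_id, safe="")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return html_mentions_bookreader(body)
```

This costs a full page download per item, which is why checking the HTTP 
status alone is the cheaper option when it is enough.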
>>
>>> 2) Deep link into search results for a particular query in a particular
>>> book.
>> This already works by appending "search/{terms}" after the # in the
>> BookReader URL.
>>
>> e.g.
>> http://www.archive.org/stream/nimrodofseaorame00davirich#page/18/mode/2up/search/albatross
>>
>> This is documented here:
>> http://openlibrary.org/dev/docs/bookurls#searching
>>
>> We're working on using an improved full-text search engine instead of
>> the current rudimentary search.  This should only give better results
>> and shouldn't affect the deep-linked search URLs!
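
For third-party code, building such a deep link is a small string exercise.  
The Python sketch below follows the shape of the example URL above 
(page/<page>/mode/<mode>/search/<terms> after the #); the function name is 
hypothetical, and percent-encoding the search terms is an assumption about how 
the BookReader reads multi-word queries.

```python
from urllib.parse import quote


def search_inside_url(item_id, terms, page, mode="2up",
                      base="http://www.archive.org/stream/"):
    """Build a BookReader deep link that opens a search for `terms`.

    `page` is the page token as it appears in BookReader URLs, e.g. "18" or "n5".
    Assumption: terms are percent-encoded inside the fragment.
    """
    fragment = "page/{}/mode/{}/search/{}".format(page, mode, quote(terms, safe=""))
    return base + quote(item_id, safe="") + "#" + fragment


if __name__ == "__main__":
    # Reproduces the example link from the message above.
    print(search_inside_url("nimrodofseaorame00davirich", "albatross", page="18"))
```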
>>
>>      - mang
>>> If #1 can be taken care of, but #2 can't be because of limitations in
>>> the javascript reader, then I might try to find time to submit a patch
>>> to the javascript reader to make that possible, although I'm not sure
>>> when I'd find the time to do so.
>>>
>>> Jonathan
>>> _______________________________________________
>>> Ol-tech mailing list
>>> [email protected]
>>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
>>> To unsubscribe from this mailing list, send email to 
>>> [email protected]

