Excellent, thanks, very helpful. I'll put this on my list to add "search 
inside" discovery from IA to my software. 

For my purposes, the current search inside is quite sufficient; I 
don't actually need a search across the whole corpus, although I'd probably use 
one if it were available. 

I did notice something while playing around with the /stream/id URL on various items. 
A HEAD request on a /stream/id that does exist (whether it redirects or 
not) is quite fast, under 500ms.  But a request for a /stream/id whose ID 
isn't valid at all is very slow: 2-3 seconds 
to return a 500 error.  

I think this may not be a problem for me, as I usually have a legit ID by other 
means so I don't think I'll ever end up requesting an invalid ID. 

But in general, it would be nice if you could make a request for a non-valid ID 
return a 404 instead of a 500, and do it pretty quickly.  That way we can start 
using just plain URLs as "REST" style APIs. 
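
To make the request concrete, here is a minimal Python sketch of the kind of status check a client could do. The function names, outcome labels, and timeout are my own assumptions for illustration; the 404 branch describes the behavior being asked for, not what the server does today.

```python
import time
import urllib.error
import urllib.request

# URL pattern discussed in this thread.
STREAM_URL = "http://www.archive.org/stream/{item_id}"


def interpret_status(status):
    """Map an HTTP status from /stream/{id} to a coarse outcome (hypothetical labels)."""
    if 200 <= status < 400:
        return "exists"        # a valid ID, whether or not it redirects
    if status == 404:
        return "unknown-id"    # the requested behavior for an invalid ID
    if status >= 500:
        return "server-error"  # what an invalid ID currently produces
    return "other"


def head_stream(item_id, timeout=10.0):
    """Send a HEAD request and return (status, seconds elapsed).

    urlopen follows redirects by default, so the status is that of the
    final response.
    """
    request = urllib.request.Request(
        STREAM_URL.format(item_id=item_id), method="HEAD"
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code      # urllib reports 4xx/5xx as exceptions
    return status, time.monotonic() - start
```

With a fast 404, a caller could treat "unknown-id" as a definite answer rather than retrying after a 500.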

Jonathan
________________________________________
From: [email protected] [[email protected]] On Behalf Of 
Michael Ang [[email protected]]
Sent: Monday, September 27, 2010 7:22 PM
To: [email protected]
Subject: Re: [ol-tech] discovering and linking to search inside functions for   
hosted text

  On 9/27/10 3:40 PM, Jonathan Rochkind wrote:
> Can you give me an example of a page where:
>
> http://www.archive.org/stream/[id]
>
> does NOT give you a book-reader?  The only examples I've been able to find 
> are where requesting that simply redirects you back to /details -- where, I 
> guess there is no "/stream" available.  But sometimes there's a "/stream" 
> available, but no actual bookreader?  If you could give me an example of 
> that, it would help me figure out the optimal "scraping" approach.
If the item is a text item (I assume you're only using text items, so
this should be true) and you get a normal response on /stream/[id] that
currently means you're getting the BookReader.  So for now you don't
need to parse the HTML.  I'm not aware of any anticipated changes to
that behaviour.
> I'd still really rather use some kind of API than a scraping approach 
> like that.  It would even count as an "API" if you said "request /stream/id -- if 
> you get a redirect, no search inside is available; if you don't, it is".   
> But it would be even better to get it in the same API response as other OL/IA 
> queries, so I wouldn't need to make another HTTP request just for this.  Still, 
> making another HTTP request where I can just check the HTTP status is a lot 
> better than having to sniff the page for a specific JS file, which 
> is both more expensive and awfully fragile.
>
Checking for the redirect as you describe should work.
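
A rough Python sketch of that redirect check, for illustration only. The class and function names are hypothetical, and the rule it encodes (normal response means BookReader, redirect means none) is just the current behaviour described in this thread, which applies to text items.

```python
import urllib.error
import urllib.request

# URL pattern discussed in this thread.
STREAM_URL = "http://www.archive.org/stream/{item_id}"


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the 3xx status stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # with no new request, urllib raises HTTPError for the 3xx


def search_inside_available(status):
    """Apply the rule from this thread to a /stream/{id} status code."""
    if 300 <= status < 400:
        return False  # redirected back to /details: no BookReader
    if status == 200:
        return True   # normal response: BookReader is being served
    raise ValueError(f"unexpected status {status}")


def stream_status(item_id, timeout=10.0):
    """HEAD /stream/{id} without following redirects; return the status code."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(
        STREAM_URL.format(item_id=item_id), method="HEAD"
    )
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx, and 5xx all arrive here
```

Usage would be `search_inside_available(stream_status("thesetwain00bennrich"))`, with the ValueError covering cases such as the 500 returned for an invalid ID.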
> I'd suggest again that you might want to consider making discoverability of 
> this kind of thing by third party apps a priority -- I think exposing this 
> kind of thing in third party apps like mine can really increase exposure and 
> traffic to your materials.
Duly noted.  We're in the process of getting the full-text search to
actually work on openlibrary.org and inside the BookReader.  Good to
have some feedback now on integration points for 3rd-parties.

   - mang
> ________________________________________
> From: [email protected] [[email protected]] On Behalf Of 
> Michael Ang [[email protected]]
> Sent: Monday, September 27, 2010 6:10 PM
> To: [email protected]
> Subject: Re: [ol-tech] discovering and linking to search inside functions for 
>   hosted text
>
>    There are two email lists you might be interested in relating
> specifically to the BookReader:
>
> Announcements, including new releases:
> http://mail.archive.org/cgi-bin/mailman/listinfo/bookreader-announce
>
> General development:
> http://mail.archive.org/cgi-bin/mailman/listinfo/bookreader-devel
>
> On 9/27/10 3:08 PM, Michael Ang wrote:
>>     On 9/27/10 1:44 PM, Jonathan Rochkind wrote:
>>> I think I asked this question like two years ago, and the answer was
>>> "No, not yet, but we'd like that."  So I'm pinging again.
>>>
>>> Some Internet Archive/OL full text exists in a 'page turner' interface
>>> that also has 'search inside' functionality. For instance:
>>> http://www.archive.org/stream/thesetwain00bennrich#page/n5/mode/2up
>>>
>>> Using IA/OL APIs, I am already identifying Internet Archive IDs of
>>> interest, like say "thesetwain00bennrich".  Using that identifier, is
>>> there any way using IA/OL APIs for me to:
>>>
>>> 1) Discover if a book is available in that page-turner format (not
>>> everything is).
>> Unfortunately the logic to determine if a book can be displayed is a
>> little complicated and we don't have a proper API that exposes the result.
>>
>> In the meantime this is a little cheesy but you could fetch
>> http://www.archive.org/stream/{itemid} and look for the string
>> "BookReader.js" in the returned HTML.  That should indicate that the
>> BookReader is being served.
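
The string check described above might look like the following Python sketch. The function names and the timeout are assumptions made for illustration; only the URL pattern and the "BookReader.js" marker come from this thread.

```python
import urllib.request

# Marker string suggested in this thread.
MARKER = "BookReader.js"


def html_has_bookreader(html):
    """Return True if the page source mentions the BookReader script."""
    return MARKER in html


def fetch_stream_html(item_id, timeout=10.0):
    """Fetch /stream/{itemid} and return the body as text."""
    url = f"http://www.archive.org/stream/{item_id}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A caller would combine the two as `html_has_bookreader(fetch_stream_html(item_id))`. This downloads the whole page, so it costs more than a status check.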
>>
>> That should work for all the books which we've scanned.  For user
>> uploaded text items it's a little more complicated since there is
>> usually an additional 'sub-prefix' that is also required.  Right now
>> there isn't a great way to find out the sub-prefix... we make that
>> determination by looking at the item files.xml for the files that the
>> BookReader needs (sorry).
>>
>>> 2) Deep link into search results for a particular query in a particular
>>> book.
>> This already works by appending "search/{terms}" after the # in the
>> BookReader URL.
>>
>> e.g.
>> http://www.archive.org/stream/nimrodofseaorame00davirich#page/18/mode/2up/search/albatross
>>
>> This is documented here:
>> http://openlibrary.org/dev/docs/bookurls#searching
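
As an illustration, a small Python helper that builds such a deep link. The function name and parameters are hypothetical, and percent-encoding the search terms is an assumption about how the BookReader reads the fragment, not something confirmed in this thread.

```python
from urllib.parse import quote


def search_inside_url(item_id, terms, page, mode="2up"):
    """Build a BookReader URL with "search/{terms}" appended after the #."""
    # Assumption: terms are percent-encoded inside the fragment.
    fragment = f"page/{page}/mode/{mode}/search/{quote(terms, safe='')}"
    return f"http://www.archive.org/stream/{item_id}#{fragment}"


# Rebuilds the example URL given above.
print(search_inside_url("nimrodofseaorame00davirich", "albatross", page="18"))
```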
>>
>> We're working on using an improved full-text search engine instead of
>> the current rudimentary search.  This should only give better results
>> and shouldn't affect the deep-linked search URLs!
>>
>>      - mang
>>> If #1 can be taken care of, but #2 can't be because of limitations in
>>> the javascript reader, then I might try to find time to submit a patch
>>> to the javascript reader to make that possible, although I'm not sure
>>> when I'd find the time to do so.
>>>
>>> Jonathan
>>> _______________________________________________
>>> Ol-tech mailing list
>>> [email protected]
>>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
>>> To unsubscribe from this mailing list, send email to 
>>> [email protected]

