Re: [CODE4LIB] OregonDigital's BookReader integration (was: Listserv communication)

2016-03-04 Thread Schuyler Lindberg
Hi all,

In response to Laura's comment, I thought I'd share that at UBC we've
included the 'direct-to-page-link' functionality in our Open Collections
search interface.
It is not loaded by default (you must select the 'detailed view' option, or
click to expand a particular result) because, as Chad mentioned, it has
quite a bit of overhead and in our testing only some (very vocal) users
consistently clicked the links. We use Elasticsearch, but it works much the
same way Josh described: firing additional queries for each 'compound
object' to search the page-level full-text metadata.
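
For anyone who wants to experiment with this kind of per-object follow-up query, here is a minimal sketch of a request body, written as a plain Python dict so it needs no client library. The field names (`parent_id`, `page_number`, `full_text`) and the object id are assumptions made up for illustration; they are not the actual Open Collections mapping.

```python
import json


def page_hits_query(parent_id, terms, max_pages=10):
    """Build an Elasticsearch request body that searches only the
    page-level documents belonging to one compound object."""
    return {
        "size": max_pages,
        "_source": ["page_number"],  # hypothetical field
        "query": {
            "bool": {
                # Full-text match against the page text (scored).
                "must": [{"match": {"full_text": terms}}],
                # Restrict to the pages of this one object (not scored).
                "filter": [{"term": {"parent_id": parent_id}}],
            }
        },
        # Ask for snippets so each page link can show matching context.
        "highlight": {"fields": {"full_text": {}}},
    }


# Example: the body that would be sent for one expanded search result.
print(json.dumps(page_hits_query("example-object-1", "historic highway"), indent=2))
```

Because one such query is issued per compound object, deferring it until a result is expanded keeps the default results page to a single search.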

-Schuyler




Re: [CODE4LIB] OregonDigital's BookReader integration (was: Listserv communication)

2016-03-02 Thread Laura Buchholz
Thanks guys, and thank you Shaun, for following up. This is exactly what I
was hoping to learn.

I have to admit I'm surprised that the "direct-to-page-link" functionality
isn't more common in the newer/inspiring digital collections. It exists in
CONTENTdm (not saying that is a reason it should continue to exist), and
seems intuitively useful. We're planning on doing some usability testing
soon, and I'm going to try to get feedback on this feature.


Re: [CODE4LIB] OregonDigital's BookReader integration (was: Listserv communication)

2016-03-01 Thread Gum, Josh
Shaun, 

Thanks, I’m psyched to be at OSU! 

I think you’ve nailed down the process here, and there are a couple of
concepts that I wanted to follow up on:

1. “Download document from search results list”: This would be a simple
enhancement to the rendering of each search result, exposing the download
link. The software has access to all of the necessary values (document ID, and
how to generate a “downloads” link for it) at render time, so adding a new link
should be trivial. It seems like it would be a good enhancement to me.

2. “Direct-to-page link”: Generating a link to guide a PDF reader to a
specific page [1] seems easy, although I’m not sure that every reader would
work the same way. The missing piece is being able to associate a Solr hit with
the page of the PDF it was found in. So I think you’re right about needing to
index each page individually in order to render, for each search result hit, a
link to the specific page where that hit occurs.
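
A rough sketch of what per-page indexing could look like: one Solr document per page, each carrying its parent ID, page number, and a ready-made link. The URL pattern is copied from example [1]; the document ID scheme and the field names (`parent_id_s`, `page_i`, `full_text_t`, `page_link_s`) are hypothetical, and so is the assumption that `page` counts from 1.

```python
from urllib.parse import quote

# Pattern taken from the example link [1]; that it works for every
# object is an assumption.
DOWNLOAD_URL = "http://oregondigital.org/downloads/{doc_id}?page={page}"


def page_link(doc_id, page):
    """Return a direct-to-page download link for one page."""
    return DOWNLOAD_URL.format(doc_id=quote(doc_id, safe=":"), page=page)


def page_documents(doc_id, page_texts):
    """Turn a list of per-page text strings into one Solr document per
    page, so that a hit can be traced back to its page number."""
    docs = []
    for number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{doc_id}-page-{number}",  # hypothetical id scheme
            "parent_id_s": doc_id,            # hypothetical field
            "page_i": number,                 # hypothetical field
            "full_text_t": text,              # hypothetical field
            "page_link_s": page_link(doc_id, number),
        })
    return docs


# Example: two pages of extracted text become two indexable documents.
for doc in page_documents("oregondigital:df66z508t", ["first page", "second page"]):
    print(doc["id"], doc["page_link_s"])
```

The trade-off is index size: a 300-page book becomes 300 documents, and results then need to be grouped back under their parent object at query time.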

I can’t speak to the history behind implementing the search the way it is right
now, but it does seem like both of these concepts would be great additions to
the next installment of OregonDigital!

[1] http://oregondigital.org/downloads/oregondigital:df66z508t?page=3

———
Josh Gum
Oregon State University Libraries and Press






Re: [CODE4LIB] OregonDigital's BookReader integration (was: Listserv communication)

2016-02-29 Thread Chad Mills
So we index the OCR text in two Solr fields: one that is just the OCR text, and
another that is an accumulation of all of the metadata and OCR for a resource.
When a user does a "keyword" search, we search against the Solr field with all
of the accumulated information. We never use the other Solr field of just the
OCR text, except for troubleshooting. On a hit, when a user is directed to a
resource, if the BookReader is available we only offer a "search inside"
option [1]. If the user submits a search using that field, then we only search
the OCR XML, but we do that outside of Solr. We have a service [2] that returns
the payload results for the BookReader to render.

Back when we started implementing the BookReader, when rendering the
resource view I queried that same BookReader search service using the keyword
supplied by the user that initially got them to the resource. I used it as a
kind of look-ahead/hint mechanism. I would then send cues to the interface
to let the user know that X number of hits were found in the document. I was
also able to construct links to the pages in the BookReader where the hits were
found. Functionally, it all worked well and was pretty trivial to implement.
I tracked usage and asked for feedback. A very high percentage of users just
clicked the link to open the BookReader and never used the hinting mechanism.
So I disabled it, finding it not worth the extra load on our system and
clients. I never dug deeper to find out what the greater issue was with the
low level of use.
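
To illustrate the kind of look-ahead described here, a minimal sketch that summarises a search payload as a hit count plus per-page links. The payload shape (`matches`, each with a `par` list whose entries carry a `page`) is assumed from the example posted elsewhere in this thread, not from the Rutgers service, and the link template is hypothetical.

```python
def page_hints(payload, link_template):
    """Summarise a BookReader-style search payload as
    (number of matches, [(page, link), ...]) for display as a hint."""
    matches = payload.get("matches", [])
    # Collect each distinct page once, in ascending order.
    pages = sorted({par["page"] for m in matches for par in m.get("par", [])})
    links = [(page, link_template.format(page=page)) for page in pages]
    return len(matches), links


# Example with a hand-written payload and a made-up viewer URL.
example = {"matches": [{"par": [{"page": 2}]}, {"par": [{"page": 5}]}]}
hits, links = page_hints(example, "https://example.org/reader/item-1#page/{page}")
print(f"{hits} hits found in this document")
for page, link in links:
    print(page, link)
```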

Best,
Chad

[1] https://rucore.libraries.rutgers.edu/rutgers-lib/41256/
[2] https://github.com/RutgersUniversityLibraries/OCR-search-for-IA-reader


***
Chad Mills Rutgers University Libraries
Digital Library Architect  Scholarly Communication Center
Ph: 848.932.5924   Room 409D, Alexander Library
Fax: 848.932.1386  169 College Avenue, New Brunswick, NJ 08901
Cell: 732.309.8538 https://rucore.libraries.rutgers.edu/
***


Re: [CODE4LIB] OregonDigital's BookReader integration (was: Listserv communication)

2016-02-29 Thread Shaun D. Ellis
Josh,
Congrats on the new gig, and thank you for this explanation of OregonDigital’s 
BookReader integration.  I’m sorry I wasn’t more specific about this, but I 
think the original question had less to do with the BookReader integration, and 
more to do with a non-frameworky explanation of configuring Solr to return 
direct links to pages where the keywords appear in a “compound” object, such as 
a book.  

As the original poster (Laura Buchholz) mentioned, it seems like OregonDigital
does not provide direct links until after the BookReader is loaded. It’s only
then that pins are placed on the “slider nav” to indicate where the keyword
appears. So, to answer the original question, it seems like all the full text
may be dumped into a single Solr field that returns the object in the initial
search result, and then, upon loading, the BookReader makes a subsequent query
(limited to that one object) to retrieve the “data payload” in your example and
locate the exact pages where the terms appear? Is that what’s going on
there?

I suppose if you wanted to return all the page numbers in the original search 
query, you may have to send each page individually to Solr to be indexed, and 
if you have a viewer with conventions for "deep linking" (like the BookReader 
has) you could generate the link for each page and index it to provide this 
functionality.  
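
As a sketch of the query side of this idea: if each page is its own Solr document, Solr's result grouping (`group=true`, `group.field`, `group.limit`) can collapse page hits back under their parent object in a single request. The field names below (`full_text_t`, `parent_id_s`, `page_i`, `page_link_s`) are assumptions for illustration, and the search terms are not escaped.

```python
from urllib.parse import urlencode


def grouped_page_query(terms, rows=10, pages_per_object=3):
    """Build Solr select parameters that search page-level documents
    and group the hits by their parent object."""
    params = {
        "q": f"full_text_t:({terms})",    # hypothetical field
        "group": "true",
        "group.field": "parent_id_s",     # hypothetical field
        "group.limit": pages_per_object,  # page hits listed per object
        "fl": "id,page_i,page_link_s",    # hypothetical fields
        "rows": rows,                     # number of groups (objects)
        "wt": "json",
    }
    return urlencode(params)


# Example: append this query string to a core's /select handler URL.
print(grouped_page_query("historic highway"))
```

Each group in the response then corresponds to one book, with up to `pages_per_object` page documents whose stored links can be rendered directly in the results list.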

I was curious as folks were posting all the inspiring digital collections sites
earlier today, so I looked for this pattern but didn’t see it. Most of the
apps use the same pattern as OregonDigital (although my testing was not
particularly thorough, so let me know if I’m wrong, folks!). On the other hand,
you do see the "direct-to-page link" interface with both Amazon and Google
Books search, which take you directly to the page from the initial search
results.

So, I’m not sure if this was a conscious design decision on the part of library 
digital collections creators, if the pattern is followed because it’s 
considered a “best practice” or a “convention” in our field, or if it was just 
simpler to implement.  

Thanks again for the follow up,
Shaun

> On Feb 26, 2016, at 2:51 PM, Gum, Josh wrote:
> 
> I’m very new (<1 month) to Oregon State University, library technology, and
> Code4Lib, so please bear with me. Also, a disclaimer: I may be missing some
> of the picture here. I’m willing to lend a hand digging into more details if
> needed, so please feel free to ask.
> 
> Also, I’m going to split this part of the discussion into a separate thread,
> so we can address the question regarding the OregonDigital BookReader
> integration. I’ve done some digging this morning, and spoke to a colleague
> who took part in some of the text extraction for PDF assets in
> OregonDigital. I’m hopeful that these details are enough to help connect the
> dots regarding our integration.
> 
> 
> When ingesting a PDF asset [1], we have a shell-based processor [2] which
> executes “pdftotext” [3] to extract and store the text from a PDF, with
> bounding boxes around each word in the file.
> 
> The command executed on the server:
> pdftotext -enc UTF-8 '#{file_path}' '#{output_file}' -bbox
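
A minimal sketch of running that command from Python and reading the result back. It assumes the `-bbox` output is XHTML with `page` elements containing `word` elements that carry `xMin`, `yMin`, `xMax` and `yMax` attributes, which is worth confirming against your installed poppler version. The function names are made up for illustration and this is not the OregonDigital processor.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_pdftotext_bbox(pdf_path, out_path):
    """Run the command quoted above; needs pdftotext on the PATH."""
    subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, out_path, "-bbox"],
        check=True,
    )


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}word'."""
    return tag.rsplit("}", 1)[-1]


def parse_bbox(xhtml):
    """Return a list of (page, word, xMin, yMin, xMax, yMax) tuples,
    numbering pages from 1 in document order."""
    words = []
    page_number = 0
    for element in ET.fromstring(xhtml).iter():
        name = local_name(element.tag)
        if name == "page":
            page_number += 1
        elif name == "word":
            box = tuple(
                float(element.get(key))
                for key in ("xMin", "yMin", "xMax", "yMax")
            )
            words.append((page_number, element.text or "") + box)
    return words
```

Reading the output file as bytes (`open(out_path, "rb").read()`) and passing that to `parse_bbox` lets the XML parser handle the encoding itself.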
> 
> The web UI for viewing a PDF and highlighting results is tied to BookReader 
> [4], which has a great amount of functionality and is well documented online! 
> [5]
> 
> The BookReader is making calls to a “full_text” action on the 
> document_controller to find the location of the search terms. [6] This JSONP 
> call to our web server uses OregonDigital::OCR::BookreaderSearchGenerator [7] 
> to supply the properly formatted page and bounding box results to BookReader 
> to use in updating its UI with the appropriate highlights and place marker 
> icons. If you use something like the Chrome DevTools while searching for a 
> term on the BookReader UI, you can see the data payload that is returned from 
> the server. For instance, here’s a snippet of one search I did:
> 
> 
> (apologies if the tabs don’t remain in the email)
> matches: [
>   {
>     par: [
>       {
>         page: 2,
>         boxes: [
>           {r: 128.62286274509802, l: 101.30935784313726, b: 27.52538962121212, t: 19.953774090909093, page: 2}
>           {r: 59.883534313725484, l: 29.41176470588235, b: 242.4078138636364, t: 234.8361983336, page: 2}
>           {r: 106.32754411764705, l: 80.37296078431372, b: 546.3512438560606, t: 538.7796283257576, page: 2}
>         ]
>         text: "McKenzie Highway {{{Historic}}} District…
>       }
>     ]
>   }
> ]
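
A rough sketch of how a payload of this shape could be assembled from per-word bounding boxes. It is not the BookreaderSearchGenerator code: the input tuple layout, the exact-word matching, the one-match-per-page grouping, and the mapping of xMin/yMin/xMax/yMax onto l/t/r/b are all assumptions, and it ignores any scaling between PDF coordinates and page-image pixels.

```python
PUNCTUATION = ".,;:!?\"'()"


def search_payload(words, term):
    """Build a BookReader-style 'matches' structure.

    `words` is a list of (page, text, x_min, y_min, x_max, y_max) tuples.
    """
    needle = term.lower()
    boxes_by_page = {}
    for page, text, x_min, y_min, x_max, y_max in words:
        # Case-insensitive whole-word match, ignoring edge punctuation.
        if text.strip(PUNCTUATION).lower() == needle:
            boxes_by_page.setdefault(page, []).append(
                {"l": x_min, "t": y_min, "r": x_max, "b": y_max, "page": page}
            )
    matches = []
    for page, boxes in sorted(boxes_by_page.items()):
        matches.append({
            "par": [{
                "page": page,
                "boxes": boxes,
                # Triple braces mark the highlighted term, as in the
                # snippet above; surrounding context is omitted here.
                "text": "{{{" + term + "}}}",
            }]
        })
    return {"matches": matches}
```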
> 
> 
> [1] 
> https://github.com/OregonDigital/oregondigital/blob/master/app/models/document.rb
> 
> [2] 
>