[dspace-tech] BookReader integration in DSpace + fulltext searching inside the document

Pedro Amorim Thu, 06 Apr 2017 10:24:06 -0700

Hello everyone,

I'm currently planning an implementation of the feature in the subject line 
and would like the developers' opinion on how to go about it.


I have seen the great resources provided by Peter Dietz@LongSight regarding 
the integration of BookReader 
<https://github.com/internetarchive/bookreader> in DSpace, such as the video 
demo <https://www.youtube.com/watch?v=mZkvfxPrwZw> and the source code 
<https://github.com/peterdietz/DSpace/tree/bookreader/dspace-xmlui/src/main/webapp/themes/wheaton-mirage2/vendor/BookReader>.
It all works great, provided that the bitstreams contained in the item 
follow a specific naming convention (001.jpg, 002.jpg, 003.jpg, etc.) so 
that the client app can request and render them in the correct order, 
request page ranges, etc.
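
To illustrate the naming convention, here is a minimal Python sketch of 
how page images could be ordered and a page range selected. The function 
names are hypothetical and it assumes the pages are numbered without gaps; 
it is not how BookReader or DSpace actually do it.

```python
import re

# Matches page images named like 001.jpg, 002.jpg, ... (assumed convention).
PAGE_NAME = re.compile(r"^(\d+)\.jpg$", re.IGNORECASE)


def ordered_pages(bitstream_names):
    """Return only the page images, sorted by their numeric page number."""
    pages = []
    for name in bitstream_names:
        match = PAGE_NAME.match(name)
        if match:
            pages.append((int(match.group(1)), name))
    return [name for _, name in sorted(pages)]


def page_range(bitstream_names, first, last):
    """Return the images at positions first..last (1-based, inclusive)."""
    return ordered_pages(bitstream_names)[first - 1:last]
```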

However, searching within the document itself is disabled because, I 
believe, this particular feature needs a backend to supply the client app 
with the needed information.
This can be seen in production on archive.org, for example by searching 
for the term *Socrates* within a book 
<https://archive.org/stream/in.ernet.dli.2015.50197/2015.50197.Plato#page/n105/mode/2up/search/Socrates>.

The backend of the Internet Archive's BookReader returns a JSON entry for 
every hit, for example:

{
    "text": "fly towards him, nestle in his breast, and then spread its wings and soai upwards, singing most sweetly The next morning Ariston appeared, leading his son Plato to the philosopher, and {{{Socrates}}} knew that his dieam was fulfilled", 
    "par": [
        {
            "boxes": [
                {
                    "r": 694, 
                    "b": 412, 
                    "t": 358, 
                    "page": 10, 
                    "l": 531
                }
            ], 
            "b": 463, 
            "t": 172, 
            "page_width": 1243, 
            "r": 1146, 
            "l": 28, 
            "page_height": 2123, 
            "page": 10
        }
    ]
}

This makes sense because with this info the client app can *1)* correctly 
pinpoint the specific pages where the term is found and *2)* correctly 
render the highlight box around the searched term within the page being 
presented using the 'coordinates' and dimensions.
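
As a rough sketch of point 2), this is how a client could turn one hit 
into highlight rectangles for a page shown at a different size than the 
original scan. It assumes that l/t/r/b are left/top/right/bottom in pixels 
of the scanned page, which is my reading of the JSON above and not 
something I have confirmed; the function names are made up.

```python
def scale_box(box, page_width, page_height, display_width, display_height):
    """Scale one hit box from scan coordinates to display coordinates."""
    sx = display_width / page_width
    sy = display_height / page_height
    return {
        "left": box["l"] * sx,
        "top": box["t"] * sy,
        "width": (box["r"] - box["l"]) * sx,
        "height": (box["b"] - box["t"]) * sy,
    }


def highlight_rects(hit, display_width, display_height):
    """Return one display rectangle (with its page) per box in a hit."""
    rects = []
    for par in hit["par"]:
        for box in par["boxes"]:
            rect = scale_box(box, par["page_width"], par["page_height"],
                             display_width, display_height)
            rect["page"] = box["page"]
            rects.append(rect)
    return rects
```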

*Assuming:*
1) All the required bitstreams are in JPEG format and follow the naming 
convention mentioned above;
2) The required word location information is available in ALTO.xml files 
(DSpace wouldn't generate that info, it would only need to process/serve it).

*How would one have DSpace act as a backend for the BookReader client app?*

The best idea I've come up with thus far is to build a custom media-filter 
that would interpret the word information contained in the ALTO.xml files 
for each item and store this information in a new custom SOLR index, which 
would afterwards be queried by the client app. Every item would have its 
own word index with information for each word (page, width, height, vpos, 
hpos). This means the index would hold one entry per word occurrence, and 
only the *hits* would be served to the client app.
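
To make the extraction step concrete, here is a minimal Python sketch of 
reading the word information out of one ALTO page. It relies on the 
standard ALTO String attributes (CONTENT, HPOS, VPOS, WIDTH, HEIGHT) and 
assumes one ALTO file per page with coordinates in pixels; the function 
names and record layout are hypothetical, and a real media-filter would be 
written in Java inside DSpace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, so any ALTO schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def alto_words(alto_xml, page_number):
    """Return one record per ALTO String element on the given page."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        words.append({
            "page": page_number,
            "value": element.get("CONTENT", ""),
            "hpos": float(element.get("HPOS", 0)),
            "vpos": float(element.get("VPOS", 0)),
            "width": float(element.get("WIDTH", 0)),
            "height": float(element.get("HEIGHT", 0)),
        })
    return words
```

Each returned record corresponds to one entry of the word index described 
above.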

For example, the following query:
<DSpaceURL>/solr/search/select?q=search.resourceid:<itemID>&word.value=<searchTerm>

This would return the information for all the occurrences in the *word* 
index with the value <searchTerm> (in the example above: Socrates).
If this could be accomplished, it should work, at least in theory.
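
As a sketch of the serving side, this is how word records (page, value, 
hpos, vpos, width, height) could be filtered and reshaped into entries 
resembling the BookReader JSON above. The record layout and function names 
are hypothetical, and two things are simplified: the paragraph bounds are 
set to the word's own box, and "text" carries only the matched word 
instead of the surrounding snippet.

```python
import string


def word_to_hit(word, page_width, page_height):
    """Reshape one word record into a BookReader-like hit entry."""
    box = {
        "l": word["hpos"],
        "t": word["vpos"],
        "r": word["hpos"] + word["width"],
        "b": word["vpos"] + word["height"],
        "page": word["page"],
    }
    return {
        "text": "{{{%s}}}" % word["value"],
        "par": [{
            "boxes": [box],
            # Simplification: paragraph bounds reuse the word's own box.
            "l": box["l"], "t": box["t"], "r": box["r"], "b": box["b"],
            "page_width": page_width,
            "page_height": page_height,
            "page": word["page"],
        }],
    }


def find_hits(words, term, page_width, page_height):
    """Return hits for words equal to term, ignoring case and punctuation."""
    wanted = term.lower()
    return [
        word_to_hit(word, page_width, page_height)
        for word in words
        if word["value"].strip(string.punctuation).lower() == wanted
    ]
```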

Does anyone have other ideas on this? Has anyone implemented something 
similar, or thought about it before?

Sorry for the wall of text.

Thanks as always,

Pedro Amorim





