Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
The Lucene Highlighter doesn't require that the text you want highlighted be stored. In fact, you can pass in any arbitrary text to the Highlighter. See the various getBestFragments methods of the Highlighter class:

http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/highlight/Highlighter.html

Erik

On Sep 29, 2009, at 7:01 AM, Yitzchak Schaffer wrote:

> Hello,
>
> Sorry for any cross-posting annoyance.
>
> I have a request for a Greenstone collection I'm working on, to add context snippets to search results; for example, a search for yak culture might return this in the list of results:
>
> ... addressing the fine points of <strong>yak culture</strong>, the zoosociologists took into account ...
>
> Sounds like a pretty basic feature, say our sponsors, and I agree. (Ah, it's also an old Trac ticket at http://trac.greenstone.org/ticket/444)
>
> I see that GS out-of-the-box is set *not* to store the fulltext in the index, which seems to be a prerequisite for this kind of thing, as in http://bit.ly/ljNkL . Has anyone modified the Lucene indexing wrapper locally to do this?
>
> Given that we don't have any Java coders on staff, I've started porting the Lucene wrapper to PHP for use with a custom builder.pl and Zend_Search_Lucene. I already have a PHP frontend, so adjusting that to display the results shouldn't be a problem; OTOH because the frontend is PHP, I'm restricted to using buildtype lucene, or something else with good PHP support.
>
> Many thanks,
>
> --
> Yitzchak Schaffer
> Systems Manager
> Touro College Libraries
> 33 West 23rd Street
> New York, NY 10010
> Tel (212) 463-0400 x5230
> Fax (212) 627-3197
> Email yitzchak.schaf...@gmx.com
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
Erik Hatcher wrote:

> The Lucene Highlighter doesn't require that the text you want highlighted be stored. In fact, you can pass in any arbitrary text to the Highlighter.

Thanks Erik,

What I'm looking for is to return the context of the search result, not just the ID of the containing document. E.g., when all I input is yak culture, I get back the context from the document as a search result, without having to retrieve the doc itself:

... addressing the fine points of <strong>yak culture</strong>, the zoosociologists took into account ...

GS out of the box does not appear to support this, as it does not store the fulltext in the index. So yes, I can highlight stuff, but as it stands, I don't have the text to work with. IANA Lucene guru, so correct me if I misunderstand.

--
Yitzchak Schaffer
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
On Sep 29, 2009, at 7:33 AM, Yitzchak Schaffer wrote:

> Erik Hatcher wrote:
>
> > The Lucene Highlighter doesn't require that the text you want highlighted be stored. In fact, you can pass in any arbitrary text to the Highlighter.
>
> Thanks Erik,
>
> What I'm looking for is to return the context of the search result, not just the ID of the containing document. E.g., when all I input is yak culture, I get back the context from the document as a search result, without having to retrieve the doc itself:
>
> ... addressing the fine points of <strong>yak culture</strong>, the zoosociologists took into account ...
>
> GS out of the box does not appear to support this, as it does not store the fulltext in the index. So yes, I can highlight stuff, but as it stands, I don't have the text to work with. IANA Lucene guru, so correct me if I misunderstand.

I'm a bit confused then. You mentioned that somehow Zend Lucene was going to help, but if you don't have the text to highlight anywhere, then the Highlighter isn't going to be of any use.

Again, you don't need the full text in the Lucene index, but you do need to get it from somewhere in order to be able to highlight it.

Erik
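To make Erik's point concrete, that the text to highlight can come from anywhere and not only from the index, here is a minimal Python sketch. It is not Lucene's Highlighter and does not reproduce getBestFragments; the function name `context_snippet`, the `radius` parameter, and the sample sentence are assumptions made up for illustration. It only shows that, given a hit and the document text fetched from some other place (a file, a database), a context snippet can be cut out and marked up.

```python
import re
from html import escape


def context_snippet(text, query, radius=40):
    """Return a fragment of `text` around the first match of `query`.

    The match is wrapped in <strong> tags; None is returned when the
    query does not occur. `text` may come from any source; nothing here
    depends on a search index. (Hypothetical helper, not a Lucene API.)
    """
    match = re.search(re.escape(query), text, flags=re.IGNORECASE)
    if match is None:
        return None

    # Clamp the window to the bounds of the text.
    start = max(0, match.start() - radius)
    end = min(len(text), match.end() + radius)

    before = escape(text[start:match.start()])
    hit = escape(match.group(0))
    after = escape(text[match.end():end])

    # Mark truncation with ellipses, as in the example in this thread.
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(text) else ""
    return f"{prefix}{before}<strong>{hit}</strong>{after}{suffix}"


if __name__ == "__main__":
    doc = ("In addressing the fine points of yak culture, the "
           "zoosociologists took into account many factors.")
    print(context_snippet(doc, "yak culture", radius=30))
```

The catch Yitzchak describes still applies: something has to supply `text` for every hit, so either the search layer returns it or the frontend loads each document.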
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
Erik Hatcher wrote:

> I'm a bit confused then. You mentioned that somehow Zend Lucene was going to help, but if you don't have the text to highlight anywhere, then the Highlighter isn't going to be of any use.
>
> Again, you don't need the full text in the Lucene index, but you do need to get it from somewhere in order to be able to highlight it.

Erik,

I started to port the native Greenstone Java Lucene wrapper to PHP, so I could then modify it to add this feature, as I don't know Java. This would mean using Zend Lucene for the actual indexing implementation. My question is whether anyone's already done it, in Java or otherwise.

Thanks for the clarification,

--
Yitzchak Schaffer
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
Yitzchak, are you interested in actually searching the fulltext? Or just highlighting the terms?

If you're only interested in highlighting, it might be a whole lot easier to implement this in JavaScript through something like jQuery:

http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html

That way you're not juggling mostly redundant Lucene indexes and trying to keep them synced.

How are you getting your search results? Does Greenstone have some sort of search API that returns the highlighted results?

Would it make a difference if you could add a field to the Lucene document (meaning, would you have access to it through your PHP API to Greenstone)? If so, you could probably do this pretty easily via one of the JVM scripting languages (Groovy, JRuby, Jython, Quercus -- PHP in the JVM), so you have just the single Lucene index instead of multiple.

Another approach might be to serve the Lucene index via Solr [1] or Lucene-WS (http://lucene-ws.net/), which would allow you to skip Greenstone altogether for searching.

Basically, I would try to avoid going the Zend_Lucene route if at all possible.

-Ross.

1. http://www.google.com/search?q=solr+on+an+existing+lucene+index&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

On Tue, Sep 29, 2009 at 11:32 AM, Yitzchak Schaffer <yitzchak.schaf...@gmx.com> wrote:

> Erik Hatcher wrote:
>
> > I'm a bit confused then. You mentioned that somehow Zend Lucene was going to help, but if you don't have the text to highlight anywhere, then the Highlighter isn't going to be of any use.
> >
> > Again, you don't need the full text in the Lucene index, but you do need to get it from somewhere in order to be able to highlight it.
>
> Erik,
>
> I started to port the native Greenstone Java Lucene wrapper to PHP, so I could then modify it to add this feature, as I don't know Java. This would mean using Zend Lucene for the actual indexing implementation. My question is whether anyone's already done it, in Java or otherwise.
>
> Thanks for the clarification,
>
> --
> Yitzchak Schaffer
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
Ross Singer wrote:

> Yitzchak, are you interested in actually searching the fulltext? Or just highlighting the terms?

Sorry this wasn't clearer. Let me re-summarize, and report on a new development:

- Greenstone allows for Lucene as one of the indexing plugins
- I took advantage of this for use in our PHP frontend, EmeraldView (http://emeraldview.tourolib.org/)
- Greenstone includes a Java wrapper class for Lucene which indexes documents as the collection is built
- This wrapper class indexes but does not store the document full text; thus a search only returns document IDs of hits.

This means that, in order to place search terms in context, we have to load the actual documents. I want the search API itself to return the surrounding text.

New info: I was in fact able to hack the Java to include the full text in the index. It was just a matter of adding a line of code and an if statement, once I'd been immersed in the code long enough. Trying to port it to PHP (i.e. rewrite it) was instrumental in figuring out why in the world the Greenstone indexing code is structured the way it is.

--
Yitzchak Schaffer
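The indexed-versus-stored distinction summarized above can be illustrated with a toy Python sketch. This is not the Greenstone wrapper and not Lucene code; the class `ToyIndex` and its `store_text` flag are hypothetical names invented here. In Lucene itself the comparable choice is made per field with Field.Store.YES or Field.Store.NO; what the actual Greenstone change looks like is not shown in this thread, so nothing below should be read as that patch.

```python
from collections import defaultdict


class ToyIndex:
    """A tiny inverted index that optionally keeps the original text.

    Hypothetical illustration only: with store_text=False a search can
    report which documents matched but has no text to build a snippet
    from; with store_text=True the text comes back with each hit.
    """

    def __init__(self, store_text=False):
        self.store_text = store_text
        self.postings = defaultdict(set)  # term -> set of document IDs
        self.stored = {}                  # document ID -> original text

    def add(self, doc_id, text):
        # Always index the terms so the document is searchable.
        for token in text.lower().split():
            self.postings[token.strip(".,;:!?")].add(doc_id)
        # Keeping the text is the extra, optional step.
        if self.store_text:
            self.stored[doc_id] = text

    def search(self, term):
        """Return (doc_id, stored_text_or_None) pairs for a single term."""
        hits = sorted(self.postings.get(term.lower(), set()))
        return [(doc_id, self.stored.get(doc_id)) for doc_id in hits]


if __name__ == "__main__":
    text = "The fine points of yak culture."
    for flag in (False, True):
        index = ToyIndex(store_text=flag)
        index.add("doc1", text)
        print(flag, index.search("yak"))
```

Storing the text makes the index larger, which may be one reason a default configuration leaves it out; that is a guess, not something stated in this thread.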
Re: [CODE4LIB] Greenstone: tweaking Lucene indexing
Ross Singer wrote:

> Yitzchak, are you interested in actually searching the fulltext? Or just highlighting the terms?

Just in case my earlier response didn't make it crystal clear: we're trying to search the fulltext, and put the search string in context within the document which includes it.

--
Yitzchak Schaffer