Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Erik Hatcher
The Lucene Highlighter doesn't require that the text you want  
highlighted be stored.  In fact, you can pass in any arbitrary text to  
the Highlighter.


See the various getBestFragments from the Highlighter class:
  http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/highlight/Highlighter.html 
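
As a rough illustration of that point, here is a minimal sketch in plain Python. It is not the Lucene Highlighter API; the function name `best_fragment` and its parameters are made up for this example. It only shows that a snippet can be built from any string you hand it, wherever that string came from.

```python
import re


def best_fragment(text, terms, width=60, tag="strong"):
    """Return a window of `text` around the first match, with matches wrapped in <tag>.

    `text` is any string at all; nothing here reads from an index.
    """
    if not terms:
        return None
    # One case-insensitive pattern covering every search term.
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return None
    # Cut a window of `width` characters on each side of the first match.
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    fragment = text[start:end]
    # Wrap every match inside the window in the highlighting tag.
    highlighted = pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", fragment)
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(text) else ""
    return prefix + highlighted + suffix
```

The window is cut by character count, so it may start or end mid-word; a real highlighter works on tokens and scores competing fragments.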



Erik


On Sep 29, 2009, at 7:01 AM, Yitzchak Schaffer wrote:


Hello,

Sorry for any cross-posting annoyance.  I have a request for a  
Greenstone collection I'm working on, to add context snippets to  
search results; for example a search for yak culture might return  
this in the list of results:


... addressing the fine points of <strong>yak culture</strong>, the
zoosociologists took into account ...


Sounds like a pretty basic feature, say our sponsors, and I agree.   
(Ah, it's also an old Trac ticket at http://trac.greenstone.org/ticket/444)


I see that GS out-of-the-box is set *not* to store the fulltext in  
the index, which seems to be a prerequisite for this kind of thing,  
as in http://bit.ly/ljNkL .  Has anyone modified the Lucene indexing  
wrapper locally to do this?


Given that we don't have any Java coders on staff, I've started  
porting the Lucene wrapper to PHP for use with a custombuilder.pl  
and Zend_Search_Lucene.  I already have a PHP frontend, so adjusting  
that to display the results shouldn't be a problem; OTOH because the  
frontend is PHP, I'm restricted to using buildtype lucene, or  
something else with good PHP support.


Many thanks,

--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
33 West 23rd Street
New York, NY 10010
Tel (212) 463-0400 x5230
Fax (212) 627-3197
Email yitzchak.schaf...@gmx.com


Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Yitzchak Schaffer

Erik Hatcher wrote:
The Lucene Highlighter doesn't require that the text you want 
highlighted be stored.  In fact, you can pass in any arbitrary text to 
the Highlighter.


Thanks Erik,

What I'm looking for is to return the context of the search result, not 
just the ID of the containing document - e.g. when all I input is yak 
culture, I get back the context from the document as a search result, 
without having to retrieve the doc itself:


... addressing the fine points of <strong>yak culture</strong>, the
zoosociologists took into account ...


GS out of the box does not appear to support this, as it does not store 
the fulltext in the index.  So yes, I can highlight stuff, but as it 
stands, I don't have the text to work with.  IANA Lucene guru, so 
correct me if I misunderstand.




Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Erik Hatcher

On Sep 29, 2009, at 7:33 AM, Yitzchak Schaffer wrote:


Erik Hatcher wrote:
The Lucene Highlighter doesn't require that the text you want  
highlighted be stored.  In fact, you can pass in any arbitrary text  
to the Highlighter.


Thanks Erik,

What I'm looking for is to return the context of the search result,  
not just the ID of the containing document - e.g. when all I input  
is yak culture, I get back the context from the document as a  
search result, without having to retrieve the doc itself:


... addressing the fine points of <strong>yak culture</strong>, the
zoosociologists took into account ...


GS out of the box does not appear to support this, as it does not  
store the fulltext in the index.  So yes, I can highlight stuff, but  
as it stands, I don't have the text to work with.  IANA Lucene guru,  
so correct me if I misunderstand.


I'm a bit confused then.  You mentioned that somehow Zend Lucene was
going to help, but if you don't have the text to highlight anywhere
then the Highlighter isn't going to be of any use.  Again, you don't
need the full text in the Lucene index, but you do need to get it from
somewhere in order to be able to highlight it.
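
To make the "get it from somewhere" step concrete, a minimal sketch in plain Python follows. It assumes the search only returns document IDs, and that some caller-supplied function can fetch the text for an ID (from disk, a database, or wherever the collection keeps it). The names `snippet`, `results_with_context`, and `load_text` are hypothetical, not part of Lucene or Greenstone.

```python
import re


def snippet(text, term, width=30):
    """Return `term` in context, wrapped in <strong>, or None if absent."""
    if text is None:
        return None
    match = re.search(re.escape(term), text, re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    return (text[start:match.start()]
            + "<strong>" + match.group(0) + "</strong>"
            + text[match.end():end])


def results_with_context(hit_ids, load_text, term):
    """Pair each hit ID with a snippet built from externally loaded text.

    hit_ids:   document IDs returned by the search (the index stores no text)
    load_text: callable mapping a document ID to its full text, or None
    """
    return [(doc_id, snippet(load_text(doc_id), term)) for doc_id in hit_ids]
```

The cost of this arrangement is one text lookup per displayed hit, which is the extra document load the thread is trying to avoid.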


Erik


Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Yitzchak Schaffer

Erik Hatcher wrote:
I'm a bit confused then.  You mentioned that somehow Zend Lucene was
going to help, but if you don't have the text to highlight anywhere then
the Highlighter isn't going to be of any use.  Again, you don't need the
full text in the Lucene index, but you do need to get it from somewhere
in order to be able to highlight it.


Erik,

I started to port the native Greenstone Java Lucene wrapper to PHP, so I 
could then modify it to add this feature, as I don't know Java.  This 
would mean using Zend Lucene for the actual indexing implementation.  My 
question is whether anyone's already done it, in Java or otherwise.


Thanks for the clarification,



Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Ross Singer
Yitzchak, are you interested in actually searching the fulltext?  Or just
highlighting the terms?

If you're only interested in highlighting it, it might be a whole lot easier
to implement this in javascript through something like jQuery:

http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html

That way you're not juggling mostly redundant Lucene indexes and trying to
keep them synced.

How are you getting your search results?  Does Greenstone have some sort of
search API that returns the highlighted results?  Would it make a difference
if you could add a field to the Lucene document (meaning would you have
access to it through your PHP API to Greenstone)?  If so, you could probably
do this pretty easily via one of the JVM scripting languages (Groovy, JRuby,
Jython, Quercus -- PHP in the JVM) so you just have the single Lucene index
instead of multiple.

Another approach might be to serve the Lucene index via Solr [1] or
Lucene-WS (http://lucene-ws.net/) which would allow you to skip Greenstone
altogether for searching.

Basically, I would try to avoid going the Zend_Lucene route if at all
possible.

-Ross.

1. http://www.google.com/search?q=solr+on+an+existing+lucene+index&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a




Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Yitzchak Schaffer

Ross Singer wrote:

Yitzchak, are you interested in actually searching the fulltext?  Or just
highlighting the terms?


Sorry this wasn't clearer.  Let me re-summarize, and report on a new 
development:


- Greenstone allows for Lucene as one of the indexing plugins

- I took advantage of this for use in our PHP frontend, EmeraldView 
(http://emeraldview.tourolib.org/)


- Greenstone includes a Java wrapper class for Lucene which indexes 
documents as the collection is built


- This wrapper class indexes but does not store the document full text; 
thus a search only returns document IDs of hits.  This means that, in 
order to place search terms in context, we have to load the actual 
documents.  I want the search API itself to return the surrounding text.


New info:

I was in fact able to hack the Java to include the full text in the 
index.  Just a matter of adding a line of code and an if statement, 
once I'd been immersed in the code long enough.  Trying to port it to 
PHP (i.e. rewrite it) was instrumental in figuring out why in the world 
the Greenstone indexing code is structured the way it is.
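
The shape of that change can be sketched with a toy index in plain Python. This is not the Greenstone wrapper code, and the class `ToyIndex` and its `store_text` flag are invented for illustration. In Lucene terms it corresponds roughly to the difference between creating a field with Field.Store.NO and with Field.Store.YES.

```python
from collections import defaultdict


class ToyIndex:
    """A tiny inverted index that can optionally keep each document's text."""

    def __init__(self, store_text=False):
        self.store_text = store_text
        self.postings = defaultdict(set)  # token -> set of document IDs
        self.stored = {}                  # document ID -> full text (optional)

    def add(self, doc_id, text):
        # Indexing: record which documents contain which tokens.
        for token in text.lower().split():
            self.postings[token.strip(".,;:!?")].add(doc_id)
        # Storing: the extra "if statement" and "one line" of the change.
        if self.store_text:
            self.stored[doc_id] = text

    def search(self, term):
        """Return (doc_id, stored_text) pairs; stored_text is None if not stored."""
        hits = sorted(self.postings.get(term.lower(), ()))
        return [(doc_id, self.stored.get(doc_id)) for doc_id in hits]
```

With `store_text=True` the search result carries the text needed for a context snippet, at the price of a larger index.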




Re: [CODE4LIB] Greenstone: tweaking Lucene indexing

2009-09-29 Thread Yitzchak Schaffer

Ross Singer wrote:

Yitzchak, are you interested in actually searching the fulltext?  Or just
highlighting the terms?


Just in case my earlier response didn't make it crystal clear: we're 
trying to search the fulltext, and put the search string in context 
within the document which includes it.

