David,
I tried your technique. I am streaming the PDF file directly into the
Lucene highlighter as shown below, and I get an NPE in
highlighter.getBestFragments(tokenStream, docAsString, 3, "...");
The API doc is not very clear here. I also fed the contents of the query
string (instead of docAsString) to this method and I still get the NPE.
Can you shed some light on this, please? Please post your code snippet
if you can!
My code snippet:
    File f = new File(sourceDocLocation);
    if (!f.exists())
    {
        log.debug("File does not exist" + f.getAbsolutePath() + " " + f.getName());
        return null;
    }
    org.apache.lucene.document.Document doc = LucenePDFDocument.getDocument(f);
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    TokenStream tokenStream = new SimpleAnalyzer().tokenStream(FIELD_NAME, new FileReader(f));
    doc.add(Field.Text("contents", new FileReader(f)));
    // Get 3 best fragments and separate with a "..."
    =========>>>>>>>>
    result = highlighter.getBestFragments(tokenStream, queryString, 3, "...");
    <<<<<<<<========
Thanks,
Vijay Balasubramanian
DPRA Inc.,
214 665 7503
From: David Spencer <dave-lucene-user@tropo.com>
To: Lucene Users List <[EMAIL PROTECTED]>
cc:
Subject: Re: Highlighting PDF file after the search
Date: 09/20/2004 05:02 PM
Please respond to Lucene Users List
[EMAIL PROTECTED] wrote:
>
>
>
> Hello,
>
> I can successfully index and search the PDF documents, however I am not
> able to highlight the searched text in my original PDF file (i.e. like
> dtSearch highlights on the original file).
>
> I took a look at the highlighter in the sandbox, compiled it and have it
> ready. I am wondering if this highlighter is for highlighting indexed
> documents or can it be used for PDF files as is! Please enlighten!
I did this a few weeks ago.
There are two ways, and they both revolve around the same thing: you need
the tokenized PDF text available.
[a] Store the tokenized PDF text in the index, or in some other file on
disk i.e. a "cache" ( but cache is a misleading term, as you can't have
a cache miss unless you can do [b]).
[b] Tokenize it on the fly when you call getBestFragments() - the 1st
arg, the TokenStream, should be one that takes a PDF file as input and
tokenizes it.
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/org/apache/lucene/search/highlight/Highlighter.html#getBestFragments(org.apache.lucene.analysis.TokenStream,%20java.lang.String,%20int,%20java.lang.String)
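
One thing worth spelling out for either option: the highlighter slices its
2nd arg (the text) using the character offsets carried by the tokens in the
1st arg, so both have to come from the same extracted PDF text. The sketch
below illustrates that with plain Java only. Tok, tokenize() and highlight()
are hypothetical stand-ins I made up for illustration, not Lucene classes,
and this is not claimed to be the cause of any particular NPE. It only shows
why a TokenStream built from one source and a text string from another
(e.g. the query string) cannot line up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a token with character offsets (not a Lucene class).
final class Tok {
    final String term;
    final int start; // offset of first char in the source text
    final int end;   // offset one past the last char
    Tok(String term, int start, int end) {
        this.term = term;
        this.start = start;
        this.end = end;
    }
}

final class OffsetHighlightSketch {
    // Split on non-letters and lowercase each run, recording offsets into `text`.
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0, n = text.length();
        while (i < n) {
            while (i < n && !Character.isLetter(text.charAt(i))) i++;
            int start = i;
            while (i < n && Character.isLetter(text.charAt(i))) i++;
            if (i > start) {
                out.add(new Tok(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
            }
        }
        return out;
    }

    // Wrap each token equal to queryTerm in <B>..</B>, slicing `text` by token offsets.
    static String highlight(List<Tok> tokens, String text, String queryTerm) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (Tok t : tokens) {
            if (t.term.equals(queryTerm)) {
                sb.append(text, last, t.start)
                  .append("<B>").append(text, t.start, t.end).append("</B>");
                last = t.end;
            }
        }
        return sb.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        // Assume this string is the text already extracted from the PDF.
        String docAsString = "Lucene can index PDF text. Highlighting needs that text.";
        List<Tok> tokens = tokenize(docAsString);

        // Tokens and text come from the same string, so the offsets line up.
        System.out.println(highlight(tokens, docAsString, "text"));

        // Same tokens, but a different (shorter) string as the text: offsets fall outside it.
        try {
            highlight(tokens, "text", "text");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("offsets do not match the text");
        }
    }
}
```

In Lucene terms, that would mean extracting the PDF text to a String once,
building the TokenStream from a StringReader over that same String, and
passing that same String as the 2nd arg of getBestFragments().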
>
> Thanks,
>
> Vijay Balasubramanian
> DPRA Inc.,
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>