I got the sense from Paul's post that he wanted a solution that didn't
require changing his index, although I'm not sure there is one. Paul,
if you're willing to re-index, you could also store the length of the
text as a numeric field, retrieve that, and use it to drive the
decision about whether to highlight.
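A minimal sketch of what I mean, on the 3.x APIs (the "text_length"
field name and the limit constant are illustrative):

    // Index time: store the text length alongside the body.
    NumericField lengthField =
            new NumericField("text_length", Field.Store.YES, false);
    lengthField.setIntValue(text.length());
    doc.add(lengthField);

    // Search time: load only the small numeric field and decide whether
    // to highlight before ever touching the large stored text.
    Document stored = indexReader.document(scoreDoc.doc,
            new MapFieldSelector("text_length"));
    int fullTextLength = Integer.parseInt(stored.get("text_length"));
    boolean doHighlight = fullTextLength <= EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT;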
-Mike Sokolov
On 6/23/2012 6:17 PM, Jack Krupansky wrote:
Simply have two fields, "full_body" and "limited_body". The former
would index but not store the full document text from Tika (the
"content" metadata). The latter would store, but not necessarily index,
the first 10K or so characters of the full text. Do searches on the
full_body field and highlighting on the limited_body field.
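A minimal sketch with the 3.x Field API (content is the full Tika text;
only the 10K cutoff is from my description above, the rest is
illustrative):

    Document doc = new Document();
    // Searched but not stored:
    doc.add(new Field("full_body", content,
            Field.Store.NO, Field.Index.ANALYZED));
    // Stored prefix used only for highlighting; indexing it is optional:
    int cut = Math.min(content.length(), 10000);
    doc.add(new Field("limited_body", content.substring(0, cut),
            Field.Store.YES, Field.Index.NO));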
-- Jack Krupansky
-----Original Message-----
From: Paul Hill
Sent: Friday, June 22, 2012 2:23 PM
To: [email protected]
Subject: Fast way to get the start of document
Our hit highlighting (using the older Highlighter) is wired with a
"too huge" limit so that we can skip multi-million-character files: not
just via highlighter.setMaxDocCharsToAnalyze, but if a document is
really above the too-huge limit we don't even try, and just produce a
fragment from the front of the document. This gives almost reasonable
response times, even for result sets full of crazy huge documents (or
with just one huge doc). I think this is all pretty normal. Tell me if
I'm wrong.
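For reference, the knob I mean is the standard one on the old
Highlighter; a rough sketch (the formatter, query, and limit constant
here are placeholders):

    Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query));
    // Tokenize at most this many characters of any one document:
    highlighter.setMaxDocCharsToAnalyze(TOO_HUGE_LIMIT);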
Given the above, while timing what was going on, I realized that in the
skip-highlighting case I was reading in the entire body of the text
just to grab the first 100 or so characters.
I was doing
String text = fieldable.stringValue(); // Oh my!
Is there a way to _not_ read in all the multi-million characters and
only _start_ reading the contents of a large field? See the code below,
which got me no better results.
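The only alternative I found is a lazy FieldSelector, but as far as I
can tell laziness only defers the full read rather than making it
partial; a sketch (TEXT is our field name):

    // Load everything eagerly except TEXT, which becomes a lazy field.
    FieldSelector selector = new SetBasedFieldSelector(
            Collections.<String>emptySet(),   // eager fields
            Collections.singleton(TEXT));     // lazy fields
    Document document = indexReader.document(scoreDoc.doc, selector);
    Fieldable textFld = document.getFieldable(TEXT);
    // stringValue() still reads the entire stored value when finally called:
    String text = textFld.stringValue();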
Some details:
1. Using Lucene 3.4.
2. Storing the (Tika) parsed text of documents.
   a. These are human-produced documents (PDF, Word, etc.), often 10K
      characters, sometimes 100Ks, but very occasionally a few million.
3. At this time, we store positions, but not offsets.
4. We are using the old Highlighter, not the FastVectorHighlighter
   (because of #3 above).
5. A basic search result is a page of 10 documents, each with a short
   "blurb" (one fragment that shows a good hit).
I would be willing to live with a token stream to generate the intro
blurb, but using the following code under the too-large code path
(forget the highlighting) can add 0.5 seconds, compared to not reading
anything at all, which is not a solution, just a baseline.
So here is my code.
    Fieldable textFld = doc.getFieldable(TEXT);
    if (fullTextLength <= EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT) {
        blurb = highlightBlurb(scoreDoc, document, textFld, workingBlurbLen);
    } else {
        logger.debug("----------- didn't call highlighter textLength = "
                + fullTextLength);
        // Re-tokenize the document and append terms until the blurb is
        // long enough.
        TokenStream tokenStream = TokenSources.getAnyTokenStream(
                indexReader, scoreDoc.doc, TEXT, document, analyzer);
        OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
        CharTermAttribute charTerm = tokenStream.addAttribute(CharTermAttribute.class);
        StringBuilder blurbB = new StringBuilder();
        while (tokenStream.incrementToken() && blurbB.length() < workingBlurbLen) {
            blurbB.append(charTerm.toString());
            blurbB.append(" ");
        }
        blurb = blurbB.toString();
    }
What could I do in the else branch that is faster? Is not having
offsets affecting this code path?
While you're answering the above, I will be running some stats to
suggest to management why we SHOULD store offsets, so we can use the
FastVectorHighlighter, but I'm afraid I might still want the
too-huge-to-highlight path.
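If we do re-index for that, my understanding is the text field would
need term vectors with both positions and offsets, roughly like this
(field name and Store choice are just placeholders):

    doc.add(new Field(TEXT, text, Field.Store.YES, Field.Index.ANALYZED,
            Field.TermVector.WITH_POSITIONS_OFFSETS));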
-Paul
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]