Some of the data sets that we will be using have about 2 TB of data (90 million
web pages).  I would like the snippets I generate to include the words that
are being queried, so I don't want to simply store the first 2 or 3 lines.  I
have looked at HighlighterTest and I believe it requires the entire text of
the document.  However, unlike the highlighter, I already know the term
offsets in the document.

The input to my snippet code will be a vector of query words and their offsets
in the document (not their positions in the document).  I'm reading about the
"term vectors" option I can store while indexing my data.  It seems to be much
more efficient than storing the entire document; I'm just not sure whether a
"term offset" is the same as a "token offset".  Here's what I'm reading, in
case I'm totally off base here and this is useless to me:

http://lucene.apache.org/java/docs/fileformats.html#Term%20Vectors

It seems like this has all the information that I would have if I tokenized
the document anyway, or am I missing something?
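
Just to check my understanding, here is a rough sketch of what I think the
indexing and reading side would look like with term vectors (Lucene 2.x API;
the field name "text", the index path and the sample text are placeholders I
made up, so please correct me if this is not the right way to use it):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;

public class TermVectorSketch {
    public static void main(String[] args) throws Exception {
        String indexDir = "/tmp/tv-index";   // placeholder path

        // Index the text with term vectors that keep positions and character
        // offsets, but do not store the document text itself.
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("text", "some example page text about lucene snippets",
                Field.Store.NO, Field.Index.TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
        writer.close();

        // Read the term vector back: each term comes with its offsets.
        IndexReader reader = IndexReader.open(indexDir);
        TermPositionVector tpv =
                (TermPositionVector) reader.getTermFreqVector(0, "text");
        int idx = tpv.indexOf("lucene");          // one of the query words
        TermVectorOffsetInfo[] offsets = tpv.getOffsets(idx);
        for (int i = 0; i < offsets.length; i++) {
            System.out.println("lucene @ chars " + offsets[i].getStartOffset()
                    + "-" + offsets[i].getEndOffset());
        }
        reader.close();
    }
}

If I'm reading the file format page right, those start/end offsets are the
same character offsets the tokenizer reports, which is exactly what my snippet
code needs.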

Thanks again for all the help!

--JP




On 7/16/07, Ard Schrijvers <[EMAIL PROTECTED]> wrote:

Hello,

> Ard,
>
> I do have access to the URLs of the documents, but because I will be making
> short snippets for many pages (suppose there are about 20 hits per page and
> I need to make snippets for each of them), I was worried it would be
> inefficient to open each "hit", tokenize it, and then make the snippet; of

Yes, getting all the documents over HTTP just to get the snippet, for example
the first 2 lines, is really bad for your performance in search result
overviews.

Logically, whatever you want to show, you need to store in your index. For
example, if for search hits you need to show the title and subtitle, just
store those two fields in the index. If you want a Google-like highlighter
with text snippets around where the terms occurred, you need to store the
entire text IIRC (see HighlighterTest in Lucene).
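
The usual pattern from HighlighterTest looks roughly like this (just a
sketch, assuming the full text is stored in a field called "text" and that
you want at most 3 fragments):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class SnippetSketch {
    // Builds a Google-like snippet; note that it needs the full stored text
    // of the hit.
    static String snippet(Query query, Analyzer analyzer, String storedText)
            throws IOException {
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        TokenStream tokens =
                analyzer.tokenStream("text", new StringReader(storedText));
        return highlighter.getBestFragments(tokens, storedText, 3, " ... ");
    }
}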

How many docs are you talking about, such that you cannot store the entire
content?

You could also just index the content without storing it, and in another
Lucene field store the first 2 or 3 lines of the document to serve as the
text snippet. Making good extracts for text snippets is very hard (see
LingPipe, for example).
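
Concretely, something like this (a sketch; the field names are just
examples):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TwoFieldSketch {
    static void addPage(IndexWriter writer, String fullText, String firstLines)
            throws IOException {
        Document doc = new Document();
        // full content: searchable, but not stored (keeps the index small)
        doc.add(new Field("content", fullText,
                Field.Store.NO, Field.Index.TOKENIZED));
        // first 2 or 3 lines: stored as-is for display, not searched
        doc.add(new Field("snippet", firstLines,
                Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
    }
}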

Regards Ard

> course, this may be worth the price of the increased index size.  I have
> been looking into storing "Field Vectors with positions" in the index.  It
> seems that by doing this I will have access to everything that the
> Tokenizer is giving me, correct?  Will I need to store "term text" in order
> to be able to access the actual term instead of stemmed words?
>
> Thanks for all your help,
>
> --JP
>
> On 7/13/07, Ard Schrijvers <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
> >
> > > I'm wondering if after
> > > opening the
> > > index I can retrieve the Tokens (not the terms) of a
> > > document, something
> > > akin to IndexReader.Document(n).getTokenizer().
> >
> > It is obviously not possible to get the original tokens of the document
> > back when you haven't stored the document, because:
> >
> > 1) the analyzer might have removed stop words in the first place
> > 2) the terms in the Lucene index are perhaps stemmed words / synonyms /
> > etc.
> > 3) how would you expect things like spaces, commas, dots, etc. to be
> > restored?
> >
> > And, I think what you want does not really fit an inverted index. When
> > you do not store the document, you always lose information about the
> > document during indexing/analyzing.
> >
> > How many documents are you talking about? They must be either somewhere
> > on the FS or accessible over HTTP... when you need the document, why not
> > just provide a link to the original location?
> >
> > Regards Ard
> >
> > >
> > > In summary:
> > >
> > > My current (too wasteful) implementation is this:
> > >
> > > new StandardTokenizer(new StringReader(
> > >     indexReader.document(n).get("text")))
> > >
> > > I'm wondering if Lucene has a more efficient manner to retrieve the
> > > tokens of a document from an index, because it seems like it already
> > > has information about every "term", since you can retrieve a
> > > TermPositions object.
> > >
> > > Thanks,
> > >
> > >
> > > --JP
> > >
> >
> >
>
