The answer was of course in the FAQ - http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-3558e5121806fb4fce80fc022d889484a9248b71
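For anyone else landing here, this is a minimal sketch of what the FAQ suggests - raising the per-field token limit before adding documents. It assumes the Lucene 1.9/2.0-era IndexWriter API; the index path "/tmp/index" and field name "body" are made-up examples, so adjust the analyzer and directory to suit:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class RaiseLimit {
        public static void main(String[] args) throws IOException {
            // "/tmp/index" and the field name "body" are illustrative only.
            IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
            // The default is DEFAULT_MAX_FIELD_LENGTH (10,000 tokens per field);
            // tokens beyond the limit are silently dropped at indexing time.
            writer.setMaxFieldLength(Integer.MAX_VALUE);

            Document doc = new Document();
            doc.add(new Field("body", "the full text of the large document ...",
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();
        }
    }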
Breaking large documents into manageable chunks isn't ideal. I need to index e-mail with attachments, which are frequently large. Currently each message part corresponds to a Lucene Document, but that means I am discarding terms beyond maxFieldLength. Having to span a message part across multiple Lucene Documents is ugly for various reasons - e.g. a search returns multiple Documents with different relevance, but more than one of them refers to the same message part.

Two thoughts:

(1) If the sentence "XX YY XX ZZ XX" were indexed, does that count as 3 terms in this context or 5? If repeated terms are not counted, I can probably cope by increasing the heap size and raising maxFieldLength to deal with realistic vocabularies, and I ought to be able to cope with most large documents.

(2) Lucene wishlist thought... Would it be realistic to have an option for Field indexing which isn't entirely in RAM? The client code knows when a Field is going to be a big one, because it can look at the file size before passing the java.io.Reader to the Field. If we could have a flag in Field that says "do this the slow way, because the calling code already knows it is a big one", and Otis, Eric & Co could work their magic, we could perhaps have large Lucene Documents without running out of heap space. maxFieldLength = -1 could perhaps denote what's needed?

-----Original Message-----
From: Rob Staveley (Tom) [mailto:[EMAIL PROTECTED]
Sent: 10 June 2006 07:22
To: java-user@lucene.apache.org
Subject: RE: Problems indexing large documents

I'm trying to come to terms with
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
too. I've been attempting to index large text files as single Lucene documents, passing them as java.io.Reader to cope with RAM. I was assuming (like - I suspect - manu mohedano) that an unstored field could be of any length and that maxFieldLength only applied to stored fields. Do we in fact need to break the document into manageable parts?

-----Original Message-----
From: Pasha Bizhan [mailto:[EMAIL PROTECTED]
Sent: 09 June 2006 21:35
To: java-user@lucene.apache.org
Subject: RE: Problems indexing large documents

Hi,

> From: manu mohedano [mailto:[EMAIL PROTECTED]
> Hi all! I'm having trouble... When I index text documents in English,
> there is no problem, but when I index Spanish text documents (and they're
> big), a lot of information from the document doesn't get indexed (I
> suppose it is due to the Analyzer, but if the document is less than 400 KB
> it works perfectly). However, I want to index ALL the strings in the
> document with no stop words. Is this possible?

Read the javadoc for DEFAULT_MAX_FIELD_LENGTH at
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)

Pasha Bizhan
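On the java.io.Reader point raised above: a Reader-valued field at least avoids building the whole document text as one String in client code, although with the 1.9/2.0 API the tokens still pass through the indexer and maxFieldLength still applies, which is exactly the truncation seen in this thread. A minimal sketch, assuming the Field(String, Reader) constructor and a made-up file name:

    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ReaderField {
        public static void main(String[] args) throws IOException {
            IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
            // Still needed: without this, tokens past the default limit are dropped
            // even when the field content comes from a Reader.
            writer.setMaxFieldLength(Integer.MAX_VALUE);

            Document doc = new Document();
            // Field(String, Reader) is tokenized and unstored; "attachment.txt" is illustrative.
            doc.add(new Field("body", new FileReader("attachment.txt")));
            writer.addDocument(doc);
            writer.close();
        }
    }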