The answer was of course in the FAQ - http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-3558e5121806fb4fce80fc022d889484a9248b71
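For anyone else landing here, this is a minimal sketch of what the FAQ suggests - raising the per-field token limit before adding documents. It assumes the Lucene 1.9/2.0-era IndexWriter API; the index path "/tmp/index" and field name "body" are made-up examples, so adjust the analyzer and directory to suit:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class RaiseLimit {
        public static void main(String[] args) throws IOException {
            // "/tmp/index" and the field name "body" are illustrative only.
            IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
            // The default is DEFAULT_MAX_FIELD_LENGTH (10,000 tokens per field);
            // tokens beyond the limit are silently dropped at indexing time.
            writer.setMaxFieldLength(Integer.MAX_VALUE);

            Document doc = new Document();
            doc.add(new Field("body", "the full text of the large document ...",
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();
        }
    }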
Breaking large documents into manageable chunks isn't ideal. I need to index e-mail with attachments, which are frequently large. Currently each message part corresponds to a Lucene Document, but that means I am discarding terms beyond maxFieldLength. Having to span a message part across multiple Lucene Documents is ugly for various reasons - e.g. a search returns multiple Documents with different relevance, but more than one of them refers to the same message part.

Two thoughts:

(1) If the sentence "XX YY XX ZZ XX" were indexed, does that count as 3 terms in this context or 5? If repeated terms are not counted, I can probably cope by increasing the heap size and raising maxFieldLength to deal with realistic vocabularies, and I ought to be able to cope with most large documents.

(2) Lucene wishlist thought... Would it be realistic to have an option for Field indexing which isn't entirely in RAM? The client code knows when a Field is going to be a big one, because it can look at the file size before passing the java.io.Reader to the Field. If we could have a flag in Field that says "do this the slow way, because the calling code already knows it is a big one", and Otis, Eric & Co could work their magic, we could perhaps have large Lucene Documents without running out of heap space. maxFieldLength = -1 could perhaps denote what's needed?

-----Original Message-----
From: Rob Staveley (Tom) [mailto:[EMAIL PROTECTED]
Sent: 10 June 2006 07:22
To: java-user@lucene.apache.org
Subject: RE: Problems indexing large documents

I'm trying to come to terms with
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
too. I've been attempting to index large text files as single Lucene documents, passing them as java.io.Reader to cope with RAM. I was assuming (like - I suspect - manu mohedano) that an unstored field could be of any length and that maxFieldLength only applied to stored fields. Do we in fact need to break the document into manageable parts?

-----Original Message-----
From: Pasha Bizhan [mailto:[EMAIL PROTECTED]
Sent: 09 June 2006 21:35
To: java-user@lucene.apache.org
Subject: RE: Problems indexing large documents

Hi,

> From: manu mohedano [mailto:[EMAIL PROTECTED]
> Hi all! I'm having trouble... When I index text documents in English,
> there is no problem, but when I index Spanish text documents (and they're
> big), a lot of information from the document doesn't get indexed (I
> suppose it is due to the Analyzer, but if the document is less than 400 KB
> it works perfectly). However, I want to index ALL the strings in the
> document with no stop words. Is this possible?

Read the javadoc for DEFAULT_MAX_FIELD_LENGTH at
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)

Pasha Bizhan
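On the java.io.Reader point raised above: a Reader-valued field at least avoids building the whole document text as one String in client code, although with the 1.9/2.0 API the tokens still pass through the indexer and maxFieldLength still applies, which is exactly the truncation seen in this thread. A minimal sketch, assuming the Field(String, Reader) constructor and a made-up file name:

    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ReaderField {
        public static void main(String[] args) throws IOException {
            IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
            // Still needed: without this, tokens past the default limit are dropped
            // even when the field content comes from a Reader.
            writer.setMaxFieldLength(Integer.MAX_VALUE);

            Document doc = new Document();
            // Field(String, Reader) is tokenized and unstored; "attachment.txt" is illustrative.
            doc.add(new Field("body", new FileReader("attachment.txt")));
            writer.addDocument(doc);
            writer.close();
        }
    }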