Date: 2004-12-22T15:19:31 Editor: DanielNaber Wiki: Jakarta Lucene Wiki Page: LuceneFAQ URL: http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Comment: new FAQ -- still a bit work in progress

This FAQ is currently being worked on (2004-12-22); the update should be done in a few days.

Note that the [http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi old FAQ] isn't maintained anymore.

[[TableOfContents]]

== FAQ ==

=== General ===

==== What is the URL of Lucene's home page? ====

Lucene's home is at The Jakarta Project: http://jakarta.apache.org/lucene/.

==== Are there any mailing lists available? ====

There's a user list and a developer list, both available at http://jakarta.apache.org/site/mail2.html#Lucene

==== What Java version is required to run Lucene? ====

Lucene will run with JDK 1.1.8 and up.

==== Will Lucene work with my Java application? ====

Yes, Lucene has no external dependencies.

==== Where can I get the javadocs for the org.apache.lucene classes? ====

The docs for all the classes are available online at http://jakarta.apache.org/lucene/docs/api/. In addition, they are part of the standard distribution, and you can always recreate them by running `ant javadocs`.

==== Why can't I use Lucene with IBM JDK 1.3.1? ====

Apparently there is a bug in IBM's JIT code in JDK 1.3.1. To work around it, disable JIT for the `org.apache.lucene.store.OutputStream.writeInt` method by setting the following environment variable:

`JITC_COMPILEOPT=SKIP{org/apache/lucene/store/OutputStream}{writeInt}`

==== Where does the name Lucene come from? ====

Lucene is Doug Cutting's wife's middle name, and her maternal grandmother's first name.

==== Are there any alternatives to Lucene? ====

Besides commercial products, which we don't know much about, there is also [http://www.egothor.org Egothor].

==== Does Lucene have a web crawler? ====

No, but check out the [http://java-source.net/open-source/crawlers list of Open Source Crawlers in Java].

=== Searching ===

==== What wildcard search support is available from Lucene? ====

Lucene supports wildcard queries that allow you to perform searches such as ''book*'', which will find documents containing terms such as ''book'', ''bookstore'', ''booklet'', etc. Lucene refers to this type of query as a 'prefix query'.

Lucene also supports wildcard queries that allow you to place a wildcard in the middle of the query term. For instance, you could make searches like ''mi*pelling''. That will match both ''misspelling'', which is the correct way to spell this word, as well as ''mispelling'', which is a common spelling mistake.

Another wildcard character you can use is '?', a question mark. The ? will match a single character. This allows you to perform queries such as ''Bra?il''. Such a query will match both ''Brasil'' and ''Brazil''. Lucene refers to this type of query as a 'wildcard query'.

'''Note''': Leading wildcards (e.g. ''*ook'') are '''not''' supported by the QueryParser.

==== Is the QueryParser thread-safe? ====

The static `QueryParser.parse` method is thread-safe, because it creates a new `QueryParser` instance on each call. A single `QueryParser` instance, however, should not be shared between threads.

==== How do I restrict searches to only return results from a limited subset of documents in the index (e.g. for privacy reasons)? What is the best way to approach this? ====

The [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/QueryFilter.html QueryFilter] class is designed precisely for such cases.

Another way of doing it is the following: just before calling `IndexSearcher.search()`, add a clause to the query to exclude documents in categories not permitted for this search.
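As a minimal sketch of the second approach, against the Lucene 1.4 API: wrap the user's query in a `BooleanQuery` and add a prohibited clause for the restricted category. The `category` field, its `private` value, and the index path are hypothetical.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class RestrictedSearch
{
    public static void main(String[] args) throws Exception
    {
        // the query as entered by the user
        Query userQuery = QueryParser.parse("lucene", "contents", new StandardAnalyzer());

        // wrap it and forbid documents from the restricted category
        BooleanQuery restricted = new BooleanQuery();
        restricted.add(userQuery, true, false);      // required clause
        restricted.add(new TermQuery(new Term("category", "private")),
                       false, true);                 // prohibited clause

        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Hits hits = searcher.search(restricted);
        System.out.println(hits.length() + " permitted documents found");
        searcher.close();
    }
}
```

Because the restriction is prohibited, a user who explicitly requires the restricted term still gets no access to those documents, as described below.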
If you are restricting access with a prohibited term, and someone tries to require that term, then the prohibited restriction wins. If you are restricting access with a required term, and they try prohibiting that term, then they will get no documents in their search result.

As for deciding whether to use required or prohibited terms: if possible, you should choose the method that names the less frequent term. That will make queries faster.

==== What is the order of fields returned by Document.fields()? ====

Fields are returned in the same order they were added to the document.

==== How does one determine which documents do not have a certain term? ====

There is no direct way of doing that. You could add a term "x" to every document, and then search for "+x -y" to find all of the documents that don't have "y". Note that for large collections this would be slow because of the high term frequency of term "x".

==== How do I get the last document added that has a particular term? ====

Call:

`TermDocs td = reader.termDocs(term);`

where `reader` is an open `IndexReader`. Then advance through the enumeration; the last document number it returns is the last document added that contains the term.

==== Does MultiSearcher do anything particularly efficient to search multiple indices or does it simply search one after the other? ====

`MultiSearcher` searches indices sequentially. Use `ParallelMultiSearcher` as a searcher that performs multiple searches in parallel.

==== Is there a way to use a proximity operator (like near or within) with Lucene? ====

There is a variable called `slop` in `PhraseQuery` that allows you to perform NEAR/WITHIN-like queries.

By default, `slop` is set to 0 so that only exact phrases will match. However, you can alter the value using the `setSlop(int)` method.

When using QueryParser you can use this syntax to specify the slop: ''"doug cutting"~2'' will find documents that contain "doug cutting" as well as ones that contain "cutting doug".
==== Are Wildcard, Prefix, and Fuzzy queries case sensitive? ====

No. But note that, unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the `Analyzer`, which is the component that performs operations such as stemming.

The reason for skipping the `Analyzer` is that if you were searching for ''"dogs*"'' you would not want ''"dogs"'' first stemmed to ''"dog"'', since that would then match ''"dog*"'', which is not the intended query.

==== Why does IndexReader's maxDoc() return an 'incorrect' number of documents sometimes? ====

According to the Javadoc for the `IndexReader.maxDoc()` method, it ''"returns one greater than the largest possible document number"''.

In other words, the number returned by `maxDoc()` does not necessarily match the actual number of undeleted documents in the index.

Deleted documents do not get removed from the index immediately, unless you call `optimize()`.

==== Is there a way to get a text summary of an indexed document with Lucene? ====

You could store the document's summary in the index and then use the Highlighter from the sandbox.

==== Can I search an index while it is being optimized? ====

Yes, an index can be searched and optimized simultaneously.

==== Can I cache search results with Lucene? ====

Lucene does come with a simple cache mechanism if you use [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Filter.html Lucene Filters]. The classes to look at are [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/CachingWrapperFilter.html CachingWrapperFilter] and [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/QueryFilter.html QueryFilter].

==== Is the IndexSearcher thread-safe? ====

'''Yes''', IndexSearcher is thread-safe. Multiple search threads may access the index concurrently without any problems.

==== Is there a way to retrieve the original term positions during the search? ====

Yes, see the Javadoc for `IndexReader.termPositions()`.

==== How do I retrieve all the values of a particular field that exists within an index, across all documents? ====

The trick is to enumerate terms with that field. Terms are sorted first by field, then by text, so all terms with a given field are adjacent in enumerations. Term enumeration is also efficient.

{{{
TermEnum terms = null;
try
{
    terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));
    while ("FIELD-NAME-HERE".equals(terms.term().field()))
    {
        // ... collect terms.term().text() ...

        if (!terms.next())
            break;
    }
}
finally
{
    if (terms != null)
        terms.close();
}
}}}

==== Can Lucene do a "search within search", so that the second search is constrained by the results of the first query? ====

Yes. There are two primary options:

 * Use `QueryFilter` with the previous query as the filter. (You can search the mailing list archives for `QueryFilter` and Doug Cutting's recommendations against using it for this purpose.)
 * Combine the previous query with the current query using `BooleanQuery`, using the previous query as required.

The `BooleanQuery` approach is the recommended one.

==== Does the position of the matches in the text affect the scoring? ====

No, the position of matches within a field does not affect ranking.

==== How do I make sure that a match in a document title has greater weight than a match in a document body? ====

If you put the title in a separate field from the body, and search both fields, matches in the title will usually be stronger without explicit boosting. This is because the scores are normalized by the length of the field, and the title tends to be much shorter than the body. Therefore, even without boosting, title matches usually come before body matches.

=== Indexing ===

==== Can I use Lucene to crawl my site or other sites on the Internet? ====

No.
Lucene does not know how to access external documents, nor does it know how to extract the content and links of HTML and other document formats. Lucene focuses on indexing and searching, and does that well. However, several crawlers are available which you could use: [http://java-source.net/open-source/crawlers list of Open Source Crawlers in Java]

==== How do I perform a simple indexing of a set of documents? ====

The easiest way is to re-index the entire document set periodically or whenever it changes. All you need to do is create an instance of IndexWriter, iterate over your document set, create a Lucene Document object for each document, and add it to the IndexWriter. When you are done, make sure to close the IndexWriter. This will release all of its resources and close the files it created.

==== How can I add document(s) to the index? ====

Simply create an IndexWriter and use its addDocument() method. Make sure to create the IndexWriter with the 'create' flag set to false, and make sure to close the IndexWriter when you are done adding the documents.

==== Where does Lucene store the index it builds? ====

Typically, the index is stored in a set of files that Lucene creates in a directory of your choice. If your system uses multiple independent indices, simply create a separate directory for each index.

Lucene's API also provides a way to use or implement other storage methods, such as non-persistent in-memory storage, or a mapping of Lucene data to any third-party database.

==== Does Lucene store a full copy of the indexed documents? ====

It is up to you. You can tell Lucene which document information to use just for indexing and which document information to also store in the index (with or without indexing).

==== What happens when you IndexWriter.add() a document that is already in the index? Does it overwrite the previous document? ====

No, there will be multiple copies of the same document in the index.
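The basic indexing loop described above (create an IndexWriter, add a Document per source document, close the writer) can be sketched like this against the Lucene 1.4 API; the index path, field names, and sample content are made up for illustration.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SimpleIndexer
{
    public static void main(String[] args) throws Exception
    {
        // 'true' creates the index from scratch; use 'false' to add to an existing one
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), true);
        String[][] docs = {
            { "doc1", "the quick brown fox" },
            { "doc2", "jumps over the lazy dog" }
        };
        for (int i = 0; i < docs.length; i++) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", docs[i][0]));    // stored, indexed, not tokenized
            doc.add(Field.Text("contents", docs[i][1])); // stored, indexed, tokenized
            writer.addDocument(doc);
        }
        writer.optimize(); // optional: merge segments for faster searching
        writer.close();    // releases resources and the write lock
    }
}
```

Storing a unique ID in a `Keyword` field, as here, is also what makes later deletion by term (see below) straightforward.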
==== How do I delete documents from the index? ====

If you know the document number of a document that you want to delete, you may use:

`IndexReader.delete(docNum)`

That will delete the document numbered `docNum` from the index. Once a document is deleted it will not appear in `TermDocs` or `TermPositions` enumerations.

Attempts to read its fields with the `document` method will result in an error. The presence of this document may still be reflected in the `docFreq` statistic, though this will be corrected eventually as the index is further modified.

If you want to delete all (one or more) documents that contain a specific term, you may use:

`IndexReader.delete(Term)`

This is useful if one uses a document field to hold a unique ID string for the document. Then to delete such a document, one merely constructs a term with the appropriate field and the unique ID string as its text and passes it to this method. Because a variable number of documents can be affected by this call, the method returns the number of documents deleted.

==== Is there a way to limit the size of an index? ====

This question is sometimes brought up because of the 2GB file size limit of some 32-bit operating systems.

This is a slightly modified answer from Doug Cutting:

The easiest thing is to set `IndexWriter.maxMergeDocs`.

If, for instance, you hit the 2GB limit at 8M documents, set `maxMergeDocs` to 7M. That will keep Lucene from trying to merge an index that won't fit in your filesystem. It will effectively round this down to the next lower power of `IndexWriter.mergeFactor`.

So with the default `mergeFactor` set to 10 and `maxMergeDocs` set to 7M, Lucene will generate a series of 1M document indexes, since merging 10 of these would exceed the maximum.

A slightly more complex solution:

You could further minimize the number of segments: when you've added 7M documents, optimize the index and start a new index.
Then use `MultiSearcher` to search the indexes.

An even more complex and optimal solution:

Write a version of `FSDirectory` that, when a file exceeds 2GB, creates a subdirectory and represents the file as a series of files.

==== Why is it important to use the same analyzer type during indexing and search? ====

The analyzer controls how the text is broken into terms, which are then used to index the document. If you use an analyzer of one type to index and an analyzer of a different type to parse the search query, it is possible that the same word will be mapped to two different terms, and this will result in missing or false hits.

==== What is index optimization and when should I use it? ====

The IndexWriter class supports an optimize() method that compacts the index database and speeds up queries. You may want to use this method after performing a complete indexing of your document set or after incremental updates of the index. If your incremental updates add documents frequently, you will want to perform the optimization only once in a while to avoid the extra overhead of the optimization.

==== What are Segments? ====

The index database is composed of 'segments', each stored in a separate file. When you add documents to the index, new segments may be created. You can compact the database and reduce the number of segments by optimizing it (see the separate question regarding index optimization).

==== Is the Lucene index database platform independent? ====

Yes, you can copy a Lucene index directory from one platform to another and it will work just as well.

==== When I recreate an index from scratch, do I have to delete the old index files? ====

No, creating the IndexWriter with "true" should remove all old files in the old index.

==== How can I index and search digits and other non-alphabetic characters? ====

The components responsible for this are the various `Analyzer` classes.

The demos included in the Lucene distribution use `StopAnalyzer`, which filters out non-alphabetic characters. To include non-alphabetic characters, such as digits and various punctuation characters, in your index, use `org.apache.lucene.analysis.standard.StandardAnalyzer` instead of `StopAnalyzer`.

==== Is the IndexWriter class, and especially the method addIndexes(Directory[]), thread safe? ====

Yes, the `IndexWriter.addIndexes(Directory[])` method is thread safe. It is a `final synchronized` method.

==== Do document IDs change after merging indices or after document deletion? ====

Yes, document IDs do change.

==== What is the purpose of the write.lock file, when is it used, and by which classes? ====

The write.lock is used to keep processes from concurrently attempting to modify an index.

It is obtained by an `IndexWriter` while it is open, and by an `IndexReader` once documents have been deleted and until it is closed.

==== What is the purpose of the commit.lock file, when is it used, and by which classes? ====

The commit.lock file is used to coordinate the contents of the 'segments' file with the files in the index. It is obtained by an `IndexReader` before it reads the 'segments' file, which names all of the other files in the index, and until the `IndexReader` has opened all of these other files.

The commit.lock is also obtained by the `IndexWriter` when it is about to write the segments file and until it has finished trying to delete obsolete index files.

The commit.lock should thus never be held for long, since while it is obtained files are only opened or deleted, and one small file is read or written.

==== Is there a maximum number of segment infos whose summary (name and document count) is stored in the segments file? ====

All segments in the index are listed in the segments file. There is no hard limit.
For an un-optimized index, the number of segments is proportional to the log of the number of documents in the index. An optimized index contains a single segment.

==== What happens when I open an IndexWriter, optimize the index, and then close the IndexWriter? Which files will be added or modified? ====

All of the segments are merged into a single new segment file. If the index was empty to begin with, no segments will be created, only the `segments` file.

==== If I decide not to optimize the index, when will the deleted documents actually get deleted? ====

Documents that are deleted no longer show up in search results. However, the space they consume in the index does not get reclaimed until the index is optimized. That space will also eventually be reclaimed as more documents are added to the index, even if the index does not get optimized.

==== How do I update a document or a set of documents that are already indexed? ====

There is no direct update procedure in Lucene. To update an index incrementally you must first '''delete''' the documents that were updated, and '''then re-add''' them to the index.

==== How do I write my own Analyzer? ====

Here is an example:

{{{
public class MyAnalyzer extends Analyzer
{
    private static final Analyzer STANDARD = new StandardAnalyzer();

    public TokenStream tokenStream(String field, final Reader reader)
    {
        // do not tokenize the field called 'element'
        if ("element".equals(field)) {
            return new CharTokenizer(reader) {
                protected boolean isTokenChar(char c) {
                    return true;
                }
            };
        } else {
            // use the standard analyzer
            return STANDARD.tokenStream(field, reader);
        }
    }
}
}}}

==== How do I index non-Latin characters? ====

The solution is to ensure that the query string is encoded the same way that the strings in the index are. For instance, something along the lines of this will work if your index also uses UTF-8 encoding.
{{{
// re-decode a query string whose bytes were originally read with a
// single-byte encoding (here ISO-8859-1 as an example) into proper UTF-8:
String queryStr = new String("query string here".getBytes("ISO-8859-1"), "UTF-8");
}}}

==== How can I index HTML documents? ====

In order to index HTML documents you need to first parse them to extract the text that you want to index. Here are some HTML parsers that can help you with that:

An example that uses JavaCC to parse HTML into Lucene Document objects is provided in the [http://jakarta.apache.org/lucene/docs/demo3.html Lucene web application demo] that comes with the Lucene distribution.

The [http://www.apache.org/~andyc/neko/doc/html/ CyberNeko HTML Parser] lets you parse HTML documents. It's relatively easy to remove most of the tags from an HTML document (or all, if you want), and then use the ones you left in to help create metadata for your Lucene document. NekoHTML also provides a DOM model for navigating through the HTML.

[http://jtidy.sourceforge.net/ JTidy] cleans up HTML, and can provide a DOM interface to the HTML files through a Java API.

==== How can I index XML documents? ====

In order to index XML documents you need to first parse them to extract the text that you want to index. Here are some XML parsers that can help you with that:

See the [http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/XML-Indexing-Demo/ XML Demo]. This contribution is some sample code that demonstrates adding simple XML documents to the index. It creates a new Document object for each file, and then recursively populates the Document with a Field for each XML element. There are examples included for both SAX and DOM.

See the article [http://www-106.ibm.com/developerworks/library/j-lucene/ Parsing, indexing, and searching XML with Digester and Lucene].

==== How can I index MS-Word documents? ====

In order to index Word documents you need to first parse them to extract the text that you want to index.
Here are some Word parsers that can help you with that:

[http://jakarta.apache.org/poi/ Jakarta Apache POI] has an early-development-stage Microsoft Word parser for versions of Word from Office 97, 2000, and XP.

The [http://www.textmining.org/ Simple Text Extractor Library] relies on POI.

==== How can I index MS-Excel documents? ====

In order to index Excel documents you need to first parse them to extract the text that you want to index. Here are some Excel parsers that can help you with that:

[http://jakarta.apache.org/poi/ Jakarta Apache POI] has an excellent Microsoft Excel parser for versions of Excel from Office 97, 2000, and XP. You can also modify Excel files with this tool.

==== How can I index MS-Powerpoint documents? ====

In order to index Powerpoint documents you need to first parse them to extract the text that you want to index. You can use [http://jakarta.apache.org/poi/ Jakarta Apache POI], as it contains a parser for Powerpoint documents.

==== How can I index RTF documents? ====

In order to index RTF documents you need to first parse them to extract the text that you want to index. Here are some RTF parsers that can help you with that:

Java's Swing library ships with `javax.swing.text.rtf.RTFEditorKit`, which can read an RTF document so that its plain text can be extracted.

[http://www.tetrasix.com/ MajiX] is a translation utility that turns RTF (Rich Text Format) files into XML files. These XML files could be indexed like any other XML file, or you could write some custom code. (This tool doesn't seem to be available anymore.)

==== How can I index PDF documents? ====

In order to index PDF documents you need to first parse them to extract the text that you want to index. Here are some PDF parsers that can help you with that:

[http://pdfbox.org/ PDFBox] is a Java API from Ben Litchfield that will let you access the contents of a PDF document. It comes with integration classes for Lucene to translate a PDF into a Lucene document.
[http://www.foolabs.com/xpdf/ XPDF] is an open source tool that is licensed under the GPL. It's not a Java tool, but it includes a utility called pdftotext that can translate PDF files into text files on most platforms from the command line.

Based on xpdf, there is a utility called [http://pdftohtml.sourceforge.net/ pdftohtml] that can translate PDF files into HTML files. This is also not a Java application.

[http://www.jpedal.org/ JPedal] is a Java API for extracting text and images from PDF documents.

==== How can I index JSP files? ====

To index the content of JSPs that a user would see using a Web browser, you would need to write an application that acts as a Web client, in order to mimic the Web browser behaviour (i.e. a web crawler). Once you have such an application, you should be able to point it to the desired JSP, retrieve the contents that the JSP generates, parse it, and feed it to Lucene. See the [http://java-source.net/open-source/crawlers list of Open Source Crawlers in Java].

How to parse the output of the JSP depends on the type of content that the JSP generates. In most cases the content is going to be in HTML format.

Most importantly, do not try to index JSPs by treating them as normal files in your file system. In order to index JSPs properly you need to access them via HTTP, acting like a Web client.

==== If I use a compound file-style index, do I still need to optimize my index? ====

Yes. Each .cfs file created in the compound file-style index represents a single segment, which means you can still merge multiple segments into a single segment by optimizing the index.

==== What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides them taking different arguments? ====

When merging many indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the IndexReader-based method requires that all indexes be open when passed.

The primary advantage of the IndexReader-based method is that one can pass it IndexReaders that don't reside in a Directory.

==== Can I use Lucene to index text in Chinese, Japanese, Korean, and other multi-byte character sets? ====

Yes, you can. Lucene is not limited to English, nor any other language. To index text properly, you need to use an Analyzer appropriate for the language of the text you are indexing. Lucene's default Analyzers work well for English. There are a number of other Analyzers in the [http://jakarta.apache.org/lucene/docs/lucene-sandbox/ Lucene Sandbox], including those for Chinese, Japanese, and Korean.