Date: 2004-12-22T15:19:31 Editor: DanielNaber Wiki: Jakarta Lucene Wiki Page: LuceneFAQ URL: http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Comment: new FAQ -- still a bit work in progress

This FAQ is currently being worked on (2004-12-22); the update should be done in a few days.

Note that the [http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi old FAQ] isn't maintained anymore.

[[TableOfContents]]

== FAQ ==

=== General ===

==== What is the URL of Lucene's home page? ====

Lucene's home is at The Jakarta Project: http://jakarta.apache.org/lucene/.

==== Are there any mailing lists available? ====

There's a user list and a developer list, both available at http://jakarta.apache.org/site/mail2.html#Lucene

==== What Java version is required to run Lucene? ====

Lucene will run with JDK 1.1.8 and up.

==== Will Lucene work with my Java application? ====

Yes, Lucene has no external dependencies.

==== Where can I get the javadocs for the org.apache.lucene classes? ====

The docs for all the classes are available online at http://jakarta.apache.org/lucene/docs/api/. In addition, they are part of the standard distribution, and you can always recreate them by running `ant javadocs`.

==== Why can't I use Lucene with IBM JDK 1.3.1? ====

Apparently there is a bug in IBM's JIT code in JDK 1.3.1. To work around it, disable JIT for the `org.apache.lucene.store.OutputStream.writeInt` method by setting the following environment variable:

`JITC_COMPILEOPT=SKIP{org/apache/lucene/store/OutputStream}{writeInt}`

==== Where does the name Lucene come from? ====

Lucene is Doug Cutting's wife's middle name, and her maternal grandmother's first name.

==== Are there any alternatives to Lucene? ====

Besides commercial products, which we don't know much about, there is also [http://www.egothor.org Egothor].

==== Does Lucene have a web crawler? ====

No, but check out the [http://java-source.net/open-source/crawlers list of Open Source Crawlers in Java].

=== Searching ===

==== What wildcard search support is available from Lucene? ====

Lucene supports wildcard queries that allow you to perform searches such as ''book*'', which will find documents containing terms such as ''book'', ''bookstore'', ''booklet'', etc. Lucene refers to this type of query as a 'prefix query'.

Lucene also supports wildcard queries that allow you to place a wildcard in the middle of the query term. For instance, you could make searches like ''mi*pelling''. That will match both ''misspelling'', which is the correct way to spell this word, as well as ''mispelling'', which is a common spelling mistake.

Another wildcard character you can use is '?', a question mark. The ? will match a single character. This allows you to perform queries such as ''Bra?il''. Such a query will match both ''Brasil'' and ''Brazil''. Lucene refers to this type of query as a 'wildcard query'.

'''Note''': Leading wildcards (e.g. ''*ook'') are '''not''' supported by the QueryParser.

==== Is the QueryParser thread-safe? ====

The static `QueryParser.parse` method is thread-safe, because it creates a new `QueryParser` instance on each call. A single `QueryParser` instance, however, should not be shared between threads.

==== How do I restrict searches to only return results from a limited subset of documents in the index (e.g. for privacy reasons)? What is the best way to approach this? ====

The [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/QueryFilter.html QueryFilter] class is designed precisely for such cases.

Another way of doing it is the following: just before calling `IndexSearcher.search()`, add a clause to the query to exclude documents in categories not permitted for this search.
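As a minimal sketch of the second approach, against the Lucene 1.4 API: wrap the user's query in a `BooleanQuery` and add a prohibited clause for the restricted category. The `category` field, its `private` value, and the index path are hypothetical.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class RestrictedSearch
{
    public static void main(String[] args) throws Exception
    {
        // the query as entered by the user
        Query userQuery = QueryParser.parse("lucene", "contents", new StandardAnalyzer());

        // wrap it and forbid documents from the restricted category
        BooleanQuery restricted = new BooleanQuery();
        restricted.add(userQuery, true, false);      // required clause
        restricted.add(new TermQuery(new Term("category", "private")),
                       false, true);                 // prohibited clause

        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Hits hits = searcher.search(restricted);
        System.out.println(hits.length() + " permitted documents found");
        searcher.close();
    }
}
```

Because the restriction is prohibited, a user who explicitly requires the restricted term still gets no access to those documents, as described below.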
If you are restricting access with a prohibited term, and someone tries to require that term, then the prohibited restriction wins. If you are restricting access with a required term, and they try prohibiting that term, then they will get no documents in their search result.

As for deciding whether to use required or prohibited terms: if possible, you should choose the method that names the less frequent term. That will make queries faster.

==== What is the order of fields returned by Document.fields()? ====

Fields are returned in the same order they were added to the document.

==== How does one determine which documents do not have a certain term? ====

There is no direct way of doing that. You could add a term "x" to every document, and then search for "+x -y" to find all of the documents that don't have "y". Note that for large collections this would be slow because of the high term frequency of term "x".

==== How do I get the last document added that has a particular term? ====

Call:

`TermDocs td = reader.termDocs(term);`

where `reader` is an open `IndexReader`. Then advance through the enumeration; the last document number it returns is the last document added that contains the term.

==== Does MultiSearcher do anything particularly efficient to search multiple indices or does it simply search one after the other? ====

`MultiSearcher` searches indices sequentially. Use `ParallelMultiSearcher` as a searcher that performs multiple searches in parallel.

==== Is there a way to use a proximity operator (like near or within) with Lucene? ====

There is a variable called `slop` in `PhraseQuery` that allows you to perform NEAR/WITHIN-like queries.

By default, `slop` is set to 0 so that only exact phrases will match. However, you can alter the value using the `setSlop(int)` method.

When using QueryParser you can use this syntax to specify the slop: ''"doug cutting"~2'' will find documents that contain "doug cutting" as well as ones that contain "cutting doug".
==== Are Wildcard, Prefix, and Fuzzy queries case sensitive? ====

No. But note that, unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the `Analyzer`, which is the component that performs operations such as stemming.

The reason for skipping the `Analyzer` is that if you were searching for ''"dogs*"'' you would not want ''"dogs"'' first stemmed to ''"dog"'', since that would then match ''"dog*"'', which is not the intended query.

==== Why does IndexReader's maxDoc() return an 'incorrect' number of documents sometimes? ====

According to the Javadoc for the `IndexReader.maxDoc()` method, it ''"returns one greater than the largest possible document number"''.

In other words, the number returned by `maxDoc()` does not necessarily match the actual number of undeleted documents in the index.

Deleted documents do not get removed from the index immediately, unless you call `optimize()`.

==== Is there a way to get a text summary of an indexed document with Lucene? ====

You could store the document's summary in the index and then use the Highlighter from the sandbox.

==== Can I search an index while it is being optimized? ====

Yes, an index can be searched and optimized simultaneously.

==== Can I cache search results with Lucene? ====

Lucene does come with a simple cache mechanism if you use [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Filter.html Lucene Filters]. The classes to look at are [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/CachingWrapperFilter.html CachingWrapperFilter] and [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/QueryFilter.html QueryFilter].

==== Is the IndexSearcher thread-safe? ====

'''Yes''', IndexSearcher is thread-safe. Multiple search threads may access the index concurrently without any problems.

==== Is there a way to retrieve the original term positions during the search? ====

Yes, see the Javadoc for `IndexReader.termPositions()`.

==== How do I retrieve all the values of a particular field that exists within an index, across all documents? ====

The trick is to enumerate terms with that field. Terms are sorted first by field, then by text, so all terms with a given field are adjacent in enumerations. Term enumeration is also efficient.

{{{
TermEnum terms = null;
try
{
    terms = indexReader.terms(new Term("FIELD-NAME-HERE", ""));
    while ("FIELD-NAME-HERE".equals(terms.term().field()))
    {
        // ... collect terms.term().text() ...

        if (!terms.next())
            break;
    }
}
finally
{
    if (terms != null)
        terms.close();
}
}}}

==== Can Lucene do a "search within search", so that the second search is constrained by the results of the first query? ====

Yes. There are two primary options:

 * Use `QueryFilter` with the previous query as the filter. (You can search the mailing list archives for `QueryFilter` and Doug Cutting's recommendations against using it for this purpose.)
 * Combine the previous query with the current query using `BooleanQuery`, using the previous query as required.

The `BooleanQuery` approach is the recommended one.

==== Does the position of the matches in the text affect the scoring? ====

No, the position of matches within a field does not affect ranking.

==== How do I make sure that a match in a document title has greater weight than a match in a document body? ====

If you put the title in a separate field from the body, and search both fields, matches in the title will usually be stronger without explicit boosting. This is because the scores are normalized by the length of the field, and the title tends to be much shorter than the body. Therefore, even without boosting, title matches usually come before body matches.

=== Indexing ===

==== Can I use Lucene to crawl my site or other sites on the Internet? ====

No.
Lucene does not know how to access external documents, nor does it know how to extract the content and links of HTML and other document formats. Lucene focuses on indexing and searching, and does that well. However, several crawlers are available which you could use: [http://java-source.net/open-source/crawlers list of Open Source Crawlers in Java]

==== How do I perform a simple indexing of a set of documents? ====

The easiest way is to re-index the entire document set periodically or whenever it changes. All you need to do is create an instance of IndexWriter, iterate over your document set, create a Lucene Document object for each document, and add it to the IndexWriter. When you are done, make sure to close the IndexWriter. This will release all of its resources and close the files it created.

==== How can I add document(s) to the index? ====

Simply create an IndexWriter and use its addDocument() method. Make sure to create the IndexWriter with the 'create' flag set to false, and make sure to close the IndexWriter when you are done adding the documents.

==== Where does Lucene store the index it builds? ====

Typically, the index is stored in a set of files that Lucene creates in a directory of your choice. If your system uses multiple independent indices, simply create a separate directory for each index.

Lucene's API also provides a way to use or implement other storage methods, such as non-persistent in-memory storage, or a mapping of Lucene data to any third-party database.

==== Does Lucene store a full copy of the indexed documents? ====

It is up to you. You can tell Lucene which document information to use just for indexing and which document information to also store in the index (with or without indexing).

==== What happens when you IndexWriter.add() a document that is already in the index? Does it overwrite the previous document? ====

No, there will be multiple copies of the same document in the index.
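The basic indexing loop described above (create an IndexWriter, add a Document per source document, close the writer) can be sketched like this against the Lucene 1.4 API; the index path, field names, and sample content are made up for illustration.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SimpleIndexer
{
    public static void main(String[] args) throws Exception
    {
        // 'true' creates the index from scratch; use 'false' to add to an existing one
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), true);
        String[][] docs = {
            { "doc1", "the quick brown fox" },
            { "doc2", "jumps over the lazy dog" }
        };
        for (int i = 0; i < docs.length; i++) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", docs[i][0]));    // stored, indexed, not tokenized
            doc.add(Field.Text("contents", docs[i][1])); // stored, indexed, tokenized
            writer.addDocument(doc);
        }
        writer.optimize(); // optional: merge segments for faster searching
        writer.close();    // releases resources and the write lock
    }
}
```

Storing a unique ID in a `Keyword` field, as here, is also what makes later deletion by term (see below) straightforward.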
==== How do I delete documents from the index? ====

If you know the document number of a document that you want to delete, you may use:

`IndexReader.delete(docNum)`

That will delete the document numbered `docNum` from the index. Once a document is deleted it will not appear in `TermDocs` or `TermPositions` enumerations.

Attempts to read its fields with the `document` method will result in an error. The presence of this document may still be reflected in the `docFreq` statistic, though this will be corrected eventually as the index is further modified.

If you want to delete all (one or more) documents that contain a specific term, you may use:

`IndexReader.delete(Term)`

This is useful if one uses a document field to hold a unique ID string for the document. Then to delete such a document, one merely constructs a term with the appropriate field and the unique ID string as its text and passes it to this method. Because a variable number of documents can be affected by this call, the method returns the number of documents deleted.

==== Is there a way to limit the size of an index? ====

This question is sometimes brought up because of the 2GB file size limit of some 32-bit operating systems.

This is a slightly modified answer from Doug Cutting:

The easiest thing is to set `IndexWriter.maxMergeDocs`.

If, for instance, you hit the 2GB limit at 8M documents, set `maxMergeDocs` to 7M. That will keep Lucene from trying to merge an index that won't fit in your filesystem. It will effectively round this down to the next lower power of `IndexWriter.mergeFactor`.

So with the default `mergeFactor` set to 10 and `maxMergeDocs` set to 7M, Lucene will generate a series of 1M document indexes, since merging 10 of these would exceed the maximum.

A slightly more complex solution:

You could further minimize the number of segments: when you've added 7M documents, optimize the index and start a new index.
Then use `MultiSearcher` to search the indexes.

An even more complex and optimal solution:

Write a version of `FSDirectory` that, when a file exceeds 2GB, creates a subdirectory and represents the file as a series of files.

==== Why is it important to use the same analyzer type during indexing and search? ====

The analyzer controls how the text is broken into terms, which are then used to index the document. If you use an analyzer of one type to index and an analyzer of a different type to parse the search query, it is possible that the same word will be mapped to two different terms, and this will result in missing or false hits.

==== What is index optimization and when should I use it? ====

The IndexWriter class supports an optimize() method that compacts the index database and speeds up queries. You may want to use this method after performing a complete indexing of your document set or after incremental updates of the index. If your incremental updates add documents frequently, you will want to perform the optimization only once in a while to avoid the extra overhead of the optimization.

==== What are Segments? ====

The index database is composed of 'segments', each stored in a separate file. When you add documents to the index, new segments may be created. You can compact the database and reduce the number of segments by optimizing it (see the separate question regarding index optimization).

==== Is the Lucene index database platform independent? ====

Yes, you can copy a Lucene index directory from one platform to another and it will work just as well.

==== When I recreate an index from scratch, do I have to delete the old index files? ====

No, creating the IndexWriter with "true" should remove all old files in the old index.

==== How can I index and search digits and other non-alphabetic characters? ====

The components responsible for this are the various `Analyzer` classes.

The demos included in the Lucene distribution use `StopAnalyzer`, which filters out non-alphabetic characters. To include non-alphabetic characters, such as digits and various punctuation characters, in your index, use `org.apache.lucene.analysis.standard.StandardAnalyzer` instead of `StopAnalyzer`.

==== Is the IndexWriter class, and especially the method addIndexes(Directory[]), thread safe? ====

Yes, the `IndexWriter.addIndexes(Directory[])` method is thread safe. It is a `final synchronized` method.

==== Do document IDs change after merging indices or after document deletion? ====

Yes, document IDs do change.

==== What is the purpose of the write.lock file, when is it used, and by which classes? ====

The write.lock is used to keep processes from concurrently attempting to modify an index.

It is obtained by an `IndexWriter` while it is open, and by an `IndexReader` once documents have been deleted and until it is closed.

==== What is the purpose of the commit.lock file, when is it used, and by which classes? ====

The commit.lock file is used to coordinate the contents of the 'segments' file with the files in the index. It is obtained by an `IndexReader` before it reads the 'segments' file, which names all of the other files in the index, and until the `IndexReader` has opened all of these other files.

The commit.lock is also obtained by the `IndexWriter` when it is about to write the segments file and until it has finished trying to delete obsolete index files.

The commit.lock should thus never be held for long, since while it is obtained files are only opened or deleted, and one small file is read or written.

==== Is there a maximum number of segment infos whose summary (name and document count) is stored in the segments file? ====

All segments in the index are listed in the segments file. There is no hard limit.
For an un-optimized index, the number of segments is proportional to the log of the number of documents in the index. An optimized index contains a single segment.

==== What happens when I open an IndexWriter, optimize the index, and then close the IndexWriter? Which files will be added or modified? ====

All of the segments are merged into a single new segment file. If the index was empty to begin with, no segments will be created, only the `segments` file.

==== If I decide not to optimize the index, when will the deleted documents actually get deleted? ====

Documents that are deleted no longer show up in search results. However, the space they consume in the index does not get reclaimed until the index is optimized. That space will also eventually be reclaimed as more documents are added to the index, even if the index does not get optimized.

==== How do I update a document or a set of documents that are already indexed? ====

There is no direct update procedure in Lucene. To update an index incrementally you must first '''delete''' the documents that were updated, and '''then re-add''' them to the index.

==== How do I write my own Analyzer? ====

Here is an example:

{{{
public class MyAnalyzer extends Analyzer
{
    private static final Analyzer STANDARD = new StandardAnalyzer();

    public TokenStream tokenStream(String field, final Reader reader)
    {
        // do not tokenize the field called 'element'
        if ("element".equals(field)) {
            return new CharTokenizer(reader) {
                protected boolean isTokenChar(char c) {
                    return true;
                }
            };
        } else {
            // use the standard analyzer
            return STANDARD.tokenStream(field, reader);
        }
    }
}
}}}

==== How do I index non-Latin characters? ====

The solution is to ensure that the query string is encoded the same way that the strings in the index are. For instance, something along the lines of this will work if your index also uses UTF-8 encoding.
{{{
// re-decode a query string whose bytes were originally read with a
// single-byte encoding (here ISO-8859-1 as an example) into proper UTF-8:
String queryStr = new String("query string here".getBytes("ISO-8859-1"), "UTF-8");
}}}

==== How can I index HTML documents? ====

In order to index HTML documents you need to first parse them to extract the text that you want to index. Here are some HTML parsers that can help you with that:

An example that uses JavaCC to parse HTML into Lucene Document objects is provided in the [http://jakarta.apache.org/lucene/docs/demo3.html Lucene web application demo] that comes with the Lucene distribution.

The [http://www.apache.org/~andyc/neko/doc/html/ CyberNeko HTML Parser] lets you parse HTML documents. It's relatively easy to remove most of the tags from an HTML document (or all, if you want), and then use the ones you left in to help create metadata for your Lucene document. NekoHTML also provides a DOM model for navigating through the HTML.

[http://jtidy.sourceforge.net/ JTidy] cleans up HTML, and can provide a DOM interface to the HTML files through a Java API.

==== How can I index XML documents? ====

In order to index XML documents you need to first parse them to extract the text that you want to index. Here are some XML parsers that can help you with that:

See the [http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/XML-Indexing-Demo/ XML Demo]. This contribution is some sample code that demonstrates adding simple XML documents to the index. It creates a new Document object for each file, and then recursively populates the Document with a Field for each XML element. There are examples included for both SAX and DOM.

See the article [http://www-106.ibm.com/developerworks/library/j-lucene/ Parsing, indexing, and searching XML with Digester and Lucene].

==== How can I index MS-Word documents? ====

In order to index Word documents you need to first parse them to extract the text that you want to index.
Here are some Word parsers that can help you with that:

[http://jakarta.apache.org/poi/ Jakarta Apache POI] has an early-development-stage Microsoft Word parser for versions of Word from Office 97, 2000, and XP.

The [http://www.textmining.org/ Simple Text Extractor Library] relies on POI.

==== How can I index MS-Excel documents? ====

In order to index Excel documents you need to first parse them to extract the text that you want to index. Here are some Excel parsers that can help you with that:

[http://jakarta.apache.org/poi/ Jakarta Apache POI] has an excellent Microsoft Excel parser for versions of Excel from Office 97, 2000, and XP. You can also modify Excel files with this tool.

==== How can I index MS-Powerpoint documents? ====

In order to index Powerpoint documents you need to first parse them to extract the text that you want to index. You can use [http://jakarta.apache.org/poi/ Jakarta Apache POI], as it contains a parser for Powerpoint documents.

==== How can I index RTF documents? ====

In order to index RTF documents you need to first parse them to extract the text that you want to index. Here are some RTF parsers that can help you with that:

Java's Swing library ships with `javax.swing.text.rtf.RTFEditorKit`, which can read an RTF document so that its plain text can be extracted.

[http://www.tetrasix.com/ MajiX] is a translation utility that turns RTF (Rich Text Format) files into XML files. These XML files could be indexed like any other XML file, or you could write some custom code. (This tool doesn't seem to be available anymore.)

==== How can I index PDF documents? ====

In order to index PDF documents you need to first parse them to extract the text that you want to index. Here are some PDF parsers that can help you with that:

[http://pdfbox.org/ PDFBox] is a Java API from Ben Litchfield that will let you access the contents of a PDF document. It comes with integration classes for Lucene to translate a PDF into a Lucene document.
[http://www.foolabs.com/xpdf/ XPDF] is an open source tool that is licensed under the GPL. It's not a Java tool, but it includes a utility called pdftotext that can translate PDF files into text files on most platforms from the command line.

Based on xpdf, there is a utility called [http://pdftohtml.sourceforge.net/ pdftohtml] that can translate PDF files into HTML files. This is also not a Java application.

[http://www.jpedal.org/ JPedal] is a Java API for extracting text and images from PDF documents.

==== How can I index JSP files? ====

To index the content of JSPs that a user would see using a Web browser, you would need to write an application that acts as a Web client, in order to mimic the Web browser behaviour (i.e. a web crawler). Once you have such an application, you should be able to point it to the desired JSP, retrieve the contents that the JSP generates, parse it, and feed it to Lucene. See the [http://java-source.net/open-source/crawlers list of Open Source Crawlers in Java].

How to parse the output of the JSP depends on the type of content that the JSP generates. In most cases the content is going to be in HTML format.

Most importantly, do not try to index JSPs by treating them as normal files in your file system. In order to index JSPs properly you need to access them via HTTP, acting like a Web client.

==== If I use a compound file-style index, do I still need to optimize my index? ====

Yes. Each .cfs file created in the compound file-style index represents a single segment, which means you can still merge multiple segments into a single segment by optimizing the index.

==== What is the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]), besides them taking different arguments? ====

When merging many indexes (more than the mergeFactor), the Directory-based method will use fewer file handles and less memory, as it will only ever open mergeFactor indexes at once, while the IndexReader-based method requires that all indexes be open when passed.

The primary advantage of the IndexReader-based method is that one can pass it IndexReaders that don't reside in a Directory.

==== Can I use Lucene to index text in Chinese, Japanese, Korean, and other multi-byte character sets? ====

Yes, you can. Lucene is not limited to English, nor any other language. To index text properly, you need to use an Analyzer appropriate for the language of the text you are indexing. Lucene's default Analyzers work well for English. There are a number of other Analyzers in the [http://jakarta.apache.org/lucene/docs/lucene-sandbox/ Lucene Sandbox], including those for Chinese, Japanese, and Korean.