Brian Rook <[EMAIL PROTECTED]> writes:
> The site I'm working on has a lot of small html files that are used for page
> construction (nav bars, footers, etc) and they're being returned high in the
> results because they contain the search term(s) I'm looking for and are
> small so they rank higher than larger documents.
>
> I want to exclude them from the index and I've come up with two ideas:
>
> 1) move them to a directory, which I will exclude from the index, but I'll
> have to change a bunch of links
>
> 2) detect them with some sort of flag and exclude them from the index. We
> were thinking that we could have a fake tag that lucene would detect and not
> index those pages.

Why not just have an exclude list of some sort? In the code you wrote to
select files for indexing, have it check each file against a list of
files you want to exclude. In the demo application, you would edit:

    jakarta-lucene/src/demo/org/apache/lucene/demo/IndexFiles.java

The quick and dirty method would be to edit this section of code:

    public static void indexDocs(IndexWriter writer, File file) throws Exception {
      if (file.isDirectory()) {
        String[] files = file.list();
        for (int i = 0; i < files.length; i++)
          indexDocs(writer, new File(file, files[i]));
      } else {
        System.out.println("adding " + file);
        writer.addDocument(FileDocument.Document(file));
      }
    }

To something like this:

    public static void indexDocs(IndexWriter writer, File file) throws Exception {
      if (file.isDirectory()) {
        String[] files = file.list();
        for (int i = 0; i < files.length; i++)
          indexDocs(writer, new File(file, files[i]));
      } else {
        // checkFileName() returns true for files that should be indexed
        if (!checkFileName(file)) {
          System.out.println("skipping " + file);
        } else {
          System.out.println("adding " + file);
          writer.addDocument(FileDocument.Document(file));
        }
      }
    }

    public static boolean checkFileName(File file) {
      String name = file.getName();
      // Compare with equals(); == tests object identity, not string contents
      if (name.equals("footer.html") ||
          name.equals("header.html") ||
          name.equals("menu.html") ||
          name.equals("navbar.html")) {
        return false;
      }
      return true;
    }

A more realistic implementation would use an "exclude file" of filenames
to ignore, load them into a collection (probably a HashSet) and keep that
collection around as an instance variable. Then checkFileName() just
returns !excludedSet.contains(name).

Steven J. Owens
[EMAIL PROTECTED]
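
As a follow-up to the exclude-file idea above, here is a minimal sketch of
what it might look like in present-day Java. The class name ExcludeList, the
load() helper, and the file format (one filename per line, blank lines and
lines starting with # ignored) are all assumptions made for illustration;
none of this is part of Lucene or its demo code.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: holds the set of filenames that should not be indexed.
public class ExcludeList {
    private final Set<String> excludedSet = new HashSet<>();

    // Build the set from the lines of an exclude file.
    // Assumed format: one filename per line; blank lines and
    // lines starting with '#' are skipped.
    public ExcludeList(List<String> lines) {
        for (String line : lines) {
            String name = line.trim();
            if (!name.isEmpty() && !name.startsWith("#")) {
                excludedSet.add(name);
            }
        }
    }

    // Convenience loader that reads the exclude file from disk.
    public static ExcludeList load(Path excludeFile) throws IOException {
        return new ExcludeList(Files.readAllLines(excludeFile));
    }

    // Returns true when the file should be indexed, false when it is excluded.
    // Only the last path component is checked, as in the quick version above.
    public boolean checkFileName(File file) {
        return !excludedSet.contains(file.getName());
    }
}
```

With something like this, indexDocs() would take an ExcludeList (or keep one
in a field) and skip any file for which checkFileName() returns false.
Because only the bare filename is matched, a footer.html in any directory is
skipped; matching on relative paths instead would be a small change if that
turns out to be too broad.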