Brian Rook <[EMAIL PROTECTED]> writes:
> The site I'm working on has a lot of small html files that are used for page
> construction (nav bars, footers, etc) and they're being returned high in the
> results because they contain the search term(s) I'm looking for and are
> small so they rank higher than larger documents.
>
> I want to exclude them from the index and I've come up with two ideas:
>
> 1) move them to a directory, which I will exclude from the index, but I'll
> have to change a bunch of links
>
> 2) detect them with some sort of flag and exclude them from the index. We
> were thinking that we could have a fake tag that lucene would detect and not
> index those pages.
Why not just have an exclude list of some sort? In the code you
wrote to select files for indexing, just have it check against a list
of files you want to exclude. In the demo application, you would edit
jakarta-lucene/src/demo/org/apache/lucene/demo/IndexFiles.java
The quick and dirty method would be to edit this section of code:
public static void indexDocs(IndexWriter writer, File file)
throws Exception {
if (file.isDirectory()) {
String[] files = file.list();
for (int i = 0; i < files.length; i++)
indexDocs(writer, new File(file, files[i]));
} else {
System.out.println("adding " + file);
writer.addDocument(FileDocument.Document(file));
}
}
To something like this:
public static void indexDocs(IndexWriter writer, File file)
throws Exception {
if (file.isDirectory()) {
String[] files = file.list();
for (int i = 0; i < files.length; i++)
indexDocs(writer, new File(file, files[i]));
} else {
if (checkFileName(file)) {
System.out.println("skipping " + file) ;
} else {
System.out.println("adding " + file);
writer.addDocument(FileDocument.Document(file));
}
}
}
public static boolean checkFileName(File file) {
String name = file.getName() ;
if (name == "footer.html" ||
name == "header.html" ||
name == "menu.html" ||
name == "navbar.html") {
return false ;
}
return true ;
}
A more realistic implementation would use an "exclude file" of
filenames to ignore, load them into a collection (probably a HashSet)
and keep that collection around as an instance variable. Then
checkFileName() just returns !excludedSet.contains(name).
Steven J. Owens
[EMAIL PROTECTED]
--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>