> From: Lee Mallabone [mailto:[EMAIL PROTECTED]]
>
> > How did the title ever get indexed as the title?
>
> I'm indexing HTML documents marked up with comments to indicate field
> boundaries. So I'd typically have:
>
> <!--field:section_title-->
> blurb
> <!--field:text-->
> more blurb
>
> and so on. The documents were indexed by looking for each field marker
> and then adding the subsequent lines to the relevant field.
>
> In order to obtain a generic solution for context generation

If you're doing application-specific processing to extract fields from
documents, then a completely generic solution for extracting hit context
from documents is, by definition, impossible, since context extraction
requires field extraction.

> are you
> suggesting I write a method that takes plain text, (eg, text form of
> document) and a query, and assumes the plain text is in the query's
> default field?

I'm not exactly sure what you're proposing here, but, no, it doesn't
sound like something that I have suggested.

> This doesn't seem quite as useful as getContext(Hashset queryTerms,
> Reader originalDocument); which is what I was originally
> aiming towards.

Such a method is easy to define if the Reader contains text from a
single field. (Although you should probably pass in an Analyzer too.)
However if you're expecting such a method to automatically divide the
text into fields, then things will be harder, since Lucene's model is
that applications divide documents into fields.

So you could write an application-specific version that divides fields
automatically, or, to use more generic code, you could call such a
generic method once for each field of your document, leaving field
extraction in application-specific code.

Does that make sense?

Doug
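
The split described above (application-specific field extraction, then a
generic context method called once per field) might look roughly like the
sketch below. It is plain Java with no Lucene classes, and everything named
in it is an assumption made for illustration: the class FieldContext, the
splitFields helper, the window parameter, and the use of a String rather
than a Reader. The lowercase, whitespace-based token matching is only a
stand-in for a real Analyzer; in practice you would want the same analysis
that was used at index time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from Lucene.
class FieldContext {

    private static final String OPEN = "<!--field:";
    private static final String CLOSE = "-->";

    // Application-specific step: split a document on <!--field:NAME--> marker lines.
    // Lines that appear before the first marker are ignored.
    static Map<String, String> splitFields(String document) {
        Map<String, String> fields = new LinkedHashMap<>();
        String current = null;
        StringBuilder text = new StringBuilder();
        for (String line : document.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(OPEN) && trimmed.endsWith(CLOSE)) {
                if (current != null) {
                    fields.put(current, text.toString().trim());
                }
                current = trimmed.substring(OPEN.length(), trimmed.length() - CLOSE.length());
                text.setLength(0);
            } else if (current != null) {
                text.append(line).append('\n');
            }
        }
        if (current != null) {
            fields.put(current, text.toString().trim());
        }
        return fields;
    }

    // Generic step: for the text of ONE field, return a snippet of up to
    // `window` tokens on each side of every token that matches a query term.
    // Query terms are assumed to be lowercase already.
    static List<String> getContext(Set<String> queryTerms, String fieldText, int window) {
        String[] tokens = fieldText.trim().split("\\s+");
        List<String> snippets = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // Crude normalization standing in for an Analyzer.
            String normalized = tokens[i].toLowerCase().replaceAll("[^a-z0-9]", "");
            if (queryTerms.contains(normalized)) {
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.length, i + window + 1);
                snippets.add(String.join(" ", Arrays.copyOfRange(tokens, from, to)));
            }
        }
        return snippets;
    }

    public static void main(String[] args) {
        String doc = "<!--field:section_title-->\nblurb\n<!--field:text-->\nmore blurb\n";
        Set<String> terms = new HashSet<>(Arrays.asList("blurb"));
        // Field extraction stays in application code; the generic method runs per field.
        for (Map.Entry<String, String> field : splitFields(doc).entrySet()) {
            System.out.println(field.getKey() + ": " + getContext(terms, field.getValue(), 2));
        }
    }
}
```

Only splitFields knows about the comment markers, so a different markup
convention would mean replacing that one method while getContext stays as
it is.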
