What you are doing is really the job of an Analyzer. You are doing pre-analysis, when instead you could do all of this within the context of a custom analyzer and avoid many of these issues altogether.

Do you use the XML only during indexing? If so, you could bypass the whole conversion to XML and then back through Digester all within an analyzer.

Or am I missing something that prevents you from doing it this way?

Erik


On Feb 28, 2004, at 10:05 PM, Roy Klein wrote:
Erik,
Here's a brief example of the type of thing I'm trying to do:

I have a file that contains the words:

The quick brown fox jumped over the lazy dog.

I run that file through a utility that produces the following xml
document:
<document>
  <field name=wordposition1>
    <word>The</word>
  </field>
  <field name=wordposition2>
    <word>quick</word>
    <word>fast</word>
    <word>speedy</word>
  </field>
  <field name=wordposition3>
    <word>brown</word>
    <word>tan</word>
    <word>dark</word>
  </field>
  .
  .
  .

I parse that document (via the digester), and add all the words from
each of the fields to one lucene field: "contents".  The tricky part is
that I want to have each word position contain all the words at that
position in the lucene index.  I.e. word location 1 in the index
contains "The", word location 2: "quick, fast, and speedy", word
location 3: "brown, tan, and dark", etc.

That way, all the following phrase queries will match this document:
        "fast tan"
        "quick brown"
      "fast brown"

I wrote a "TermAnalyzer" that adds all the words from a field into the
index at the same position. (via setPositionIncrement(0)). That way I
can simply add each set of words to the "contents" field, and it'll just
keep adding them to the same field. However, since it's reversing them,
I can't match phrases.



Roy


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to