It's my understanding that the tokens in a token_stream consist of text
along with start/stop positions giving the byte offsets of that text
within the corresponding document field. The documentation I've been
reading (O'Reilly's Ferret book, page 67) suggests that these byte
positions are relative to the entire field, but based on my testing it
appears they are relative to the line that contains the text. I load my
fields following Brian McCallister's example:

      index.add_document :file => path,
                         :content => file.readlines
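
For what it's worth, readlines hands back an array with one string per
line rather than a single string, which would be consistent with each
line being tokenized on its own. A minimal sketch (using StringIO to
stand in for the file):

```ruby
require 'stringio'

file = StringIO.new("this is a\nsentence\n")

# readlines returns an array of per-line strings
lines = file.readlines        # => ["this is a\n", "sentence\n"]

# read returns the whole contents as one string
file.rewind
whole = file.read             # => "this is a\nsentence\n"
```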


Hence, if I have a file that contains line breaks, the token positions
are reset at each new line. For example, the following file contents
(File A)
          this is a sentence
will result in a token for the text "sentence" with start position 10
(assuming "this" starts at position 0), while a file with a line break
          this is a
          sentence
will result in a token for the text "sentence" with start position 0. I
get the same results with my custom tokenizer and with
StandardTokenizer. This does not seem consistent with the documentation,
and more importantly, global positions seem more useful than line-based
positions (e.g., for highlighting).
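
To make the two conventions concrete, here is a pure-Ruby sketch (no
Ferret involved) of the offsets I would expect for the token "sentence"
in the two example files above:

```ruby
one_line = "this is a sentence"
two_line = "this is a\nsentence"

# Global offset: byte position within the entire field value.
one_line.index("sentence")   # => 10
two_line.index("sentence")   # => 10

# Line-relative offset: position within the containing line,
# which is what I actually appear to be getting back.
line = two_line.lines.find { |l| l.include?("sentence") }
line.index("sentence")       # => 0
```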

Digging a little deeper, it seems that the tokenizer's initialize method
is called each time the token_stream method of the containing analyzer
is called:

class CustomAnalyzer
  def token_stream(field, str)
    # a fresh StandardTokenizer (with offsets starting at 0) on every call
    StandardTokenizer.new(str)
  end
end
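
If that's the case, and the field value is the array produced by
readlines, then a new tokenizer would be built for every line. A
hypothetical stub (CountingAnalyzer is my own illustration, not part of
Ferret) showing the call pattern I suspect:

```ruby
# Counts how many times token_stream is invoked for one field value.
class CountingAnalyzer
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def token_stream(field, str)
    @calls += 1
    str # a real analyzer would construct and return a new tokenizer here
  end
end

analyzer = CountingAnalyzer.new
["this is a\n", "sentence\n"].each do |line|
  analyzer.token_stream(:content, line)
end
analyzer.calls # => 2, i.e., one fresh tokenizer per line
```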

Am I missing something here? Are the start/stop byte positions intended
to be relative to the line? Is there a way for token_stream to be called
only once for an entire string (even if it contains line breaks)?

Thanks,
John
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
