On Jun 6, 2009, at 9:22 AM, Marvin Humphrey wrote:

On Fri, Jun 05, 2009 at 02:42:46PM -0700, Father Chrysostomos wrote:

First, a bit of good news: I've managed to fix the current KS Highlighter sentence-boundary trimming implementation without needing to start over from scratch, and without causing any problems for the KSx::Highlight::Summarizer test suite. That means we don't have to conclude this discussion and finish the implementation to unblock a KS dev release. (: For better or worse. :)

I don’t know whether you are aware: I cheated and copied and pasted find_sentence_boundaries from KS r3122 into KSx::H::S, since I was in a hurry.

Would you extend the Analysis interface to allow for custom sentence
algorithms?

Since this is a tokenization task, Analyzer would be a logical place to turn. I think we'll need to make two passes over the text, one for search tokens and
one for sentences.

Do we actually need to extend Analyzer, though? I think we ought to avoid giving Analyzer a Find_Sentences() method. Instead, we can just create an Analyzer instance which tokenizes at sentence boundaries. Probably we'll want to create a dedicated SentenceTokenizer subclass, which would not be publicly
exposed.

I’ve just had an idea: Since we have 1) words, 2) sentences and 3) pages, why not multiple levels of vector information? Or multiple ‘sets’ (which could be orthogonal/overlapping)? Someone may want to include paragraphs or chapters, for instance. Just a thought....

Instead, we can turn TermVectorsWriter into a public HighlightWriter class and give it a Set_Sentence_Tokenizer() method. Extensibility would happen via
Architecture:

 package MyArchitecture;
 use base qw( KinoSearch::Architecture );

 sub register_highlight_writer {
   my ( $self, $seg_writer ) = @_;
   $self->SUPER::register_highlight_writer($seg_writer);
    my $hl_writer = $seg_writer->obtain("KinoSearch::Index::HighlightWriter");
   $hl_writer->set_sentence_tokenizer( MySentenceTokenizer->new );
 }

Or maybe $hl_writer->add_tokenizer( MySentenceTokenizer->new );
We may need to distinguish between ‘offset tokenisers’ and ‘term tokenisers’.


I think this approach will work provided that it's possible to use the same sentence boundary detection algo across most or all of the languages supported by Snowball. (Does the basic algo of splitting on /\.\s+/ work for Greek?)

Yes, except for the same problem that it causes in English: ‘M. Humphrey’ becomes two sentences. (As an aside, your default tokeniser doesn’t work with Greek, which can have mid-commas, but the only two words with mid-commas [ὅ,τι and ὅ,τιδηποτε] are stop-words, so I don’t worry about it.)
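To make the failure mode concrete, here is a minimal sketch of the basic split-on-/\.\s+/ algorithm (Python used purely for illustration, with a hypothetical helper name -- not KS code), showing how an abbreviation like "M. Humphrey" gets cleaved into two "sentences":

```python
import re

def naive_sentence_offsets(text):
    # Split on /\.\s+/ and return (offset, length) pairs for each
    # "sentence", measured in characters from the top of the text.
    spans = []
    start = 0
    for m in re.finditer(r'\.\s+', text):
        end = m.start() + 1  # keep the period with its sentence
        spans.append((start, end - start))
        start = m.end()
    if start < len(text):  # trailing sentence with no whitespace after it
        spans.append((start, len(text) - start))
    return spans

print(naive_sentence_offsets("Best. Joke. Ever."))
# [(0, 5), (6, 5), (12, 5)] -- works fine

print(naive_sentence_offsets("M. Humphrey wrote the patch."))
# [(0, 2), (3, 25)] -- the abbreviation wrongly starts a new sentence
```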

CJK users and others for whom our algo would fail would need to spec a custom Architecture -- though only if they want highlighting, since it's off by default. It's a bit more work for that class of user, but it prevents us from having to add clutter to the crucial core classes of Analyzer and Schema.

It will be somewhat wasteful if we use this SentenceTokenizer class to create full-fledged tokens when all we need is offsets, but I think we would handle further optimizations via natural extensions to either Analyzer or Inversion. I say "natural", because we would be merely repurposing the same offset information that Tokenizer normally feeds to Token's constructor, as opposed to glomming on a Find_Sentences() method which would apply a completely
different tokenizing algorithm.

Sounds good.

Could the sentences be numbered, so the final fragment has information about *which* sentence it came from? (I could use this for pagination.)

I think that would work. The current "DocVector" class needs to mutate into
"InvertedDoc" or something like that, and InvertedDoc needs to provide
sentence boundary information somehow.

We often need to use iterators for scaling purposes in KS/Lucy, but huge docs are problematic for highlighting anyway, so I think we can just go with two i32_t arrays: one each for sentence_offsets and sentence_lengths. In the index, we'd probably store this information as a string of delta-encoded

‘Delta-encoded’?

C32s
representing offset from the top of the field measured in Unicode code
points.

Source:

 "Best. Joke. Ever."

Search-time:

 $inverted_doc->get_sentence_offsets; # [ 0, 6, 12 ]
 $inverted_doc->get_sentence_lengths; # [ 5, 5, 5 ]

In the index:

 0, 5, 1, 5, 1, 5

That preserves your requested sentence numbering information through read time, accessible as array tick in the sentence_offsets and sentence_lengths
arrays.
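For concreteness, that interleaved delta scheme can be sketched like so (a Python illustration only -- in the index the values would of course be C32s, and offsets would be counted in Unicode code points):

```python
def delta_encode(offsets, lengths):
    # Interleave (gap, length) pairs, where each gap is measured from
    # the end of the previous sentence -- the scheme described above.
    out = []
    prev_end = 0
    for off, length in zip(offsets, lengths):
        out.append(off - prev_end)  # gap since the previous sentence ended
        out.append(length)
        prev_end = off + length
    return out

def delta_decode(encoded):
    # Recover the search-time arrays from the stored form.
    offsets, lengths = [], []
    pos = 0
    for i in range(0, len(encoded), 2):
        pos += encoded[i]
        offsets.append(pos)
        lengths.append(encoded[i + 1])
        pos += encoded[i + 1]
    return offsets, lengths

# "Best. Joke. Ever."
print(delta_encode([0, 6, 12], [5, 5, 5]))  # [0, 5, 1, 5, 1, 5]
print(delta_decode([0, 5, 1, 5, 1, 5]))     # ([0, 6, 12], [5, 5, 5])
```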

Perhaps if each Span were to include a reference to the original Query object which produced it? These would be primitives such as TermQuery and
PhraseQuery rather than compound queries like ANDQuery.  Would that
reference be enough to implement a preference for term diversity in the
excerpting algo?

There is one scenario I can think of where that *might* not work. If
someone searches for a list of keywords that includes the same keyword
twice (e.g., I sometimes copy and paste a sentence to find documents
with similar content), then there will be two TermQueries that are
identical but considered different.

All Query classes should be implementing the Equals() method so that logically
equivalent objects can be identified.  Does that address your concern?

Yes.
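To spell out the value-based equality idea: two distinct query objects built from the same keyword compare as logically equivalent, so a highlighter can treat their spans as coming from one query. A toy sketch (Python, not the actual KS Query API):

```python
class TermQuery:
    # Toy stand-in for a term query with a value-based equals(), to
    # illustrate identifying logically equivalent query objects.
    def __init__(self, field, term):
        self.field = field
        self.term = term

    def equals(self, other):
        return (isinstance(other, TermQuery)
                and self.field == other.field
                and self.term == other.term)

# Pasting the same keyword twice yields two distinct objects...
a = TermQuery("content", "joke")
b = TermQuery("content", "joke")
print(a is b)       # False -- different objects
print(a.equals(b))  # True  -- logically equivalent
```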

We'll probably want to reference the Compiler/Weight rather than the original Query; right now in KS I don't think I have Equals() implemented for any
Compiler classes, but that shouldn't be hard to finish.  [1]

Maybe this won’t matter because the duplicate term should have extra
weight. I haven’t thought this through.

I think the only way we'll nail the extensibility aspect of this design is if
we build working implementations for multiple highlighting algorithms.

Probably your Summarizer and a class which implements the
term-diversity-preferring algo described by Michael Busch and Mike McCandless
from LUCENE-1522 would be enough.

I would like to make Summarizer value term diversity, so we’ll be left with one. I could make it an option instead.


And might that information come in handy for other excerpting algos?

As long as the supplied Term/PhraseQuery is the original object, and
not a clone, I think it would.

I think you say that because of the equivalence question, right?

Yes.

The KS Highlighter creates its own internal Compiler object using the supplied "searchable" and "query" constructor args. The DocVector/InvertedDoc has to be able to go over the network, but the score spans won't -- so each score span would always be pointing to some sub-component of that local Compiler
object.

I'm not entirely satisfied with this approach. The Span class has been simple up till now -- it *could* have been sent over the network with no problem. Bloating it up with a reference to the Query/Compiler makes it both less
general and less transportable.

How about $compiler->give_me_the_query_for($span)? (with a better method name, of course.) Or would that make Compiler too complex, since it would have to store a hash (or equivalent) in addition to its array of spans?

But I thought queries could be sent over the network.


Father Chrysostomos
