jenkins-bot has submitted this change and it was merged.
Change subject: Docs and cleanup
......................................................................
Docs and cleanup
Lots of work on how_it_works.md. Should be easier to understand.
Some cleanup of spacing and Javadocs.
Added a helper method to make the examples simpler.
Change-Id: I71d8a2118ce1745c5a78f67dd52d8760bb0390fe
---
M docs/how_it_works.md
M
experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/HitEnum.java
M
experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/SnippetFormatter.java
M
experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/hit/BreakIteratorHitEnum.java
M
experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/hit/weight/NoSourceTermSourceFinder.java
M
experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/snippet/AbstractBasicSnippetChooser.java
M
experimental-highlighter-core/src/test/java/org/wikimedia/search/highlighter/experimental/hit/BreakIteratorHitEnumTest.java
7 files changed, 233 insertions(+), 62 deletions(-)
Approvals:
Manybubbles: Looks good to me, approved
jenkins-bot: Verified
diff --git a/docs/how_it_works.md b/docs/how_it_works.md
index dd457d9..2163bc6 100644
--- a/docs/how_it_works.md
+++ b/docs/how_it_works.md
@@ -1,71 +1,245 @@
How does it work?
=================
-The highlighter has a few key types of components: <dl>
-<dt>HitEnum</dt><dd>Enumerates hits (aka matches)</dd>
-<dt>SnippetChooser</dt><dd>Uses HitEnum to pick snippets</dd>
-<dt>Segmenter</dt><dd>Plugged into AbstractBasicSnippetChooser implementations
-to pick where snippets (aka fragments) begin and end.</dt>
-</dl>
+This highlighter was created to unify some of the good ideas in the three
+highlighters already in Lucene:
+* Finding hits should be possible by:
+ * Reanalyzing the field's contents
+ * Reading term vectors
+ * Reading offsets from the postings
+* Snippets should be split on:
+ * Sentence boundaries
+ * Word boundaries
+ * Some arbitrary list of characters that is quick to scan
+To do that this project builds a set of composable components and an
+Elasticsearch plugin that strings them all together.
+
+From front to back:
+* the query is decomposed into a tree of ```HitEnums```
+* which are iterated by a ```SnippetChooser```
+* which determines snippet boundaries using a ```Segmenter```
+* and scores the ```Snippets``` using a ```SnippetWeigher```
+* ultimately producing a ```List<Snippet>```
+* which is then turned into ```String```s by a ```SnippetFormatter```.
HitEnum
-------
-Comes in two flavors:<dl>
-<dt>Plain HitEnums</dt>
-<dd>Actually pulls hits from the source document. Examples are
- DocsAndPositionsHitEnum, TokenStreamHitEnum, and BreakIteratorHitEnum.</dd>
-<dt>Transforming HitEnums</dt>
-<dd>Wraps and transforms one or more HitEnums. Some transforms are simple like
- WeightFilterHitEnumWrapper or PositionBoostingHitEnumWrapper. Some are
- much more involved like PhraseHitEnumWrapper and
- MergingHitEnum. HitEnums that wrap a single HitEnum should be named
- FooHitEnumWrapper. I don't have a consistent naming scheme for those that
- wrap more than one HitEnum.</dd>
-</dl>
+Comes in two flavors:
+* Source ```HitEnums``` pull hits from the source document. Examples are
+```DocsAndPositionsHitEnum```, ```TokenStreamHitEnum```, and
+```BreakIteratorHitEnum```.
+* Transforming ```HitEnums``` wrap and transform one or more ```HitEnums```.
+Some transforms are simple like ```WeightFilteredHitEnumWrapper``` or
+```PositionBoostingHitEnumWrapper```. Some are much more involved like
+```PhraseHitEnumWrapper``` and ```MergingHitEnum```. ```HitEnums``` that wrap a
+single ```HitEnum``` should be named ```FooHitEnumWrapper```. There isn't a
+consistent naming scheme for those that wrap more than one ```HitEnum```.
+Enum in this context is a riff on Lucene's ```TermsEnum```. It's just a
+convenient way of iterating without returning whole objects at each iteration.
+It looks like:
+
+```java
+public interface HitEnum {
+ /**
+ * Move the enum to the next hit.
+ *
+ * @return is there a next hit (true) or was the last one the final hit
+ * (false)
+ */
+ boolean next();
+
+ /**
+ * Ordinal position relative to the other terms in the text. Starts at 0.
+ */
+ int position();
+
+ /**
+ * The start offset of the current term within the text.
+ */
+ int startOffset();
+
+ ...
+}
+```
+It's not as fancy as ```TermsEnum``` with its nifty ```AttributeSource``` but it
+gets the job done.
+
+The idea is that you make one or more source ```HitEnums``` and wrap them in
+transforming ```HitEnums```. For testing most things we use
+```BreakIteratorHitEnum``` for the source ```HitEnum``` because it's simple. In
+the Elasticsearch plugin we use ```TokenStreamHitEnum``` when we have to
+reanalyze a string and ```DocsAndPositionsHitEnum``` when we can read the hits
+from term vectors or the postings list. It also uses ```RegexHitEnum``` and
+```AutomatonHitEnum``` for regular expression highlighting.
+
+The idea is to build a tree of ```HitEnum```s so that you can ultimately pass
+a single one to the ```SnippetChooser``` and they all work together to iterate
+the hits. For example, you can limit the highlighting to terms that appear in a
+phrase by building a chain of ```HitEnum```s like this:
+
+```java
+// Note real constructors are more complex. Sorry.
+String str = "I like cats but I don't like fish";
+HitEnum e = BreakIteratorHitEnum.englishWords(str);
+e = new PhraseHitEnumWrapper(e, 10, "like", "cats");
+e = new WeightFilteredHitEnumWrapper(e, 1);
+while (e.next()) {
+ System.out.println(str.substring(e.startOffset(), e.endOffset()));
+}
+// Prints:
+// like
+// cats
+```
+
+That chain of enums:
+1. Returns every word in a sentence with a score of 1.
+1. Sets the score of the term ```like``` followed by ```cats``` to 10.
+1. Filters out all hits with score not greater than 1.
+
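To make the wrapping pattern concrete, here is a minimal, self-contained sketch. It is an illustration only, not the project's code: `MiniHitEnum`, `WordSource`, `TermBoostWrapper`, `WeightFilterWrapper` and `MiniChainDemo` are hypothetical names. It also simplifies step 2 by boosting individual terms rather than a phrase, so the second `like` survives the filter here, unlike in the phrase example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, cut-down stand-in for HitEnum (not the project's interface). */
interface MiniHitEnum {
    boolean next();   // advance; false once the hits are exhausted
    String term();    // text of the current hit
    float weight();   // weight of the current hit
}

/** Source enum: one hit per whitespace-separated word, each with weight 1. */
final class WordSource implements MiniHitEnum {
    private final String[] words;
    private int i = -1;
    WordSource(String text) { words = text.trim().split("\\s+"); }
    public boolean next() { return ++i < words.length; }
    public String term() { return words[i]; }
    public float weight() { return 1f; }
}

/** Transforming enum: overrides the weight of the listed terms. */
final class TermBoostWrapper implements MiniHitEnum {
    private final MiniHitEnum in;
    private final float boost;
    private final Set<String> terms;
    TermBoostWrapper(MiniHitEnum in, float boost, String... terms) {
        this.in = in;
        this.boost = boost;
        this.terms = new HashSet<>(Arrays.asList(terms));
    }
    public boolean next() { return in.next(); }
    public String term() { return in.term(); }
    public float weight() { return terms.contains(in.term()) ? boost : in.weight(); }
}

/** Transforming enum: skips hits whose weight is not greater than the cutoff. */
final class WeightFilterWrapper implements MiniHitEnum {
    private final MiniHitEnum in;
    private final float cutoff;
    WeightFilterWrapper(MiniHitEnum in, float cutoff) { this.in = in; this.cutoff = cutoff; }
    public boolean next() {
        while (in.next()) {
            if (in.weight() > cutoff) return true;
        }
        return false;
    }
    public String term() { return in.term(); }
    public float weight() { return in.weight(); }
}

final class MiniChainDemo {
    /** Drains an enum and returns the terms it produced, in order. */
    static List<String> collect(MiniHitEnum e) {
        List<String> out = new ArrayList<>();
        while (e.next()) out.add(e.term());
        return out;
    }

    public static void main(String[] args) {
        MiniHitEnum e = new WordSource("I like cats but I don't like fish");
        e = new TermBoostWrapper(e, 10f, "like", "cats");
        e = new WeightFilterWrapper(e, 1f);
        System.out.println(collect(e)); // [like, cats, like]
    }
}
```

Each wrapper only forwards to the enum it wraps, so the caller never needs to know how deep the chain is.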
+But wait, there's more! You can merge multiple HitEnums! This is really
+important because when you load hits from Lucene's postings or term vectors
+they are only for a single term. So it looks like this:
+
+```java
+// Note real constructors are more complex. Sorry.
+// Assumes str holds the text of the field, as in the previous example.
+HitEnum l = new DocsAndPositionsHitEnum(reader, docId, field, "like");
+HitEnum c = new DocsAndPositionsHitEnum(reader, docId, field, "cats");
+HitEnum m = new MergingHitEnum(ImmutableList.of(l, c),
+ HitEnum.LessThans.POSITION);
+HitEnum p = new PhraseHitEnumWrapper(m, 10, "like", "cats");
+HitEnum f = new WeightFilteredHitEnumWrapper(p, 1);
+while (f.next()) {
+ System.out.println(str.substring(f.startOffset(), f.endOffset()));
+}
+// Prints:
+// like
+// cats
+```
+
+This creates a little tree of ```HitEnum```s like this:
+
+```
+l c
+ \ /
+ \ /
+ m
+ |
+ |
+ p
+ |
+ |
+ f
+```
+
+Pulling on ```f``` pulls on ```p``` which pulls on ```m``` which pulls on
+```l``` or ```c```. It gets much more complicated in the real world with
+potentially multiple levels of merging and filtering.
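The merge step can be pictured as an ordinary two-way merge over per-term position lists. The sketch below is an assumption-laden illustration, not the `MergingHitEnum` implementation: `PositionMergeSketch` is a hypothetical name, and it works on plain `int[]` positions instead of `HitEnum`s.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionMergeSketch {
    /**
     * Merges two ascending position arrays (one per term) into a single
     * position-ordered list, always pulling from whichever side is behind.
     */
    static List<String> merge(int[] likePositions, int[] catsPositions) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < likePositions.length || j < catsPositions.length) {
            boolean takeLike = j >= catsPositions.length
                    || (i < likePositions.length && likePositions[i] <= catsPositions[j]);
            if (takeLike) {
                out.add(likePositions[i++] + ":like");
            } else {
                out.add(catsPositions[j++] + ":cats");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Word positions in "I like cats but I don't like fish": like = 1 and 6, cats = 2.
        System.out.println(merge(new int[] {1, 6}, new int[] {2})); // [1:like, 2:cats, 6:like]
    }
}
```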
SnippetChooser
--------------
-There are two implementations:<dl>
+```SnippetChooser```s are responsible for pulling on the HitEnum and picking
+the right List<Snippet> to return.
+There are two concrete implementations:<dl>
<dt>BasicScoreBasedSnippetChooser</dt>
-<dd>Score ordered and score cutoff snippets. Keeps snippets in a priority
+<dd>Score ordered and score cutoff snippets. Keeps snippets in a priority
queue then sorts them in document order to pick bounds so there is no
overlap then optionally sorts them in score order. Worst case performance
is ```O(n*log(m) + m*log(m))``` where n is number of snippets found and m
is number of snippets requested. The first term is scanning all the
- segments and the second is the sorts. You can put an upper bound on n by
+ segments and the second is the sorts. You can put an upper bound on n by
setting ```maxSnippetsChecked``` which is piped through to Elasticsearch as
- ```max_fragments_scored```.</dd>
+ ```max_fragments_scored```. You can put an upper bound on m with the
+ ```max``` parameter on ```SnippetChooser.choose``` which is piped through
+ Elasticsearch as ```number_of_fragments```.</dd>
<dt>BasicSourceOrderSnippetChooser</dt>
-<dd>Source order fragments. Can be much faster because it can exit after
- hitting the first snippet.</dd>
+<dd>Source order fragments. Can be much faster because it can exit after
+ hitting ```number_of_fragments``` snippets.</dd>
</dl>
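The ```O(n*log(m))``` term comes from keeping only the best m candidates in a bounded priority queue while scanning all n. A minimal sketch of that idea follows, assuming a hypothetical `Candidate` type with just a start offset and a score; it is not `BasicScoreBasedSnippetChooser` itself and skips the bounds-picking step.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopSnippetsSketch {
    /** Hypothetical candidate snippet: where it starts and how it scored. */
    static final class Candidate {
        final int startOffset;
        final float score;
        Candidate(int startOffset, float score) {
            this.startOffset = startOffset;
            this.score = score;
        }
    }

    /** Keeps the max best-scoring candidates, returned in document order. */
    static List<Candidate> choose(List<Candidate> found, int max) {
        if (max <= 0) {
            return new ArrayList<>();
        }
        // Min-heap on score: the weakest kept candidate sits at the head.
        PriorityQueue<Candidate> best =
                new PriorityQueue<>(max, Comparator.comparingDouble((Candidate c) -> c.score));
        for (Candidate c : found) {              // n iterations, O(log m) each
            if (best.size() < max) {
                best.add(c);
            } else if (c.score > best.peek().score) {
                best.poll();
                best.add(c);
            }
        }
        List<Candidate> result = new ArrayList<>(best);
        result.sort(Comparator.comparingInt((Candidate c) -> c.startOffset)); // O(m*log(m))
        return result;
    }
}
```

Sorting the survivors by `score` afterwards would give the optional score ordering at the same ```O(m*log(m))``` cost.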
Segmenter
---------
-Four major implementations:<dl>
-<dt>CharScanningSegmenter</dt>
-<dd>FastVectorHighlighter like character scanning. Usually the fastest choice
- for large text.</dd>
-<dt>BreakIteratorSegmenter</dt>
-<dd>PostingHighlighter like sentence breaks. Slower but sometimes prettier.
- Suffers if text isn't 100% prose and/or some sentences are hugely
- long.</dd>
-<dt>WholeSourceSegmenter</dt>
-<dd>Doesn't break the source at all.</dd>
-<dt>MultiSegmenter</dt>
-<dd>Wraps many segmenters adding hard stops between them. The life blood of
- multi valued fields.</dd>
-</dl>
+The ```SnippetChooser``` uses the Segmenter to decide if two hits are part of
+the same segment and, once it's picked the best segments, to find the bounds of
+the segment. That is a lot of words so have an example:
-The Elasticsearch plugin adds a DelayedSegmenter which constructs one of the
-first three segmenters lazily to prevent loading the field until the first hit
-is found.
+Say you are highlighting the words ```like cats``` in the paragraph:
+
+```
+*Cats* are just super duper dandy. Even when they scratch and bite I just *like*
+*cats* so much! Man. I *like* *cats*.
+```
+
+Assuming you don't want to return the whole silly paragraph, it's the
+```Segmenter```'s job to decide where to break the paragraph. There are four
+major implementations:
+
+The ```CharScanningSegmenter``` does ```FastVectorHighlighter``` like character
+scanning. Usually it's the fastest choice for large text. It would segment the
+paragraph like this:
+
+```
+*Cats* are just super
+duper dandy. Even when
+they scratch and bite I
+just *like* *cats* so
+much! Man. I *like* *cats*.
+```
+
+The ```BreakIteratorSegmenter``` does ```PostingHighlighter``` like sentence or
+paragraph breaks. It's slower but sometimes prettier. It isn't prettier if the
+text isn't 100% prose or some sentences are hugely long. It would segment the
+paragraph like this:
+
+```
+*Cats* are just super duper dandy.
+Even when they scratch and bite I just *like* *cats* so much!
+Man.
+I *like* *cats*.
+```
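Sentence-level breaks of this kind can be reproduced with the JDK's own `java.text.BreakIterator`. The sketch below only illustrates the splitting; `SentenceSegmentsSketch` is a hypothetical name, and how `BreakIteratorSegmenter` actually drives its iterator is not shown here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSegmentsSketch {
    /** Splits text at the sentence boundaries reported by the JDK's BreakIterator. */
    static List<String> sentences(String text) {
        BreakIterator itr = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        itr.setText(text);
        List<String> out = new ArrayList<>();
        int start = itr.first();
        for (int end = itr.next(); end != BreakIterator.DONE; start = end, end = itr.next()) {
            // Boundaries include trailing whitespace, so trim each piece.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("Cats are just super duper dandy. Man. I like cats.")) {
            System.out.println(s);
        }
    }
}
```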
+
+The ```WholeSourceSegmenter``` never breaks the source. It always returns the
+whole source as a single segment.
+
+The ```MultiSegmenter``` wraps many ```Segmenters``` adding hard stops between
+them. It's used to prevent segments from spanning multiple values on the same
+field.
+
+The Elasticsearch plugin adds a ```DelayedSegmenter``` which constructs one of
+the first three ```Segmenters``` lazily to prevent loading the field until the
+first hit is found.
-Others
----------------
-SourceExtracters are responsible for extracting the snippets from the source
-once they are identified.
+Other
+-----
+The other components listed in the walk-through offer less interesting options
+but they are covered below for completeness' sake.
+
+The ```SnippetChooser``` delegates scoring each segment to a
+```SnippetWeigher```. There are two ```SnippetWeighers```:
+* ```SumSnippetWeigher``` just adds up the weight of all of the hits in the
+snippet.
+* ```ExponentialSnippetWeigher``` tries to rate snippets that contain different
+terms more highly by only increasing the score for repeated terms by
+
+The ```Snippets``` are turned into strings by a ```SnippetFormatter``` with the
+help of a ```SourceExtracter```. There are two concrete
+```SnippetFormatter```s:
+* ```SnippetFormatter.Default```: Adds html tags around the hits like
+```<em>```.
+* ```OffsetSnippetFormatter```: Returns the offsets into the text of the hit
+and the rest of the snippet. Faster because it doesn't have to load the text
+but useless in many cases. Even in cases where it can be used you have to be
+very careful of character encodings as those offsets are for Java strings.
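The encoding caveat is easy to trip over: Java string offsets count UTF-16 code units, so they drift from code point or byte offsets as soon as the text contains characters outside the Basic Multilingual Plane. A small sketch of converting them, assuming the offset does not fall in the middle of a surrogate pair (`OffsetEncodingSketch` and its methods are hypothetical helpers, not part of the highlighter):

```java
import java.nio.charset.StandardCharsets;

final class OffsetEncodingSketch {
    /** Converts a Java (UTF-16 code unit) offset into a Unicode code point offset. */
    static int toCodePointOffset(String text, int utf16Offset) {
        return text.codePointCount(0, utf16Offset);
    }

    /** Converts a Java (UTF-16 code unit) offset into a UTF-8 byte offset. */
    static int toUtf8ByteOffset(String text, int utf16Offset) {
        return text.substring(0, utf16Offset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1F431 (cat face) takes two UTF-16 code units and four UTF-8 bytes.
        String text = "I \uD83D\uDC31 cats";
        int javaOffset = text.indexOf("cats");
        System.out.println(javaOffset);                          // 5
        System.out.println(toCodePointOffset(text, javaOffset)); // 4
        System.out.println(toUtf8ByteOffset(text, javaOffset));  // 7
    }
}
```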
+
+```SourceExtracter```s turn offsets into substrings. There are several
+implementations:
+* ```StringSourceExtracter``` wraps a string and calls substring.
+* ```NonMergingMultiSourceExtracter``` wraps many ```SourceExtracter```s and
+makes them look like one ```SourceExtracter``` with the offsets strung
together.
+If an extraction tries to straddle multiple wrapped ```SourceExtracter```s
+then it throws an exception.
+* ```StringMergingMultiSourceExtracter``` is just like
+```NonMergingMultiSourceExtracter``` but if extractions straddle multiple
+wrapped ```SourceExtracter```s then the extracted strings are merged.
diff --git
a/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/HitEnum.java
b/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/HitEnum.java
index 224b2ff..c6d8eb8 100644
---
a/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/HitEnum.java
+++
b/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/HitEnum.java
@@ -21,16 +21,6 @@
int position();
/**
- * The start offset of the current term within the text.
- */
- int startOffset();
-
- /**
- * The end offset of the current term within the text.
- */
- int endOffset();
-
- /**
* Weight of the hit from the query definition. This stores the weight that
* the user placed on the search term. Only positive numbers are valid.
*/
diff --git
a/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/SnippetFormatter.java
b/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/SnippetFormatter.java
index 942de27..2ffed0b 100644
---
a/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/SnippetFormatter.java
+++
b/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/SnippetFormatter.java
@@ -30,6 +30,5 @@
b.append(extracter.extract(lastWritten, snippet.endOffset()));
return b.toString();
}
-
}
}
diff --git
a/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/hit/BreakIteratorHitEnum.java
b/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/hit/BreakIteratorHitEnum.java
index 64f4819..33042de 100644
---
a/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/hit/BreakIteratorHitEnum.java
+++
b/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/hit/BreakIteratorHitEnum.java
@@ -1,6 +1,7 @@
package org.wikimedia.search.highlighter.experimental.hit;
import java.text.BreakIterator;
+import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
@@ -25,6 +26,16 @@
return new RepairedHitEnum(e, source);
}
+ /**
+ * Builds a HitEnum that returns one hit per word as segmented by the
+ * BreakIterator for English with a score of 1.
+ */
+ public static HitEnum englishWords(String str) {
+ BreakIterator itr = BreakIterator.getWordInstance(Locale.ENGLISH);
+ itr.setText(str);
+ return repair(new BreakIteratorHitEnum(itr), str);
+ }
+
private final BreakIterator itr;
private final HitWeigher queryWeigher;
private final HitWeigher corpusWeigher;
diff --git
a/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/hit/weight/NoSourceTermSourceFinder.java
b/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/hit/weight/NoSourceTermSourceFinder.java
index 582a4be..b42c8d7 100644
---
a/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/hit/weight/NoSourceTermSourceFinder.java
+++
b/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/hit/weight/NoSourceTermSourceFinder.java
@@ -3,7 +3,7 @@
import org.wikimedia.search.highlighter.experimental.hit.TermSourceFinder;
/**
- * Finds no source (0) for any terms.
+ * Finds no source (0) for all terms.
*/
public class NoSourceTermSourceFinder<T> implements TermSourceFinder<T> {
@Override
diff --git
a/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/snippet/AbstractBasicSnippetChooser.java
b/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/snippet/AbstractBasicSnippetChooser.java
index fe2ddad..7edfc8e 100644
---
a/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/snippet/AbstractBasicSnippetChooser.java
+++
b/experimental-highlighter-core/src/main/java/org/wikimedia/search/highlighter/experimental/snippet/AbstractBasicSnippetChooser.java
@@ -7,8 +7,8 @@
import org.wikimedia.search.highlighter.experimental.HitEnum;
import org.wikimedia.search.highlighter.experimental.Segmenter;
import org.wikimedia.search.highlighter.experimental.Snippet;
-import org.wikimedia.search.highlighter.experimental.SnippetChooser;
import org.wikimedia.search.highlighter.experimental.Snippet.Hit;
+import org.wikimedia.search.highlighter.experimental.SnippetChooser;
public abstract class AbstractBasicSnippetChooser<S> implements SnippetChooser
{
private final Snippet.HitBuilder hitBuilder;
diff --git
a/experimental-highlighter-core/src/test/java/org/wikimedia/search/highlighter/experimental/hit/BreakIteratorHitEnumTest.java
b/experimental-highlighter-core/src/test/java/org/wikimedia/search/highlighter/experimental/hit/BreakIteratorHitEnumTest.java
index dea6e7a..16eb47e 100644
---
a/experimental-highlighter-core/src/test/java/org/wikimedia/search/highlighter/experimental/hit/BreakIteratorHitEnumTest.java
+++
b/experimental-highlighter-core/src/test/java/org/wikimedia/search/highlighter/experimental/hit/BreakIteratorHitEnumTest.java
@@ -17,10 +17,7 @@
public class BreakIteratorHitEnumTest extends AbstractHitEnumTestBase {
@Override
protected HitEnum buildEnum(String str) {
- BreakIterator itr = BreakIterator.getWordInstance(Locale.ENGLISH);
- itr.setText(str);
- return BreakIteratorHitEnum.repair(new BreakIteratorHitEnum(itr),
- str);
+ return BreakIteratorHitEnum.englishWords(str);
}
@Test
--
To view, visit https://gerrit.wikimedia.org/r/222608
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: merged
Gerrit-Change-Id: I71d8a2118ce1745c5a78f67dd52d8760bb0390fe
Gerrit-PatchSet: 2
Gerrit-Project: search/highlighter
Gerrit-Branch: master
Gerrit-Owner: Manybubbles <[email protected]>
Gerrit-Reviewer: DCausse <[email protected]>
Gerrit-Reviewer: EBernhardson <[email protected]>
Gerrit-Reviewer: Manybubbles <[email protected]>
Gerrit-Reviewer: jenkins-bot <>
_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits