This is an automated email from the ASF dual-hosted git repository.
aradzinski pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-nlpcraft-website.git
The following commit(s) were added to refs/heads/master by this push:
new b8f78aa WIP.
b8f78aa is described below
commit b8f78aa4fcbf2e78ff62207ef1b2a691fe97e514
Author: Aaron Radzinski <[email protected]>
AuthorDate: Sun Jan 17 20:09:19 2021 -0800
WIP.
---
blogs/how_to_find_something_in_the_text.html | 390 ++-------------------------
images/how_to_find_something_fig1.png | Bin 0 -> 35552 bytes
2 files changed, 24 insertions(+), 366 deletions(-)
diff --git a/blogs/how_to_find_something_in_the_text.html
b/blogs/how_to_find_something_in_the_text.html
index dea6d49..9644443 100644
--- a/blogs/how_to_find_something_in_the_text.html
+++ b/blogs/how_to_find_something_in_the_text.html
@@ -37,382 +37,40 @@ publish_date: January 20, 2021
such as dates, countries, and cities, as well as domain-specific entities for your
model. It is also important to note that
we are talking about a class of NLP tasks where you actually know what
you are looking for.
</p>
-</section>
-<section>
- <h2 class="section-title">Parsing User Input</h2>
- <p>
- One of the key objectives when parsing a user input sentence for Natural
Language Understanding (NLU) is to
- detect all possible semantic entities, a.k.a. <em>named entities</em>.
Let's consider a few examples:
- </p>
- <ul>
- <li>
- <code>"What's the current weather in Tokyo?"</code><br/>
- This sentence is fully sufficient for the processing
- since it contains the topic <code>weather</code> as well as all
necessary parameters
- like time (<code>current</code>) and location (<code>Tokyo</code>).
- </li>
- <li>
- <code>"What about Tokyo?"</code><br/>
- This is an unclear sentence since it does not have the subject of
the
- question - what is it about Tokyo?
- </li>
- <li>
- <code>"What's the weather?"</code><br/>
- This is also unclear since we are missing important parameters
- of location and time for our request.
- </li>
- </ul>
- <p>
- When they are missing, we can sometimes use default values like the current
user's location and the current time.
- However, this can easily lead to the wrong interpretation if the
conversation has an existing context.
- </p>
- <p>
- In real life, as well as in NLP-based systems, we always try to start
a conversation with a fully defined
- sentence since without a context the missing information cannot be
obtained and the sentence cannot be interpreted.
- </p>
-</section>
-<section>
- <h2 class="section-title">Semantic Entities</h2>
- <p>
- Let's take a closer look at the named entities from the above examples:
- </p>
- <ul>
- <li>
- <code>weather</code> - this is an indicator of the subject of the
conversation. Note that it indicates
- the type of question rather than being an entity with multiple
possible values.
- </li>
- <li>
- <code>current</code> - this is an entity of type <code>Date</code>
with the value of <code>now</code>.
- </li>
- <li>
- <code>Tokyo</code> - this is an entity of type
<code>Location</code> with two values:
- <ul>
- <li><code>city</code> - type of the location.</li>
- <li><code>Tokyo, Japan</code> - normalized name of the
location.</li>
- </ul>
- </li>
- </ul>
- <p>
- We have two distinct classes of entities:
- </p>
- <ul>
- <li>
- Entities that have no values and only act as indicators or types.
The entity <code>weather</code> is the
- type indicator for the subject of the user input.
- </li>
- <li>
- Entities that additionally have one or more specific values like
<code>current</code> and <code>Tokyo</code> entities.
- </li>
- </ul>
- <div class="bq success">
- <div style="display: inline-block; margin-bottom: 20px">
- <a style="margin-right: 10px" target="opennlp"
href="https://opennlp.apache.org"><img src="/images/opennlp-logo.png"
height="32px" alt=""></a>
- <a style="margin-right: 10px" target="google"
href="https://cloud.google.com/natural-language/"><img
src="/images/google-cloud-logo-small.png" height="32px" alt=""></a>
- <a style="margin-right: 10px" target="stanford"
href="https://stanfordnlp.github.io/CoreNLP"><img
src="/images/corenlp-logo.gif" height="48px" alt=""></a>
- <a style="margin-right: 10px" target="spacy"
href="https://spacy.io"><img src="/images/spacy-logo.png" height="32px"
alt=""></a>
- </div>
- <p>
- Note that NLPCraft provides <a
href="/integrations.html">support</a> for a wide variety of named entities (with
all built-in ones being properly normalized)
- including <a href="/integrations.html">integrations</a> with
- <a target="spacy" href="https://spacy.io/">spaCy</a>,
- <a target="stanford"
href="https://stanfordnlp.github.io/CoreNLP">Stanford CoreNLP</a>,
- <a target="opennlp" href="https://opennlp.apache.org/">OpenNLP</a>
and
- <a target="google"
href="https://cloud.google.com/natural-language/">Google Natural Language</a>.
- </p>
- </div>
-</section>
-<section>
- <h2 class="section-title">Incomplete Sentences</h2>
- <p>
- Assuming questions about the weather in Tokyo were previously asked (in the
span of the ongoing conversation), one
- could presumably ask the following questions using a <em>shorter,
incomplete</em> form:
- </p>
- <ul>
- <li>
- <code>"What about Kyoto?"</code><br/>
- This question is missing both the subject and the time. However, we
- can safely assume we are still talking about current weather.
- </li>
- <li>
- <code>"What about tomorrow?"</code><br/>
- Just like above we automatically assume the weather subject but
- use <code>Kyoto</code> as the location since it was mentioned last.
- </li>
- </ul>
- <p>
- These are incomplete sentences. Such short-hands cannot be
interpreted without prior context (neither
- by humans nor by machines) since by themselves they are missing
necessary information.
- In the context of the conversation, however, these incomplete
sentences work. We can simply provide one or two
- entities and rely on the <em>"listener"</em> to recall the rest of
missing information from a
- <em>short-term memory</em>, a.k.a. conversation context.
- </p>
- <p>
- In NLPCraft, the intent-matching logic will automatically try to find
missing information in the
conversation context (which is automatically maintained). Moreover, it
will properly treat such recalled
- information during weighted intent matching since it naturally has
less "weight" than something that was
- found explicitly in the user input.
- </p>
-</section>
-<section>
- <h2 class="section-title">Short-Term Memory</h2>
- <p>
- The short-term memory is exactly that... a memory that keeps only a
small amount of recently used information
- and that evicts its contents after a short period of inactivity.
- </p>
- <p>
- Let's look at an example from real life. If you called your
friend a couple of hours later asking <code>"What about a day after?"</code>
- (still talking about the weather in Kyoto) - he'll likely be thoroughly
confused. The conversation has timed out, and
- your friend has lost (forgotten) its context. You will have to explain
again to your confused friend what it is that you are asking about...
- </p>
- <p>
- NLPCraft has a simple rule that a 5-minute pause in conversation leads
to a conversation context reset. However,
- what happens if we switch the topic before this timeout elapses?
- </p>
-</section>
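As a toy illustration of this eviction behavior (a sketch only — the class and method names below are made up and are not NLPCraft's actual API), a short-term memory with an inactivity timeout could look like:

```python
import time

class ShortTermMemory:
    """Toy conversation STM: all entities are evicted after a period of
    inactivity (mirroring NLPCraft's default ~5 minute conversation reset)."""

    def __init__(self, timeout_secs=5 * 60):
        self.timeout_secs = timeout_secs
        self.entities = []
        self.last_access = time.monotonic()

    def _touch(self):
        # A pause longer than the timeout resets the conversation context.
        now = time.monotonic()
        if now - self.last_access > self.timeout_secs:
            self.entities.clear()
        self.last_access = now

    def store(self, entity):
        self._touch()
        self.entities.append(entity)

    def recall(self):
        self._touch()
        return list(self.entities)

stm = ShortTermMemory(timeout_secs=0.1)   # short timeout for the demo
stm.store("weather")
stm.store("Kyoto")
assert stm.recall() == ["weather", "Kyoto"]
time.sleep(0.2)            # the conversation pause exceeds the timeout...
assert stm.recall() == []  # ...so the context is forgotten
```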
-<section>
- <h2 class="section-title">Context Switch</h2>
- <p>
- Resetting the context on a timeout is, obviously, not a hard thing
to do. What can be trickier is to detect
- when the conversation topic is switched and the previous context needs to
be forgotten to avoid very
- confusing interpretation errors. It is uncanny how humans can detect
such a switch with seemingly no effort, and yet
- automating this task on a computer is anything but effortless...
- </p>
- <p>
- Let's continue our weather-related conversation. All of a sudden, we
ask about something completely different:
- </p>
- <ul>
- <li>
- <code>"How much is a mocha latte at Starbucks?"</code><br/>
- At this point we should forget all about previous conversation
about weather and assume going forward
- that we are talking about coffee in Starbucks.
- </li>
- <li>
- <code>"What about Peet's?"</code><br/>
- We are talking about a latte at Peet's.
- </li>
- <li>
- <code>"...and croissant?"</code><br/>
- Asking about Peet's crescent-shaped fresh rolls.
- </li>
- </ul>
+ <figure>
+ <img class="img-fluid" src="/images/how_to_find_something_fig1.png"
alt="">
+ <figcaption><b>Fig 1.</b> Named Entities</figcaption>
+ </figure>
<p>
- Despite the somewhat obvious logic, the implementation of context switching is
not an exact science. Sometimes, you
- can have a "soft" context switch where you don't change the topic of
the conversation 100% but
- enough to forget at least some parts of the previously collected
context. NLPCraft has a built-in algorithm
- to detect a hard switch in the conversation. It also exposes an API to
perform a selective reset on the
- conversation in case of a "soft" switch.
- </p>
-</section>
-<section>
- <h2 class="section-title">Overriding Entities</h2>
- <p>
- As we've seen above one named entity can replace or override an older
entity in the STM, e.g. <code>"Peet's"</code>
- replaced <code>"Starbucks"</code> in our previous questions. <b>The
actual algorithm that governs this logic is one
- of the most important parts of the STM implementation.</b> In human
conversations we perform this logic seemingly
- subconsciously — but the computer algorithm to do it is not that
trivial. Let's see how it is done in NLPCraft.
- </p>
- <p>
- One of the important supporting design decisions is that an entity can
belong to one or more groups. You can think of
- groups as types, or classes of entities (to be mathematically precise
these are the categories). The entity's
- membership in such groups is what drives the rule of overriding.
- </p>
- <p>
- Let's look at a specific example.
- </p>
- <p>
- Consider a data model that defines 3 entities:
- </p>
- <ul>
- <li>
- <code>"sell"</code> (with synonym <code>"sales"</code>)
- </li>
- <li>
- <code>"buy"</code> (with synonym <code>"purchase"</code>)
- </li>
- <li>
- <code>"best_employee"</code> (with synonyms like
<code>"best"</code>, <code>"best employee"</code>, <code>"best
colleague"</code>)
- </li>
- </ul>
- <p>
- Our task is to support the following conversation:
- </p>
- <ul>
- <li>
- <code>"Give me the sales data"</code><br/>
- We return sales information since we detected <code>"sell"</code>
entity by its synonym <code>"sales"</code>.
- </li>
- <li>
- <code>"Who was the best?"</code><br/>
- We return the best salesmen since we detected
<code>"best_employee"</code> and we should pick <code>"sell"</code> entity from
the STM.
- </li>
- <li>
- <code>"OK, give me the purchasing report now."</code><br/>
- This is a bit trickier. We should return general purchasing data
and not the best purchasing employee.
- It feels counter-intuitive but we should NOT take
<code>"best_employee"</code> entity from STM and, in fact, we should remove it
from STM.
- </li>
- <li>
- <code>"...and who's the best there?"</code><br/>
- Now, we should return the best purchasing employee. We detected
<code>"best_employee"</code> entity and we should pick <code>"buy"</code>
entity from STM.
- </li>
- <li>
- <code>"One more time - show me the general purchasing data
again"</code><br/>
- Once again, we should return a general purchasing report and
ignore (and remove) <code>"best_employee"</code> from STM.
- </li>
- </ul>
-</section>
-<section>
- <h2 class="section-title">Overriding Rule</h2>
- <p>
- Here's the rule we developed at NLPCraft and have been successfully
using in various models:
- </p>
- <div class="bq success">
- <b>Overriding Rule</b>
- <p>
- An entity will override another entity or entities in STM that
belong to the same group set or its superset.
- </p>
- </div>
- <p>
- In other words, an entity with a smaller group set (more specific one)
will override an entity with the same
- or larger group set (more generic one).
- Let's consider an entity that belongs to the following groups:
<code>{G1, G2, G3}</code>. This entity:
- </p>
- <ul>
- <li>
- WILL override existing entity belonging to <code>{G1, G2,
G3}</code> groups (same set).
- </li>
- <li>
- WILL override existing entity belonging to <code>{G1, G2, G3,
G4}</code> groups (superset).
- </li>
- <li>
- WILL NOT override existing entity belonging to <code>{G1,
G2}</code> groups.
- </li>
- <li>
- WILL NOT override existing entity belonging to <code>{G10,
G20}</code> groups.
- </li>
- </ul>
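The rule itself boils down to a subset check over group sets. A minimal sketch (a hypothetical helper for illustration, not NLPCraft's actual API):

```python
def overrides(new_groups, old_groups):
    """Overriding Rule sketch: a newly detected entity overrides an entity
    already in STM iff the old entity's group set is the same as, or a
    superset of, the new entity's group set."""
    return set(new_groups) <= set(old_groups)

new_entity_groups = {"G1", "G2", "G3"}
assert overrides(new_entity_groups, {"G1", "G2", "G3"})        # same set
assert overrides(new_entity_groups, {"G1", "G2", "G3", "G4"})  # superset
assert not overrides(new_entity_groups, {"G1", "G2"})          # smaller set
assert not overrides(new_entity_groups, {"G10", "G20"})        # disjoint set
```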
- <p>
- Let's come back to our sell/buy/best example. To interpret the
questions we've outlined above we need to
- have the following 4 intents:
- </p>
- <ul>
- <li><code>id=sale term={id=='sale'}</code></li>
- <li><code>id=best_sale_person term={id=='sale'}
term={id=='best_employee'}</code></li>
- <li><code>id=buy term={id=='buy'}</code></li>
- <li><code>id=buy_best_person term={id=='buy'}
term={id=='best_employee'}</code></li>
- </ul>
- <p>
- (this is the actual <a href="/intent-matching.html#syntax">Intent DSL</a>
used by NLPCraft -
- <code>term</code> here is basically what's often referred to as a slot
in other systems).
+ The software component responsible for finding the named entities is
called a named entity recognition
+ (NER) component. Its goal is to find a certain entity in the input
text and optionally extract additional
+ information about this entity. For example, consider the sentence
"Give me <b>twenty two</b> face masks". A numeric
+ NER component will find the numeric entity “twenty two” and will extract
the normalized integer value “22” from it,
+ which can then be used further.
</p>
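As a sketch of what such normalization involves, here is a deliberately tiny rule-based numeric extractor (an illustration only; real NER components handle far larger vocabularies and more grammar):

```python
# Minimal rule-based numeric "NER" sketch: finds the first number-word
# phrase in the text and returns it with its normalized integer value.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def extract_number(text):
    """Return (matched_phrase, value) for the first numeric phrase, or None."""
    tokens = text.lower().split()
    for i, tok in enumerate(tokens):
        if tok in TENS:
            # Compound like "twenty two" -> 22; bare "twenty" -> 20.
            if i + 1 < len(tokens) and tokens[i + 1] in UNITS:
                return (f"{tok} {tokens[i + 1]}",
                        TENS[tok] + UNITS[tokens[i + 1]])
            return (tok, TENS[tok])
        if tok in UNITS:
            return (tok, UNITS[tok])
    return None

assert extract_number("Give me twenty two face masks") == ("twenty two", 22)
```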
<p>
- We also need to properly configure groups for our entities (names of
the groups are arbitrary):
+ NER components are usually based either on neural networks trained
on extensive and well-prepared corpora,
+ or on simpler rule-based or synonym-matching
algorithms that are better suited for
+ domain-specific applications. Note that most universal NER components
tend to use some variation of
+ neural-network-based algorithms.
</p>
- <ul>
- <li>Entity <code>"sell"</code> belongs to group <b>A</b></li>
- <li>Entity <code>"buy"</code> belongs to group <b>B</b></li>
- <li>Entity <code>"best_employee"</code> belongs to groups <b>A</b> and
<b>B</b></li>
- </ul>
<p>
- Let’s run through our example again with this configuration:
- </p>
- <ul>
- <li>
- <code>"Give me the sales data"</code>
- <ul>
- <li>We detected entity from group <b>A</b>.</li>
- <li>STM is empty at this point.</li>
- <li>Return general sales report.</li>
- <li>Store <code>"sell"</code> entity with group <b>A</b> in
STM.</li>
- </ul>
- </li>
- <li>
- <code>"Who was the best?"</code>
- <ul>
- <li>We detected entity belonging to groups <b>A</b> and
<b>B</b>.</li>
- <li>STM has entity belonging to group <b>A</b>.</li>
- <li><b>{A, B}</b> does NOT override <b>{A}</b>.</li>
- <li>Return best salesmen report.</li>
- <li>Store detected <code>"best_employee"</code> entity.</li>
- <li>STM now has two entities with <b>{A}</b> and <b>{A, B}</b>
group sets.</li>
- </ul>
- </li>
- <li>
- <code>"OK, give me the purchasing report now."</code>
- <ul>
- <li>We detected <code>"buy"</code> entity with group
<b>B</b>.</li>
- <li>STM has two entities with <b>{A}</b> and <b>{A, B}</b>
group sets.</li>
- <li><b>{B}</b> overrides <b>{A,
B}</b>.</li>
- <li>Return general purchasing report.</li>
- <li>Store <code>"buy"</code> entity with group <b>B</b> in
STM.</li>
- </ul>
- </li>
- </ul>
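Following the entity-to-group configuration given earlier (sell → A, buy → B, best_employee → A and B) and the Overriding Rule, this walkthrough can be simulated with a short sketch (hypothetical helper names, not NLPCraft's actual API):

```python
def overrides(new_groups, old_groups):
    # Overriding Rule: the new entity overrides an STM entity whose group
    # set is the same as, or a superset of, the new entity's group set.
    return set(new_groups) <= set(old_groups)

def store(stm, entity, groups):
    # Evict every STM entity that the new one overrides, then remember it.
    stm[:] = [(e, g) for (e, g) in stm if not overrides(groups, g)]
    stm.append((entity, groups))

stm = []
store(stm, "sell", {"A"})                # "Give me the sales data"
store(stm, "best_employee", {"A", "B"})  # "Who was the best?"
# {A, B} does not override {A}, so both entities are kept.
assert [e for e, _ in stm] == ["sell", "best_employee"]
store(stm, "buy", {"B"})                 # "OK, give me the purchasing report now."
# {B} overrides {A, B}: "best_employee" is evicted from STM.
assert [e for e, _ in stm] == ["sell", "buy"]
```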
- <p>
- And so on... easy, huh 😇 In fact, the logic is indeed relatively
straightforward. It also follows
- common sense: the behavior produced by this rule matches
expected human behavior.
+ The rest of this blog will concentrate on a brief review of existing
popular products and open source projects
+ that provide NER components. We will also look at what the Apache NLPCraft
project brings to the table on this
+ topic. Note that this review is far from an exhaustive analysis
of these libraries but rather a quick
+ overview of their pros and cons as of the end of 2020.
</p>
</section>
<section>
- <h2 class="section-title">Explicit Context Switch</h2>
+ <h2 class="section-title">NER Providers</h2>
<p>
- In some cases you may need to explicitly clear the conversation STM
without relying on algorithmic behavior.
- It happens when the current and new topics of the conversation share some
of the same entities. Although at first
- it sounds counter-intuitive, there are many examples of this in day-to-day
life.
- </p>
- <p>
- Let’s look at this sample conversation:
- </p>
- <ul>
- <li>
- <b>Q</b>: <code>"What's the weather in Tokyo?"</code><br/>
- <b>A</b>: Current weather in Tokyo...
- </li>
- <li>
- <b>Q</b>: <code>"Let’s do New York after all then!"</code><br/>
- <b>A</b>: Without an explicit conversation reset we would return
current New York weather 🤔
- </li>
- </ul>
- <p>
- The second question was about going to New York (booking tickets,
etc.). In real life - your
- counterpart will likely ask what you mean by "doing New York after
all" and you’ll have to explain
- the abrupt change in the topic.
- You can avoid this confusion by simply saying: "Enough about the weather!
Let’s talk about this weekend's plans" - after
- which the second question becomes clear. That sentence is an explicit
context switch which you can also detect
- in the NLPCraft model.
- </p>
- <p>
- In NLPCraft you can also explicitly reset the conversation context through the
API or by switching the model on the request.
+ Let's take a look at several well-known NLP libraries that provide
built-in NER components.
</p>
+ <h3 class="section-sub-title">Apache OpenNLP</h3>
+ <h3 class="section-sub-title">Stanford NLP</h3>
+ <h3 class="section-sub-title">Google Language API</h3>
+ <h3 class="section-sub-title">spaCy</h3>
</section>
-<section>
- <h2 class="section-title">Summary</h2>
- <p>
- Let’s collect all our thoughts on STM into a few bullet points:
- </p>
- <ul>
- <li>
- Missing entities in incomplete sentences can be auto-recalled from
STM.
- </li>
- <li>
- A newly detected type/category entity likely indicates a
change of topic.
- </li>
- <li>
- The key property of STM is its short-term storage and overriding
rule.
- </li>
- <li>
- The explicit context switch is an important mechanism.
- </li>
- </ul>
- <div class="bq info">
- <b>
- Short-Term Memory
- </b>
- <p>
- It is uncanny how a properly implemented STM can make a conversational
interface <b>feel like a normal human
- conversation</b>. It allows minimizing the amount of parasitic
dialogs and Q&A-driven interfaces
- without unnecessarily complicating the implementation of such
systems.
- </p>
- </div>
-</section>
+
diff --git a/images/how_to_find_something_fig1.png
b/images/how_to_find_something_fig1.png
new file mode 100644
index 0000000..39c29ad
Binary files /dev/null and b/images/how_to_find_something_fig1.png differ