This is an automated email from the ASF dual-hosted git repository.

aradzinski pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-nlpcraft-website.git


The following commit(s) were added to refs/heads/master by this push:
     new b8f78aa  WIP.
b8f78aa is described below

commit b8f78aa4fcbf2e78ff62207ef1b2a691fe97e514
Author: Aaron Radzinski <[email protected]>
AuthorDate: Sun Jan 17 20:09:19 2021 -0800

    WIP.
---
 blogs/how_to_find_something_in_the_text.html | 390 ++-------------------------
 images/how_to_find_something_fig1.png        | Bin 0 -> 35552 bytes
 2 files changed, 24 insertions(+), 366 deletions(-)

diff --git a/blogs/how_to_find_something_in_the_text.html 
b/blogs/how_to_find_something_in_the_text.html
index dea6d49..9644443 100644
--- a/blogs/how_to_find_something_in_the_text.html
+++ b/blogs/how_to_find_something_in_the_text.html
@@ -37,382 +37,40 @@ publish_date: January 20, 2021
         such as dates, countries and cities, as well as entities that are domain-specific for your model. It is also important to note that
         we are talking about a class of NLP tasks where you actually know what you are looking for.
     </p>
-</section>
-<section>
-    <h2 class="section-title">Parsing User Input</h2>
-    <p>
-        One of the key objectives when parsing a user input sentence for Natural Language Understanding (NLU) is to
-        detect all possible semantic entities, a.k.a. <em>named entities</em>. Let's consider a few examples:
-    </p>
-    <ul>
-        <li>
-            <code>"What's the current weather in Tokyo?"</code><br/>
-            This sentence is fully sufficient for processing
-            since it contains the topic <code>weather</code> as well as all the necessary parameters
-            like time (<code>current</code>) and location (<code>Tokyo</code>).
-        </li>
-        <li>
-            <code>"What about Tokyo?"</code><br/>
-            This is an unclear sentence since it does not have the subject of 
the
-            question - what is it about Tokyo?
-        </li>
-        <li>
-            <code>"What's the weather?"</code><br/>
-            This is also unclear since we are missing important parameters
-            of location and time for our request.
-        </li>
-    </ul>
-    <p>
-        Sometimes we can use default values like the current user's location 
and the current time (if they are missing).
-        However, this can easily lead to the wrong interpretation if the 
conversation has an existing context.
-    </p>
-    <p>
-        In real life, as well as in NLP-based systems, we always try to start a conversation with a fully defined
-        sentence since without a context the missing information cannot be obtained and the sentence cannot be interpreted.
-    </p>
-</section>
-<section>
-    <h2 class="section-title">Semantic Entities</h2>
-    <p>
-        Let's take a closer look at the named entities from the above examples:
-    </p>
-    <ul>
-        <li>
-            <code>weather</code> - this is an indicator of the subject of the 
conversation. Note that it indicates
-            the type of question rather than being an entity with multiple 
possible values.
-        </li>
-        <li>
-            <code>current</code> - this is an entity of type <code>Date</code> 
with the value of <code>now</code>.
-        </li>
-        <li>
-            <code>Tokyo</code> - this is an entity of type 
<code>Location</code> with two values:
-            <ul>
-                <li><code>city</code> - type of the location.</li>
-                <li><code>Tokyo, Japan</code> - normalized name of the 
location.</li>
-            </ul>
-        </li>
-    </ul>
-    <p>
-        We have two distinct classes of entities:
-    </p>
-    <ul>
-        <li>
-            Entities that have no values and only act as indicators or types. 
The entity <code>weather</code> is the
-            type indicator for the subject of the user input.
-        </li>
-        <li>
-            Entities that additionally have one or more specific values like 
<code>current</code> and <code>Tokyo</code> entities.
-        </li>
-    </ul>
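-    <p>
-        To make this distinction concrete, here is a minimal sketch (plain Python, not NLPCraft's actual data
-        model) of how these detected entities could be represented:
-    </p>
-    <pre><code>
-from dataclasses import dataclass, field
-
-@dataclass
-class Entity:
-    # A detected named entity: a type indicator plus optional normalized values.
-    type: str
-    values: dict = field(default_factory=dict)
-
-# "weather" only indicates the subject of the question - it carries no values.
-weather = Entity(type="weather")
-
-# "current" is a Date entity normalized to "now".
-current = Entity(type="Date", values={"value": "now"})
-
-# "Tokyo" is a Location entity with a location kind and a normalized name.
-tokyo = Entity(type="Location", values={"kind": "city", "name": "Tokyo, Japan"})
-    </code></pre>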
-    <div class="bq success">
-        <div style="display: inline-block; margin-bottom: 20px">
-            <a style="margin-right: 10px" target="opennlp" 
href="https://opennlp.apache.org";><img src="/images/opennlp-logo.png" 
height="32px" alt=""></a>
-            <a style="margin-right: 10px" target="google" 
href="https://cloud.google.com/natural-language/";><img 
src="/images/google-cloud-logo-small.png" height="32px" alt=""></a>
-            <a style="margin-right: 10px" target="stanford" 
href="https://stanfordnlp.github.io/CoreNLP";><img 
src="/images/corenlp-logo.gif" height="48px" alt=""></a>
-            <a style="margin-right: 10px" target="spacy" 
href="https://spacy.io";><img src="/images/spacy-logo.png" height="32px" 
alt=""></a>
-        </div>
-        <p>
-            Note that NLPCraft provides <a href="/integrations.html">support</a> for a wide variety of named entities (with all built-in ones being properly normalized)
-            including <a href="/integrations.html">integrations</a> with
-            <a target="spacy" href="https://spacy.io/";>spaCy</a>,
-            <a target="stanford" 
href="https://stanfordnlp.github.io/CoreNLP";>Stanford CoreNLP</a>,
-            <a target="opennlp" href="https://opennlp.apache.org/";>OpenNLP</a> 
and
-            <a target="google" 
href="https://cloud.google.com/natural-language/";>Google Natural Language</a>.
-        </p>
-    </div>
-</section>
-<section>
-    <h2 class="section-title">Incomplete Sentences</h2>
-    <p>
-        Assuming we previously asked questions about the weather in Tokyo (within the span of the ongoing conversation), one
-        could presumably ask the following questions using a <em>shorter, incomplete</em> form:
-    </p>
-    <ul>
-        <li>
-            <code>"What about Kyoto?</code><br/>
-            This question is missing both the subject and the time. However, we
-            can safely assume we are still talking about current weather.
-        </li>
-        <li>
-            <code>"What about tomorrow?"</code><br/>
-            Just like above, we automatically assume the weather subject but
-            use <code>Kyoto</code> as the location since it was mentioned last.
-        </li>
-    </ul>
-    <p>
-        These are incomplete sentences. This type of shorthand cannot be interpreted without prior context (neither
-        by humans nor by machines) since by itself it is missing the necessary information.
-        In the context of the conversation, however, these incomplete sentences work. We can simply provide one or two
-        entities and rely on the <em>"listener"</em> to recall the rest of the missing information from a
-        <em>short-term memory</em>, a.k.a. conversation context.
-    </p>
-    <p>
-        In NLPCraft, the intent-matching logic will automatically try to find 
missing information in the
-        conversation context (that is automatically maintained). Moreover, it 
will properly treat such recalled
-        information during weighted intent matching since it naturally has 
less "weight" than something that was
-        found explicitly in the user input.
-    </p>
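-    <p>
-        The idea of giving recalled entities less "weight" can be sketched as follows (the weights here are
-        hypothetical and this is not NLPCraft's actual scoring logic):
-    </p>
-    <pre><code>
-EXPLICIT_WEIGHT = 1.0   # hypothetical weight for entities found in the user input
-RECALLED_WEIGHT = 0.5   # hypothetical, lower weight for entities recalled from STM
-
-def resolve(required_types, found, stm):
-    # Fill the required entity types from the input first, then from the STM.
-    score, resolved = 0.0, {}
-    for ent_type in required_types:
-        if ent_type in found:
-            resolved[ent_type] = found[ent_type]
-            score += EXPLICIT_WEIGHT
-        elif ent_type in stm:
-            resolved[ent_type] = stm[ent_type]
-            score += RECALLED_WEIGHT
-    return resolved, score
-
-# "What about tomorrow?" - only the date is explicit, subject and location come from the STM.
-stm = {"subject": "weather", "location": "Kyoto, Japan"}
-found = {"date": "tomorrow"}
-print(resolve(["subject", "location", "date"], found, stm))
-    </code></pre>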
-</section>
-<section>
-    <h2 class="section-title">Short-Term Memory</h2>
-    <p>
-        The short-term memory is exactly that... a memory that keeps only a small amount of recently used information
-        and that evicts its contents after a short period of inactivity.
-    </p>
-    <p>
-        Let's look at an example from real life. If you called your friend a couple of hours later asking <code>"What about a day after?"</code>
-        (still talking about the weather in Kyoto) - he'll likely be thoroughly confused. The conversation has timed out, and
-        your friend has lost (forgotten) its context. You will have to explain again to your confused friend what it is that you are asking about...
-    </p>
-    <p>
-        NLPCraft has a simple rule: a 5-minute pause in the conversation leads to a conversation context reset. However,
-        what happens if we switch the topic before this timeout elapses?
-    </p>
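-    <p>
-        As an illustration, a toy version of such a short-term memory with an inactivity timeout could look like
-        this (a plain Python sketch, not NLPCraft's implementation):
-    </p>
-    <pre><code>
-import time
-
-class ShortTermMemory:
-    # Toy conversation STM: its contents are dropped after 5 minutes of inactivity.
-    def __init__(self, timeout_secs=5 * 60):
-        self.timeout_secs = timeout_secs
-        self.entities = []
-        self.last_access = time.monotonic()
-
-    def _evict_if_stale(self):
-        # Reset the whole context if the conversation has been silent for too long.
-        if time.monotonic() - self.last_access > self.timeout_secs:
-            self.entities.clear()
-        self.last_access = time.monotonic()
-
-    def remember(self, entity):
-        self._evict_if_stale()
-        self.entities.append(entity)
-
-    def recall(self):
-        self._evict_if_stale()
-        return list(self.entities)
-
-stm = ShortTermMemory()
-stm.remember({"type": "Location", "name": "Kyoto, Japan"})
-    </code></pre>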
-</section>
-<section>
-    <h2 class="section-title">Context Switch</h2>
-    <p>
-        Resetting the context by a timeout is, obviously, not a hard thing to do. What can be trickier is to detect
-        when the conversation topic is switched and the previous context needs to be forgotten to avoid very
-        confusing interpretation errors. It is uncanny how humans can detect such a switch with seemingly no effort, and yet
-        automating this task on a computer is anything but effortless...
-    </p>
-    <p>
-        Let's continue our weather-related conversation. All of a sudden, we 
ask about something completely different:
-    </p>
-    <ul>
-        <li>
-            <code>"How much mocha latter at Starbucks?"</code><br/>
-            At this point we should forget all about previous conversation 
about weather and assume going forward
-            that we are talking about coffee in Starbucks.
-        </li>
-        <li>
-            <code>"What about Peet's?"</code><br/>
-            We are now talking about a latte at Peet's.
-        </li>
-        <li>
-            <code>"...and croissant?"</code><br/>
-            Asking about Peet's crescent-shaped fresh rolls.
-        </li>
-    </ul>
+    <figure>
+        <img class="img-fluid" src="/images/how_to_find_something_fig1.png" 
alt="">
+        <figcaption><b>Fig 1.</b> Named Entities</figcaption>
+    </figure>
     <p>
-        Despite the somewhat obvious logic, the implementation of context switching is not an exact science. Sometimes you
-        can have a "soft" context switch where you don't change the topic of the conversation 100% but change it sufficiently
-        to forget at least some parts of the previously collected context. NLPCraft has a built-in algorithm
-        to detect the hard switch in the conversation. It also exposes an API to perform a selective reset on the
-        conversation in case of a "soft" switch.
-    </p>
-</section>
-<section>
-    <h2 class="section-title">Overriding Entities</h2>
-    <p>
-        As we've seen above, one named entity can replace or override an older entity in the STM, e.g. <code>"Peet's"</code>
-        replaced <code>"Starbucks"</code> in our previous questions. <b>The actual algorithm that governs this logic is one
-        of the most important parts of the STM implementation.</b> In human conversations we perform this logic seemingly
-        subconsciously — but the computer algorithm to do it is not that trivial. Let's see how it is done in NLPCraft.
-    </p>
-    <p>
-        One of the important supporting design decisions is that an entity can belong to one or more groups. You can think of
-        groups as types, or classes, of entities (to be mathematically precise, these are categories). The entity's
-        membership in such groups is what drives the overriding rule.
-    </p>
-    <p>
-        Let's look at a specific example.
-    </p>
-    <p>
-        Consider a data model that defines 3 entities:
-    </p>
-    <ul>
-        <li>
-            <code>"sell"</code> (with synonym <code>"sales"</code>)
-        </li>
-        <li>
-            <code>"buy"</code> (with synonym <code>"purchase"</code>)
-        </li>
-        <li>
-            <code>"best_employee"</code> (with synonyms like 
<code>"best"</code>, <code>"best employee"</code>, <code>"best 
colleague"</code>)
-        </li>
-    </ul>
-    <p>
-        Our task is to support the following conversation:
-    </p>
-    <ul>
-        <li>
-            <code>"Give me the sales data"</code><br/>
-            We return sales information since we detected <code>"sell"</code> 
entity by its synonym <code>"sales"</code>.
-        </li>
-        <li>
-            <code>"Who was the best?"</code><br/>
-            We return the best salesmen since we detected 
<code>"best_employee"</code> and we should pick <code>"sell"</code> entity from 
the STM.
-        </li>
-        <li>
-            <code>"OK, give me the purchasing report now."</code><br/>
-            This is a bit trickier. We should return general purchasing data and not the best purchasing employee.
-            It feels counter-intuitive, but we should NOT take the <code>"best_employee"</code> entity from the STM and, in fact, we should remove it from the STM.
-        </li>
-        <li>
-            <code>"...and who's the best there?"</code><br/>
-            Now, we should return the best purchasing employee. We detected 
<code>"best_employee"</code> entity and we should pick <code>"buy"</code> 
entity from STM.
-        </li>
-        <li>
-            <code>"One more time - show me the general purchasing data 
again"</code><br/>
-            Once again, we should return a general purchasing report and 
ignore (and remove) <code>"best_employee"</code> from STM.
-        </li>
-    </ul>
-</section>
-<section>
-    <h2 class="section-title">Overriding Rule</h2>
-    <p>
-        Here's the rule we developed at NLPCraft and have been successfully 
using in various models:
-    </p>
-    <div class="bq success">
-        <b>Overriding Rule</b>
-        <p>
-            The entity will override another entity or entities in the STM that belong to the same group set or its superset.
-        </p>
-    </div>
-    <p>
-        In other words, an entity with a smaller group set (a more specific one) will override an entity with the same
-        or a larger group set (a more generic one).
-        Let's consider an entity that belongs to the following groups: <code>{G1, G2, G3}</code>. This entity:
-    </p>
-    <ul>
-        <li>
-            WILL override an existing entity belonging to the <code>{G1, G2, G3}</code> groups (same set).
-        </li>
-        <li>
-            WILL override an existing entity belonging to the <code>{G1, G2, G3, G4}</code> groups (superset).
-        </li>
-        <li>
-            WILL NOT override an existing entity belonging to the <code>{G1, G2}</code> groups.
-        </li>
-        <li>
-            WILL NOT override an existing entity belonging to the <code>{G10, G20}</code> groups.
-        </li>
-    </ul>
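-    <p>
-        In code, the rule itself boils down to a subset check. A minimal Python sketch (not NLPCraft's actual
-        implementation) that mirrors the four cases above:
-    </p>
-    <pre><code>
-def overrides(new_groups, old_groups):
-    # The new entity overrides an old STM entity whose group set is a superset
-    # of (or the same set as) the new entity's group set.
-    return set(old_groups).issuperset(new_groups)
-
-new = {"G1", "G2", "G3"}
-assert overrides(new, {"G1", "G2", "G3"})        # same set - overridden
-assert overrides(new, {"G1", "G2", "G3", "G4"})  # superset - overridden
-assert not overrides(new, {"G1", "G2"})          # smaller set - kept
-assert not overrides(new, {"G10", "G20"})        # unrelated groups - kept
-    </code></pre>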
-    <p>
-        Let's come back to our sell/buy/best example. To interpret the 
questions we've outlined above we need to
-        have the following 4 intents:
-    </p>
-    <ul>
-        <li><code>id=sale term={id=='sell'}</code></li>
-        <li><code>id=best_sale_person term={id=='sell'} term={id=='best_employee'}</code></li>
-        <li><code>id=buy term={id=='buy'}</code></li>
-        <li><code>id=buy_best_person term={id=='buy'} term={id=='best_employee'}</code></li>
-    </ul>
-    <p>
-        (this is the actual <a href="/intent-matching.html#syntax">Intent DSL</a> used by NLPCraft -
-        <code>term</code> here is basically what's often referred to as a slot in other systems).
+        The software component responsible for finding the named entities is called a named entity recognition
+        (NER) component. Its goal is to find a certain entity in the input text and optionally extract additional
+        information about this entity. For example, consider the sentence "Give me <b>twenty two</b> face masks". A numeric
+        NER component will find the numeric entity "twenty two" and extract the normalized integer value "22" from it,
+        which can then be used further.
     </p>
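+    <p>
+        For illustration, here is what this detection step looks like with spaCy (one of the libraries reviewed
+        below). Note that a general-purpose English model typically finds the number but leaves the normalization
+        to the integer 22 as a separate step:
+    </p>
+    <pre><code>
+import spacy
+
+# Assumes the spaCy package and its small English model are installed
+# (python -m spacy download en_core_web_sm).
+nlp = spacy.load("en_core_web_sm")
+doc = nlp("Give me twenty two face masks")
+
+for ent in doc.ents:
+    # A general English model typically labels "twenty two" as CARDINAL;
+    # converting it to the integer 22 is a separate normalization step.
+    print(ent.text, ent.label_)
+    </code></pre>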
     <p>
-        We also need to properly configure groups for our entities (names of 
the groups are arbitrary):
+        NER components are usually based either on neural networks that are trained on extensive and well-prepared
+        corpora, or on simpler rule-based or synonym-matching algorithms that are better suited for domain-specific
+        applications. Note that most universal NER components tend to use some variation of neural-network-based
+        algorithms.
     </p>
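+    <p>
+        For contrast, a domain-specific synonym matcher can be as simple as a dictionary lookup. Below is a
+        deliberately naive Python sketch of the idea (the synonyms and element ids are made up for this example):
+    </p>
+    <pre><code>
+# Deliberately naive, domain-specific synonym matcher - no neural networks involved.
+SYNONYMS = {
+    "weather": "weather",
+    "forecast": "weather",
+    "temperature": "weather",
+    "tokyo": "city:tokyo",
+    "kyoto": "city:kyoto",
+}
+
+def find_entities(text):
+    # Return the ids of all elements whose synonyms occur in the lowercased input.
+    text = text.lower()
+    return sorted({elem_id for syn, elem_id in SYNONYMS.items() if syn in text})
+
+print(find_entities("What's the current weather in Tokyo?"))  # ['city:tokyo', 'weather']
+    </code></pre>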
-    <ul>
-        <li>Entity <code>"sell"</code> belongs to group <b>A</b></li>
-        <li>Entity <code>"buy"</code> belongs to group <b>B</b></li>
-        <li>Entity <code>"best_employee"</code> belongs to groups <b>A</b> and 
<b>B</b></li>
-    </ul>
     <p>
-        Let’s run through our example again with this configuration:
-    </p>
-    <ul>
-        <li>
-            <code>"Give me the sales data"</code>
-            <ul>
-                <li>We detected an entity from group <b>A</b>.</li>
-                <li>The STM is empty at this point.</li>
-                <li>Return the general sales report.</li>
-                <li>Store the <code>"sell"</code> entity with group <b>A</b> in STM.</li>
-            </ul>
-        </li>
-        <li>
-            <code>"Who was the best?"</code>
-            <ul>
-                <li>We detected an entity belonging to groups <b>A</b> and <b>B</b>.</li>
-                <li>The STM has an entity belonging to group <b>A</b>.</li>
-                <li><b>{A, B}</b> does NOT override <b>{A}</b>.</li>
-                <li>Return the best salesmen report.</li>
-                <li>Store the detected <code>"best_employee"</code> entity.</li>
-                <li>The STM now has two entities with <b>{A}</b> and <b>{A, B}</b> group sets.</li>
-            </ul>
-        </li>
-        <li>
-            <code>"OK, give me the purchasing report now."</code>
-            <ul>
-                <li>We detected <code>"buy"</code> entity with group 
<b>A</b>.</li>
-                <li>STM has two entities with <b>{A}</b> and <b>{A, B}</b> 
group sets.</li>
-                <li><b>{A}</b> overrides both <b>{A}</b> and <b>{A, 
B}</b>.</li>
-                <li>Return general purchasing report.</li>
-                <li>Store <code>"buy"</code> entity with group <b>A</b> in 
STM.</li>
-            </ul>
-        </li>
-    </ul>
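-    <p>
-        The whole walkthrough above can be simulated in a few lines of Python (a sketch of the overriding rule
-        only, not NLPCraft's code):
-    </p>
-    <pre><code>
-# Group configuration from above: names of the groups are arbitrary.
-GROUPS = {"sell": {"A"}, "buy": {"B"}, "best_employee": {"A", "B"}}
-
-def update_stm(stm, entity_id):
-    # Apply the overriding rule: drop STM entities whose group set is a superset
-    # of (or equal to) the new entity's group set, then store the new entity.
-    new_groups = GROUPS[entity_id]
-    kept = [e for e in stm if not GROUPS[e].issuperset(new_groups)]
-    return kept + [entity_id]
-
-stm = []
-stm = update_stm(stm, "sell")           # "Give me the sales data"
-print(stm)                              # ['sell']
-stm = update_stm(stm, "best_employee")  # "Who was the best?"
-print(stm)                              # ['sell', 'best_employee']
-stm = update_stm(stm, "buy")            # "OK, give me the purchasing report now."
-print(stm)                              # ['sell', 'buy'] - 'best_employee' was overridden
-    </code></pre>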
-    <p>
-        And so on... easy, huh 😇 In fact, the logic is indeed relatively straightforward. It also follows
-        common sense: the behavior produced by this rule matches the expected human behavior.
+        The rest of this blog will concentrate on a brief review of existing popular products and open source projects
+        that provide NER components. We will also look at what the Apache NLPCraft project brings to the table on this
+        topic. Note that this review is far from an exhaustive analysis of these libraries but rather a quick
+        overview of their pros and cons as of the end of 2020.
     </p>
 </section>
 <section>
-    <h2 class="section-title">Explicit Context Switch</h2>
+    <h2 class="section-title">NER Providers</h2>
     <p>
-        In some cases you may need to explicitly clear the conversation STM without relying on algorithmic behavior.
-        It happens when the current and the new topic of the conversation share some of the same entities. Although at first
-        it sounds counter-intuitive, there are many examples of that in day-to-day life.
-    </p>
-    <p>
-        Let’s look at this sample conversation:
-    </p>
-    <ul>
-        <li>
-            <b>Q</b>: <code>"What the weather in Tokyo?"</code><br/>
-            <b>A</b>: Current weather in Tokyo...
-        </li>
-        <li>
-            <b>Q</b>: <code>"Let’s do New York after all then!"</code><br/>
-            <b>A</b>: Without an explicit conversation reset we would return the current New York weather 🤔
-        </li>
-    </ul>
-    <p>
-        The second question was about going to New York (booking tickets, etc.). In real life, your
-        counterpart will likely ask what you mean by "doing New York after all" and you’ll have to explain
-        the abrupt change in the topic.
-        You can avoid this confusion by simply saying: "Enough about the weather! Let’s talk about this weekend's plans" - after
-        which the second question becomes clear. That sentence is an explicit context switch which you can also detect
-        in the NLPCraft model.
-    </p>
-    <p>
-        In NLPCraft you can also explicitly reset the conversation context through the API or by switching the model on the request.
+        Let's take a look at several well-known NLP libraries that provide 
built-in NER components.
     </p>
+    <h3 class="section-sub-title">Apache OpenNLP</h3>
+    <h3 class="section-sub-title">Stanford NLP</h3>
+    <h3 class="section-sub-title">Google Language API</h3>
+    <h3 class="section-sub-title">spacy</h3>
 </section>
-<section>
-    <h2 class="section-title">Summary</h2>
-    <p>
-        Let’s collect all our thoughts on STM into a few bullet points:
-    </p>
-    <ul>
-        <li>
-            Missing entities in incomplete sentences can be auto-recalled from 
STM.
-        </li>
-        <li>
-            A newly detected type/category entity likely indicates a change of topic.
-        </li>
-        <li>
-            The key properties of the STM are its short-term storage and the overriding rule.
-        </li>
-        <li>
-            The explicit context switch is an important mechanism.
-        </li>
-    </ul>
-    <div class="bq info">
-        <b>
-            Short-Term Memory
-        </b>
-        <p>
-            It is uncanny how a properly implemented STM can make a conversational interface <b>feel like a normal human
-            conversation</b>. It minimizes the amount of parasitic dialogs and Q&A driven interfaces
-            without unnecessarily complicating the implementation of such systems.
-        </p>
-    </div>
-</section>
+
 
 
diff --git a/images/how_to_find_something_fig1.png 
b/images/how_to_find_something_fig1.png
new file mode 100644
index 0000000..39c29ad
Binary files /dev/null and b/images/how_to_find_something_fig1.png differ
