Revision: 17543
          http://sourceforge.net/p/gate/code/17543
Author:   adamfunk
Date:     2014-03-05 13:24:25 +0000 (Wed, 05 Mar 2014)
Log Message:
-----------
Tie up a few loose ends in TR.

Modified Paths:
--------------
    userguide/trunk/misc-creole.tex

Modified: userguide/trunk/misc-creole.tex
===================================================================
--- userguide/trunk/misc-creole.tex     2014-03-05 10:36:35 UTC (rev 17542)
+++ userguide/trunk/misc-creole.tex     2014-03-05 13:24:25 UTC (rev 17543)
@@ -3256,11 +3256,11 @@
 we are now including it in GATE as a response to frequent requests from GATE
 users who have read publications related to those projects.
 
-The easiest way to test TermRaider is to populate a corpus with related
-documents, load the sample
-application (\texttt{plugins/TermRaider/applications/termraider-eng.gapp}),
-and run it.  This application will process the documents and create instances of
-three termbank language resources with sensible parameters.
+The easiest way to try TermRaider is to populate a corpus with related
+documents, load the sample application
+(\texttt{plugins/TermRaider/applications/termraider-eng.gapp}), and run it.
+This application will process the documents and create instances of three
+termbank language resources with sensible parameters.
 
 All the language resources in TermRaider are serializable and can be stored in
 GATE datastores.
@@ -3303,19 +3303,21 @@
 
 
 The \texttt{Term} class is defined in terms of the term string itself, the
-language code, and the annotation type, so it is possible to distinguish
-\emph{affect}(\emph{english},\emph{Noun}) from
-\emph{affect}(\emph{english},\emph{Verb}), and
-\emph{gift}(\emph{english},\emph{Noun}) from
-\emph{gift}(\emph{german},\emph{Noun}).
+language code, and the annotation type, so it is possible (after preprocessing
+the documents properly) to distinguish \emph{affect}(\emph{english}, \emph{Noun})
+from \emph{affect}(\emph{english}, \emph{Verb}), and
+\emph{gift}(\emph{english}, \emph{Noun}) from
+\emph{gift}(\emph{german}, \emph{Noun}).
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{DocumentFrequencyBank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 This termbank counts the number of documents in which each term is found, and is
 used primarily as input to the TfIdf Termbank.  Document frequency can thus be
 determined from a reference corpus in advance and used in subsequent calcuations
-of tf.idf over other corpora.
+of tf.idf over other corpora.  This type of termbank has only the principal
+score type.
 
+
 A document frequency bank can be constructed from one or more corpora, from one
 or more existing document frequency banks, or from a combination of both, so
 that document frequency counts from different sources can be compiled together.
@@ -3327,11 +3329,14 @@
 \end{itemize}
 
 
-This type of termbank has only the principal score type.  When a TfIdf Termbank
-queries this kind for the reference document frequency, two terms are considered
-a match if both have the same language code or if either has an empty language
-code (in case some applications have been run without language identification
-PRs).
+
+When a TfIdf Termbank queries this type of termbank for the reference document
+frequency, it asks for a strictly matching term (same string, language code, and
+annotation type), but if that is not found, a lax match is used (if the
+requested term or the matching term has an empty language code---in case some
+applications have been run without language identification PRs).  If the term is
+not in the DocumentFrequencyBank at all, 0 is returned.  (The idf calculation,
+described in the next section, has $+1$ terms to prevent division by zero.)
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{TfIdf Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3348,11 +3353,18 @@
   following options for adjusting inverted document frequency (all adjusted to
   prevent division by zero):
   \begin{itemize}
-  \item \emph{LogarithmicScaled}: $\mathit{idf}=\log_{2}\frac{n}{1+\mathit{df}}$;
-  \item \emph{Logarithmic}: $\mathit{idf}=\log_{2}\frac{1}{1+\mathit{df}}$;
-  \item \emph{Scaled}: $\mathit{idf}=\frac{1+n}{1+\mathit{df}}$;
-  \item \emph{Natural}: $\mathit{idf}=\frac{1}{1+\mathit{df}}$.
+  \item \emph{LogarithmicScaled}: $\mathit{idf}=\log_{2}\frac{n}{\mathit{df}+1}$;
+  \item \emph{Logarithmic}: $\mathit{idf}=\log_{2}\frac{1}{\mathit{df}+1}$;
+  \item \emph{Scaled}: $\mathit{idf}=\frac{n+1}{\mathit{df}+1}$;
+  \item \emph{Natural}: $\mathit{idf}=\frac{1}{\mathit{df}+1}$.
   \end{itemize}
+\item \textbf{tfCalculation}: an enum (pull-down) with the following options for
+  adjusting term frequency:
+  \begin{itemize}
+  \item \emph{Natural}: $\mathit{atf}=\mathit{tf}$;
+  \item \emph{Sqrt}: $\mathit{atf}=\sqrt{\mathit{tf}}$;
+  \item \emph{Logarithmic}: $\mathit{atf}=1+\log_{2} \mathit{tf}$.
+  \end{itemize}
 \item \textbf{normalization}: an enum (pull-down) with the following options for
   normalizing the raw score $s$, where $s=\mathit{atf}\times\mathit{idf}$:
   \begin{itemize}
@@ -3362,13 +3374,6 @@
     monotonically to values in the 0--100 range, so that $0{\rightarrow}0$ and
     ${\infty}{\rightarrow}100$).
   \end{itemize}
-\item \textbf{tfCalculation}: an enum (pull-down) with the following options for
-  adjusting term frequency:
-  \begin{itemize}
-  \item \emph{Natural}: $\mathit{atf}=\mathit{tf}$;
-  \item \emph{Sqrt}: $\mathit{atf}=\sqrt{\mathit{tf}}$;
-  \item \emph{Logarithmic}: $\mathit{atf}=1+\log_{2} \mathit{tf}$.
-  \end{itemize}
 \end{itemize}
 %%
 For the calculations above, $\mathit{tf}$ is the term frequency (number of
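
Note (not part of the commit): the strict-then-lax lookup that the new
DocumentFrequencyBank text describes can be sketched roughly as follows. This
is a minimal illustration in Python, not TermRaider's Java code. The names
Term and reference_df are hypothetical. It assumes that a lax match still
requires the same string and annotation type, and that the first lax
candidate found is used; the commit text specifies neither point.

```python
from typing import Dict, NamedTuple


class Term(NamedTuple):
    """Hypothetical stand-in for a term: string, language code, annotation type."""
    string: str
    language: str
    annotation_type: str


def reference_df(bank: Dict[Term, int], term: Term) -> int:
    """Return the reference document frequency for term, or 0 if absent."""
    # Strict match: same string, language code, and annotation type.
    if term in bank:
        return bank[term]
    # Lax match (assumption: only the language code is relaxed, and only
    # when the requested or the stored term has an empty language code).
    for candidate, df in bank.items():
        same_rest = (candidate.string == term.string
                     and candidate.annotation_type == term.annotation_type)
        if same_rest and (term.language == "" or candidate.language == ""):
            return df
    # Term not in the bank at all.
    return 0
```

Returning 0 for an unknown term is safe in this sketch because every idf
option in the diff divides by df+1 rather than df.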

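Note (not part of the commit): the idfCalculation and tfCalculation options
listed in the diff can be checked numerically with a short sketch like the one
below. The function names idf, atf, and raw_score are hypothetical and are not
TermRaider's API. The sketch assumes n is the number of documents in the
reference corpus, which the truncated hunk does not define, and it leaves out
the normalization step.

```python
import math


def idf(df: int, n: int, mode: str = "LogarithmicScaled") -> float:
    """Adjusted inverse document frequency; df+1 prevents division by zero."""
    if mode == "LogarithmicScaled":
        return math.log2(n / (df + 1))
    if mode == "Logarithmic":
        return math.log2(1 / (df + 1))
    if mode == "Scaled":
        return (n + 1) / (df + 1)
    if mode == "Natural":
        return 1 / (df + 1)
    raise ValueError(f"unknown idf mode: {mode}")


def atf(tf: int, mode: str = "Natural") -> float:
    """Adjusted term frequency; tf is assumed to be at least 1."""
    if mode == "Natural":
        return float(tf)
    if mode == "Sqrt":
        return math.sqrt(tf)
    if mode == "Logarithmic":
        return 1 + math.log2(tf)
    raise ValueError(f"unknown tf mode: {mode}")


def raw_score(tf: int, df: int, n: int,
              tf_mode: str = "Natural",
              idf_mode: str = "LogarithmicScaled") -> float:
    """Raw score s = atf * idf, before any normalization."""
    return atf(tf, tf_mode) * idf(df, n, idf_mode)
```

For example, with tf=4, df=3, and n=8, the Sqrt and LogarithmicScaled options
give atf=2 and idf=log2(8/4)=1, so the raw score is 2.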