Revision: 17543
http://sourceforge.net/p/gate/code/17543
Author: adamfunk
Date: 2014-03-05 13:24:25 +0000 (Wed, 05 Mar 2014)
Log Message:
-----------
Tie up a few loose ends in TR.
Modified Paths:
--------------
userguide/trunk/misc-creole.tex
Modified: userguide/trunk/misc-creole.tex
===================================================================
--- userguide/trunk/misc-creole.tex 2014-03-05 10:36:35 UTC (rev 17542)
+++ userguide/trunk/misc-creole.tex 2014-03-05 13:24:25 UTC (rev 17543)
@@ -3256,11 +3256,11 @@
we are now including it in GATE as a response to frequent requests from GATE
users who have read publications related to those projects.
-The easiest way to test TermRaider is to populate a corpus with related
-documents, load the sample
-application (\texttt{plugins/TermRaider/applications/termraider-eng.gapp}),
-and run it. This application will process the documents and create instances of
-three termbank language resources with sensible parameters.
+The easiest way to try TermRaider is to populate a corpus with related
+documents, load the sample application
+(\texttt{plugins/TermRaider/applications/termraider-eng.gapp}), and run it.
+This application will process the documents and create instances of three
+termbank language resources with sensible parameters.
All the language resources in TermRaider are serializable and can be stored in
GATE datastores.
@@ -3303,19 +3303,21 @@
The \texttt{Term} class is defined in terms of the term string itself, the
-language code, and the annotation type, so it is possible to distinguish
-\emph{affect}(\emph{english},\emph{Noun}) from
-\emph{affect}(\emph{english},\emph{Verb}), and
-\emph{gift}(\emph{english},\emph{Noun}) from
-\emph{gift}(\emph{german},\emph{Noun}).
+language code, and the annotation type, so it is possible (after preprocessing
+the documents properly) to distinguish \emph{affect}(\emph{english}, \emph{Noun})
+from \emph{affect}(\emph{english}, \emph{Verb}), and
+\emph{gift}(\emph{english}, \emph{Noun}) from
+\emph{gift}(\emph{german}, \emph{Noun}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsect{DocumentFrequencyBank}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This termbank counts the number of documents in which each term is found, and is
used primarily as input to the TfIdf Termbank. Document frequency can thus be
determined from a reference corpus in advance and used in subsequent calculations
-of tf.idf over other corpora.
+of tf.idf over other corpora. This type of termbank has only the principal
+score type.
+
A document frequency bank can be constructed from one or more corpora, from one
or more existing document frequency banks, or from a combination of both, so
that document frequency counts from different sources can be compiled together.
@@ -3327,11 +3329,14 @@
\end{itemize}
-This type of termbank has only the principal score type. When a TfIdf Termbank
-queries this kind for the reference document frequency, two terms are considered
-a match if both have the same language code or if either has an empty language
-code (in case some applications have been run without language identification
-PRs).
+
+When a TfIdf Termbank queries this type of termbank for the reference document
+frequency, it asks for a strictly matching term (same string, language code, and
+annotation type), but if that is not found, a lax match is used (if the
+requested term or the matching term has an empty language code---in case some
+applications have been run without language identification PRs). If the term is
+not in the DocumentFrequencyBank at all, 0 is returned. (The idf calculation,
+described in the next section, has $+1$ terms to prevent division by zero.)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsect{TfIdf Termbank}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3348,11 +3353,18 @@
following options for adjusting inverted document frequency (all adjusted to
prevent division by zero):
\begin{itemize}
- \item \emph{LogarithmicScaled}: $\mathit{idf}=\log_{2}\frac{n}{1+\mathit{df}}$;
- \item \emph{Logarithmic}: $\mathit{idf}=\log_{2}\frac{1}{1+\mathit{df}}$;
- \item \emph{Scaled}: $\mathit{idf}=\frac{1+n}{1+\mathit{df}}$;
- \item \emph{Natural}: $\mathit{idf}=\frac{1}{1+\mathit{df}}$.
+ \item \emph{LogarithmicScaled}: $\mathit{idf}=\log_{2}\frac{n}{\mathit{df}+1}$;
+ \item \emph{Logarithmic}: $\mathit{idf}=\log_{2}\frac{1}{\mathit{df}+1}$;
+ \item \emph{Scaled}: $\mathit{idf}=\frac{n+1}{\mathit{df}+1}$;
+ \item \emph{Natural}: $\mathit{idf}=\frac{1}{\mathit{df}+1}$.
\end{itemize}
+\item \textbf{tfCalculation}: an enum (pull-down) with the following options for
+ adjusting term frequency:
+ \begin{itemize}
+ \item \emph{Natural}: $\mathit{atf}=\mathit{tf}$;
+ \item \emph{Sqrt}: $\mathit{atf}=\sqrt{\mathit{tf}}$;
+ \item \emph{Logarithmic}: $\mathit{atf}=1+\log_{2} \mathit{tf}$.
+ \end{itemize}
\item \textbf{normalization}: an enum (pull-down) with the following options for
normalizing the raw score $s$, where $s=\mathit{atf}\times\mathit{idf}$:
\begin{itemize}
@@ -3362,13 +3374,6 @@
monotonically to values in the 0--100 range, so that $0{\rightarrow}0$ and
${\infty}{\rightarrow}100$).
\end{itemize}
-\item \textbf{tfCalculation}: an enum (pull-down) with the following options for
- adjusting term frequency:
- \begin{itemize}
- \item \emph{Natural}: $\mathit{atf}=\mathit{tf}$;
- \item \emph{Sqrt}: $\mathit{atf}=\sqrt{\mathit{tf}}$;
- \item \emph{Logarithmic}: $\mathit{atf}=1+\log_{2} \mathit{tf}$.
- \end{itemize}
\end{itemize}
%%
For the calculations above, $\mathit{tf}$ is the term frequency (number of
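
The reference lookup and the idf/tf options documented in the hunks above can be illustrated with a minimal Python sketch. It is not TermRaider's Java code: the function names, the dict-based bank, and the tuple representation of a term are all hypothetical. It also assumes that $n$ is the number of documents in the reference corpus and that a lax match still requires the same string and annotation type; the text shown here states neither.

```python
import math

def reference_df(bank, term):
    """Look up a reference document frequency.

    bank: dict mapping (string, language_code, annotation_type) -> df
          (a hypothetical stand-in for a DocumentFrequencyBank).
    term: a (string, language_code, annotation_type) tuple.
    """
    # 1. Strict match: same string, language code, and annotation type.
    if term in bank:
        return bank[term]
    # 2. Lax match: accepted when either side has an empty language code.
    #    Assumption: string and annotation type must still agree.
    string, lang, ann_type = term
    for (s, l, t), df in bank.items():
        if s == string and t == ann_type and (lang == "" or l == ""):
            return df
    # 3. Term not in the bank at all.
    return 0

def idf(df, n, mode="LogarithmicScaled"):
    """Adjusted inverse document frequency; df + 1 prevents division by zero."""
    if mode == "LogarithmicScaled":
        return math.log2(n / (df + 1))   # requires n > 0
    if mode == "Logarithmic":
        return math.log2(1 / (df + 1))
    if mode == "Scaled":
        return (n + 1) / (df + 1)
    if mode == "Natural":
        return 1 / (df + 1)
    raise ValueError("unknown idf mode: " + mode)

def adjusted_tf(tf, mode="Natural"):
    """Adjusted term frequency (atf)."""
    if mode == "Natural":
        return tf
    if mode == "Sqrt":
        return math.sqrt(tf)
    if mode == "Logarithmic":
        return 1 + math.log2(tf)         # requires tf >= 1
    raise ValueError("unknown tf mode: " + mode)

def raw_score(tf, df, n, tf_mode="Natural", idf_mode="LogarithmicScaled"):
    """Raw score s = atf * idf, before any normalization."""
    return adjusted_tf(tf, tf_mode) * idf(df, n, idf_mode)
```

For example, with $\mathit{tf}=4$, $\mathit{df}=3$ and $n=8$, the Sqrt and LogarithmicScaled options give $\mathit{atf}=2$ and $\mathit{idf}=\log_{2}(8/4)=1$, so $s=2$. A term missing from the bank gets $\mathit{df}=0$, and the $+1$ in each denominator keeps every idf option defined.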