Revision: 17365
          http://sourceforge.net/p/gate/code/17365
Author:   ian_roberts
Date:     2014-02-20 13:58:17 +0000 (Thu, 20 Feb 2014)
Log Message:
-----------
Documentation for WARC support.

Modified Paths:
--------------
    gcp/trunk/doc/batch-def.tex
    gcp/trunk/doc/gcp-guide.pdf

Modified: gcp/trunk/doc/batch-def.tex
===================================================================
--- gcp/trunk/doc/batch-def.tex 2014-02-20 12:14:31 UTC (rev 17364)
+++ gcp/trunk/doc/batch-def.tex 2014-02-20 13:58:17 UTC (rev 17365)
@@ -86,15 +86,16 @@
 to allow it to configure itself.  Thus, which attributes are supported and/or
 required depends on the specific handler class.
 
-GCP provides three standard input handler types:
+GCP provides four standard input handler types:
 
 \bit
 \item \verb!gate.cloud.io.file.FileInputHandler! to read documents from
   individual files on the filesystem
 \item \verb!gate.cloud.io.zip.ZipInputHandler! to read documents directly from
   a ZIP archive
-\item \verb!gate.cloud.io.arc.ARCInputHandler! to read documents from an ARC
-  archive as produced by the Heritrix web crawler
+\item \verb!gate.cloud.io.arc.ARCInputHandler! and
+  \verb!gate.cloud.io.arc.WARCInputHandler! to read documents from an ARC or
+  WARC archive as produced by the Heritrix web crawler
   (\url{http://crawler.archive.org}).
 \eit
 
@@ -151,8 +152,9 @@
 \bde
 \item[encoding] (optional) exactly as for \verb!FileInputHandler!
 \item[mimeType] (optional) exactly as for \verb!FileInputHandler!
-\item[zipFile] (required) The location of the ZIP file from which documents
-  will be read.
+\item[srcFile] (required) The location of the ZIP file from which documents
+  will be read. This parameter was previously named ``zipFile''; the old name
+  is supported for backwards compatibility but not recommended for new batches.
 \item[fileNameEncoding] (optional) The default character encoding to assume for
   file names inside the ZIP file.  This attribute is only relevant if the ZIP
   file contains files whose names contain non-ASCII characters {\em without}
@@ -169,37 +171,65 @@
 The ZIP input handler does not use pluggable naming strategies, and simply
 assumes that the document ID is the path of an entry in the ZIP file.
 
-\subsection{The {\tt ARCInputHandler}}
+\subsection{The {\tt ARCInputHandler} and {\tt WARCInputHandler}}
 
-The ARC input handler reads documents out of an Internet Archive ARC file.  It
-supports the following attributes:
+These two input handlers read documents out of ARC- and WARC-format web archive
+files as produced by the Heritrix web crawler and other similar tools.  They
+support the following attributes:
 
 \bde
-\item[arcFile] (required) The location of the ARC file.
+\item[srcFile] (optional) The location of the archive file\footnote{For ARC,
+  this parameter was previously called ``arcFile''; the old name is supported
+  for backwards compatibility but not recommended for new batches.}.  These
+  input handlers can operate in one of two modes -- if \verb!srcFile! \emph{is}
+  specified then the handler will load records from this specific archive file
+  on disk, but if \verb!srcFile! is \emph{not} specified then each document ID
+  must provide a fully qualified http or https URL to an archive.  In the
+  second mode the selected records will be downloaded individually using ``byte
+  range'' HTTP requests.
 \item[defaultEncoding] (optional) The {\em default} character encoding to
-  assume for ARC entries that do not specify their encoding in the entry
+  assume for entries that do not specify their encoding in the entry
   headers.  If an entry specifies its own encoding explicitly this will be
   used.  If this attribute is omitted, ``Windows-1252'' is assumed as the
   default.
 \item[mimeType] (optional) The MIME type that should be assumed when creating
   the document (i.e. the value of the \verb!DocumentImpl! mimeType parameter).
-  If omitted, the MIME type specified by the ARC entry will be used, if
-  present, and if the entry does not specify a MIME type header then the usual
-  GATE Embedded heuristics will apply.
+  If omitted, the usual GATE Embedded heuristics will apply.  The input
+  handlers make the HTTP headers from the archive entry available to GATE as if
+  the document had been downloaded directly from the web, so the
+  \verb!Content-Type! header from the archive entry is available to these
+  heuristics.
 \ede
 
-The ARC input handler expects its document IDs to begin with one or more digits
-(everything from the first non-digit character in the ID is ignored).  These
-leading digits are treated as a zero-based index into the ARC file, i.e. any of
-the IDs ``1'', ``000001'' or ``000001\_http://example.com'' are treated as
-referring to the second entry in the archive (0 would be the first entry).
+The web archive input handlers expect document IDs of the following form:
+\begin{lstlisting}[language=XML]
+<id recordPosition="NNN" [url="optional url of archive"]
+    recordOffset="NNN" recordLength="NNN">{original entry url}</id>
+\end{lstlisting}
 
-The ARC input handler adds all the HTTP headers and ARC record headers for the
-entry as features on the GATE \verb!Document! it creates.  HTTP header names
-are prefixed with ``http\_header\_'' and ARC record headers with
+The content of the \verb!id! element should be the original URL from which the
+entry was crawled, and the attributes are:
+\bde
+\item[recordPosition] a numeric value that
+ is used as a sequence number. If the IDs are generated by the corresponding
+ enumerator (see below), then this attribute will contain the actual
+ record position inside the archive file.
+\item[recordOffset and recordLength] the byte offset of the required record in
+  the archive, and the record's length in bytes.
+\item[url] (optional) a full HTTP or HTTPS URL to the source archive file.  If
+  this is provided, GCP will download just the specific target record using a
+  ``Range'' header on the HTTP request, rather than loading the record from the
+  input handler's usual \verb!srcFile!.
+\ede
+
+The standard enumerator implementations (see below) create IDs in the correct
+form.
+
+The web archive input handlers add all the HTTP headers and archive record
+headers for the entry as features on the GATE \verb!Document! they create.
+HTTP header names are prefixed with ``http\_header\_'' and ARC/WARC record headers with
 ``arc\_header\_''.
 
-
 \section{Specifying the Output Handlers}
 
 Output handlers are responsible for taking the GATE Documents that have been
@@ -288,30 +318,32 @@
 file or ZIP input handler but for batches that use an \verb!ARCInputHandler! a
 different strategy is required.
 
-As document IDs for an \verb!ARCInputHandler! are simple numbers (with an
-optional suffix) the simple strategy would put all the output files into a
-single directory.  Directories with very large numbers of files can lead to
-poor performance on many filesystems, so an alternative strategy is provided
-that left-pads the document ID numbers with zeros and puts them into a
-hierarchy of directories.  To use this strategy, specify an attribute
+As document IDs for an \verb!ARCInputHandler! are based on URLs, the simple
+strategy would try to put the output files into directories named after
+absolute URLs, which can include characters that are not permitted in file
+names on all platforms.  An alternative strategy is provided that makes use of
+the \verb!recordPosition! attribute on the IDs to put output files into a
+hierarchy of numbered directories.  To use this strategy, specify an attribute
 \verb!namingStrategy="gate.cloud.io.arc.ARCDocumentNamingStrategy"!, and the
 usual \verb!dir! and \verb!fileExtension! attributes of the default strategy.
 The ARC strategy also accepts an optional additional attribute \verb!pattern!
 defining the pattern to use to map the ID number to a directory.
 
-The default pattern is ``3/3'', which will left-pad the ID to a minimum of 6
-digits and then create one level of directories from the first three digits and
-use the last three as part of the file name\footnote{In fact the pattern is
-processed from right to left, so any surplus digits end up in the first place,
-i.e. the ID 1234567 becomes 1234/567 rather than 123/4567.}.  The trailing
-characters of the document ID after the numeric index are cleaned up to replace
-slash and colon characters with underscores (so the resulting file name will
-not include any more levels of subdirectories).  For full details of this
-process, see the JavaDoc documentation.  As an example, the ID
-``001\_http://example.com/file.html'' with the default pattern of ``3/3'' would
-map to the target path ``000/001\_example.com\_file.html'', and this would then
-be combined with the \verb!dir! and \verb!fileExtension! to produce the final
-file name.
+The default pattern is ``3/3'', which will left-pad the \verb!recordPosition!
+to a minimum of 6 digits and then create one level of directories from the
+first three digits and use the last three as part of the file name\footnote{In
+fact the pattern is processed from right to left, so any surplus digits end up
+in the first place, i.e. the ID 1234567 becomes 1234/567 rather than
+123/4567.}.  The ID text (i.e. the original URL) is cleaned up to remove the
+protocol, query string and fragment (if any) and replace slash and colon
+characters with underscores (so the resulting file name will not include any
+more levels of subdirectories) and appended to the numeric part following an
+underscore.  For full details of this process, see the JavaDoc
+documentation.  As an example, the ID with \verb!recordPosition="1"! and URL
+\verb!http://example.com/file.html! with the default pattern of ``3/3'' would
+map to the target path ``000/001\_example.com\_file.html'', and this
+would then be combined with the \verb!dir! and \verb!fileExtension! to produce
+the final file name.
 
 The \verb!PlainTextOutputHandler! simply saves the plain text of the GATE
 document with no annotations (so \verb!<annotationSet>! filters are ignored).
@@ -468,7 +500,7 @@
 
 The \verb!gate.cloud.io.file.FileDocumentEnumerator! takes a \verb!dir!
 attribute and the \verb!gate.cloud.io.zip.ZipDocumentEnumerator! takes
-\verb!zipFile! and \verb!fileNameEncoding! attributes (as described above for
+\verb!srcFile! and \verb!fileNameEncoding! attributes (as described above for
 their corresponding input handlers) specifying where to find the directory
 or ZIP file to be enumerated.  To define which files (or ZIP entries) to
 enumerate, the enumerators use the ``fileset'' abstraction from Apache Ant,
@@ -501,24 +533,25 @@
 pattern of ``*.xml'' would not match ``FILE.XML'', for example.  To match both
 upper and lower-case variants, include both forms in the pattern.
 
-\subsection{The ARC enumerator}
+\subsection{The ARC and WARC enumerators}
 
-The \verb!gate.cloud.io.arc.ARCDocumentEnumerator! enumerates entries in an ARC
-file, and would typically be used in conjunction with an ARC input handler.
-The enumerator supports the following attributes:
+The \verb!gate.cloud.io.arc.ARCDocumentEnumerator! and
+\verb!WARCDocumentEnumerator! classes enumerate entries in an
+ARC or WARC file, and would typically be used in conjunction with the
+corresponding input handler.  The enumerators support the following attributes:
 
 \bde
-\item[arcFile] (required) the path to the ARC archive to enumerate.
+\item[srcFile] (required) the path to the archive to enumerate.
 \item[mimeTypes] (optional) whitespace-separated list of MIME types.  If
-  specified, the enumerator will only include entries in the ARC file whose
+  specified, the enumerator will only include entries in the archive whose
   header specifies one of the given MIME types.  So a value of
   \verb!"text/html application/pdf"! would enumerate only HTML and PDF files
   from the archive.
-\item[includeStatusCodes] (optional) Each entry in an ARC file records the HTTP
-  status code (200, 301, 404, etc.) that was returned by the server when the
-  item was crawled.  This attribute gives a regular expression that is matched
-  against the status codes that should be included in the enumeration.  If
-  omitted, all status codes are included (except those excluded by
+\item[includeStatusCodes] (optional) Each entry in an archive file records the
+  HTTP status code (200, 301, 404, etc.) that was returned by the server when
+  the item was crawled.  This attribute gives a regular expression that is
+  matched against the status codes that should be included in the enumeration.
+  If omitted, all status codes are included (except those excluded by
   \verb!excludeStatusCodes!).
 \item[excludeStatusCodes] (optional) regular expression giving the status codes
   that should be excluded from the enumeration.  If {\em both}
@@ -527,13 +560,19 @@
   omit all 3xx, 4xx and 5xx status codes).
 \ede
 
-The ARC enumerator returns document IDs of the form ``nnnnnn\_{\em entryURL}'',
-i.e. the zero-based index into the archive, left-padded with zeros to a minimum
-of 6 digits, followed by an underscore, followed by the original URL of the
-archive entry, e.g. \verb!000004_http://example.com/file.html!.  This format is
-designed to work well in combination with the \verb!ARCDocumentNamingStrategy!
-for file-based output handlers.
+The enumerators return document IDs in the form required by the corresponding
+handlers:
 
+\begin{lstlisting}[language=XML]
+<id recordPosition="{zero-based index into the archive}"
+    recordOffset="{byte offset of the start of this record}"
+    recordLength="{length of the record in bytes}"
+    >{original URL from which the document was crawled}</id>
+\end{lstlisting}
+
+This format is designed to work well in combination with the
+\verb!ARCDocumentNamingStrategy!  for file-based output handlers.
+
 \subsection{The {\tt ListDocumentEnumerator}}
 
 \verb!gate.cloud.io.ListDocumentEnumerator! is the final enumerator
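
Illustration (not part of the commit): the new ID format carries everything
needed to fetch a single record with a byte-range request. The Python sketch
below parses one <id> element and works out the matching "Range" header value.
It is only a sketch under assumptions: the function name range_header_for, the
sample archive URL and the sample offsets are hypothetical, and the exact
request GCP sends is not shown in this diff. The one fixed fact it relies on is
that HTTP byte ranges are inclusive at both ends.

```python
import xml.etree.ElementTree as ET


def range_header_for(id_xml: str) -> dict:
    """Parse one <id> element and describe the byte-range fetch for its record.

    Hypothetical helper for illustration; it does not reproduce GCP's code.
    """
    elem = ET.fromstring(id_xml)
    offset = int(elem.attrib["recordOffset"])
    length = int(elem.attrib["recordLength"])
    return {
        # No url attribute means the record would be read from srcFile instead.
        "archive_url": elem.attrib.get("url"),
        # The element text is the original URL the entry was crawled from.
        "entry_url": (elem.text or "").strip(),
        # HTTP ranges are inclusive, so the last byte is offset + length - 1.
        "range": f"bytes={offset}-{offset + length - 1}",
    }


# Sample ID with made-up values, following the attribute names in the diff.
example = (
    '<id recordPosition="4" url="https://example.org/crawl.warc.gz" '
    'recordOffset="1000" recordLength="250">http://example.com/file.html</id>'
)
info = range_header_for(example)
print(info["range"])
```

An ID without the url attribute would yield archive_url = None, which
corresponds to the mode where the handler reads from its configured srcFile.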

Modified: gcp/trunk/doc/gcp-guide.pdf
===================================================================
(Binary files differ)
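
Illustration (not part of the commit): the mapping from recordPosition and URL
to an output path, as described in the ARCDocumentNamingStrategy hunk above,
can be sketched in Python as follows. This is a reading of the prose, not the
real implementation (the JavaDoc remains the authority). The function name
target_path is hypothetical, and treating each number in the pattern as a
digit-group width is an assumption that matches the two examples given in the
documentation.

```python
from urllib.parse import urlsplit


def target_path(record_position: int, url: str, pattern: str = "3/3") -> str:
    """Sketch of the documented mapping; not the actual GCP implementation."""
    # Assumption: each number in the pattern is the width of one digit group.
    widths = [int(w) for w in pattern.split("/")]
    # Left-pad to the minimum number of digits the pattern needs (6 for "3/3").
    digits = str(record_position).zfill(sum(widths))
    # Cut groups from the right, so surplus digits stay in the first group.
    groups = []
    for width in reversed(widths[1:]):
        groups.insert(0, digits[-width:])
        digits = digits[:-width]
    groups.insert(0, digits)
    # Drop protocol, query string and fragment; keep host and path only.
    parts = urlsplit(url)
    cleaned = (parts.netloc + parts.path).replace("/", "_").replace(":", "_")
    # The cleaned URL follows the numeric part after an underscore.
    return "/".join(groups) + "_" + cleaned


print(target_path(1, "http://example.com/file.html"))
print(target_path(1234567, "http://example.com/file.html"))
```

The first call reproduces the documented example
(000/001_example.com_file.html); the second shows the right-to-left rule from
the footnote, giving a directory of 1234 and a file name starting with 567.
The result would still be combined with the dir and fileExtension attributes
to form the final file name.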
