1...

mattmann Sun, 25 Oct 2015 15:31:26 -0700

Added: tika/site/src/site/apt/1.11/detection.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/1.11/detection.apt?rev=1710493&view=auto
==============================================================================
--- tika/site/src/site/apt/1.11/detection.apt (added)
+++ tika/site/src/site/apt/1.11/detection.apt Sun Oct 25 22:30:51 2015
@@ -0,0 +1,211 @@
+                          -----------------
+                          Content Detection
+                          -----------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Content Detection
+
+   This page gives you information on how content and language detection
+   works with Apache Tika, and how to tune the behaviour of Tika.
+
+%{toc|section=1|fromDepth=1}
+
+* {The Detector Interface}
+
+  The
+  {{{./api/org/apache/tika/detect/Detector.html}org.apache.tika.detect.Detector}}
+  interface is the basis for most of the content type detection in Apache
+  Tika. The different ways of detecting content all implement the
+  same common method:
+
+---
+MediaType detect(java.io.InputStream input,
+                 Metadata metadata) throws java.io.IOException
+---
+
+   The <<<detect>>> method takes the stream to inspect, and a 
+   <<<Metadata>>> object that holds any additional information on
+   the content. The detector will return a 
+   {{{./api/org/apache/tika/mime/MediaType.html}MediaType}} object describing
+   its best guess as to the type of the file.
+
+   In general, only two keys on the Metadata object are used by Detectors.
+   These are <<<Metadata.RESOURCE_NAME_KEY>>>, which should hold the name
+   of the file (where known), and <<<Metadata.CONTENT_TYPE>>>, which should
+   hold the advertised content type of the file (e.g. from a webserver or
+   a content repository).
+
+
+* {Mime Magic Detection}
+
+  By looking for special ("magic") patterns of bytes near the start of
+  the file, it is often possible to detect the type of the file. For
+  some file types, this is a simple process. For others, typically
+  container based formats, the magic detection may not be enough. (More
+  detail on detecting container formats below)
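+
+  As a rough illustration of the idea, the sketch below compares the first
+  few bytes of a stream against a tiny hand-written table. It uses only the
+  JDK and is not how Tika implements magic detection; the class name
+  <<<MagicSketch>>>, the three-entry table and the returned type names are
+  assumptions made for the example.
+

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical, JDK-only illustration of magic byte matching.
class MagicSketch {
    // Well-known leading byte patterns; this tiny table is illustrative only.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    /** True if the first {@code length} bytes of head begin with magic. */
    private static boolean startsWith(byte[] head, int length, byte[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    private static String detect(byte[] head, int length) {
        if (startsWith(head, length, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, length, ZIP)) {
            return "application/zip";          // container: contents unknown
        }
        if (startsWith(head, length, OLE2)) {
            return "application/x-tika-msoffice"; // assumed generic OLE2 name
        }
        return "application/octet-stream";     // nothing matched
    }

    /** Detects from bytes already in memory. */
    static String detect(byte[] head) {
        return detect(head, head.length);
    }

    /** Peeks at the start of the stream, then rewinds it for the caller. */
    static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[8];
        in.mark(head.length);
        int n = 0;
        try {
            int r;
            while (n < head.length
                    && (r = in.read(head, n, head.length - n)) != -1) {
                n += r;
            }
        } finally {
            in.reset();
        }
        return detect(head, n);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(new ByteArrayInputStream(data)));
    }
}
```
+
+  Note that the Zip and OLE2 entries can only name the container, which is
+  the limitation that container aware detection addresses.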
+
+  Tika is able to make use of a mime magic info file, in the 
+  {{{http://www.freedesktop.org/standards/shared-mime-info}Freedesktop MIME-info}} 
+  format, to perform mime magic detection. (Note that Tika supports a few
+  more match types than Freedesktop does.)
+
+  This is provided within Tika by
+  {{{./api/org/apache/tika/detect/MagicDetector.html}org.apache.tika.detect.MagicDetector}}.
+  It is most commonly accessed via
+  {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}},
+  normally sourced from the <<<tika-mimetypes.xml>>> and <<<custom-mimetypes.xml>>>
+  files. For more information on defining your own custom mimetypes, see
+  {{{./parser_guide.html#Add_your_MIME-Type}the new parser guide}}.
+   
+
+* {Resource Name Based Detection}
+
+  Where the name of the file is known, it is sometimes possible to guess 
+  the file type from the name or extension. Within the 
+  <<<tika-mimetypes.xml>>> file is a list of patterns which are used to
+  identify the type from the filename.
+
+  However, because files may be renamed, this method of detection is quick
+  but not always accurate.
+
+  This is provided within Tika by
+  {{{./api/org/apache/tika/detect/NameDetector.html}org.apache.tika.detect.NameDetector}}.
+
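+  The following JDK-only sketch shows the general shape of name based
+  detection using a small suffix table. It is not Tika's implementation: the
+  class name <<<NameSketch>>> and its three patterns are made up for the
+  example, whereas Tika reads its patterns from <<<tika-mimetypes.xml>>>.
+

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical, JDK-only illustration of detection from a resource name.
class NameSketch {
    private static final String UNKNOWN = "application/octet-stream";

    // Illustrative suffix table; a real pattern list is far larger.
    private static final Map<String, String> BY_SUFFIX = new LinkedHashMap<>();
    static {
        BY_SUFFIX.put(".csv", "text/csv");
        BY_SUFFIX.put(".pdf", "application/pdf");
        BY_SUFFIX.put(".txt", "text/plain");
    }

    /** Guesses a type from the name alone; the content is never inspected. */
    static String detect(String resourceName) {
        if (resourceName == null) {
            return UNKNOWN;
        }
        String name = resourceName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : BY_SUFFIX.entrySet()) {
            if (name.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(detect("report.CSV"));
        System.out.println(detect("renamed-pdf.txt")); // a renamed file misleads it
    }
}
```
+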
+
+* {Known Content Type "Detection"}
+
+  Sometimes, the mime type for a file is already known, such as when
+  downloading from a webserver, or when retrieving from a content store.
+  This information can be used by detectors, such as
+  {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}.
+
+
+* {The default Mime Types Detector}
+
+  By default, the mime type detection in Tika is provided by
+  {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}.
+  This detector makes use of <<<tika-mimetypes.xml>>> to power
+  magic based and filename based detection.
+
+  Firstly, magic based detection is used on the start of the file.
+  If the file is an XML file, then the start of the XML is processed
+  to look for root elements. Next, if available, the filename 
+  (from <<<Metadata.RESOURCE_NAME_KEY>>>) is
+  used to improve the detail of the detection, such as when magic
+  detects a text file, and the filename hints it's really a CSV. Finally,
+  if available, the supplied content type (from <<<Metadata.CONTENT_TYPE>>>)
+  is used to further refine the type.
+
+
+* {Container Aware Detection}
+
+  Several common file formats are actually held within a common container
+  format. One example is the PowerPoint .ppt and Word .doc formats, which
+  are both held within an OLE2 container. Another is Apple iWork formats,
+  which are actually a series of XML files within a Zip file.
+
+  Using magic detection, it is easy to spot that a given file is an OLE2
+  document, or a Zip file. Using magic detection alone, it is very difficult
+  (and often impossible) to tell what kind of file lives inside the container.
+
+  For some use cases, speed is important, so having a quick way to know the
+  container type is sufficient. For other cases however, you don't mind 
+  spending a bit of time (and memory!) processing the container to get a 
+  more accurate answer on its contents. For these cases, the additional
+  container aware detectors contained in the <<<Tika Parsers>>> jar should
+  be used.
+
+  Tika provides a wrapping detector in the form of 
+  {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}}.
+  This uses the service loader to discover all available detectors, including
+  any available container aware ones, and tries them in turn. For container
+  aware detection, include the <<<Tika Parsers>>> jar and its dependencies
+  in your project, then use DefaultDetector along with a <<<TikaInputStream>>>.
+
+  Because these container detectors need to read the whole file to open and
+  inspect the container, they must be used with a 
+  {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}.
+  If called with a regular <<<InputStream>>>, then all work will be done
+  by the default Mime Magic detection only.
+
+  For more information on container formats and Tika, see
+  {{{http://wiki.apache.org/tika/MetadataDiscussion}}}
+
+
+* {The default Tika Detector}
+
+  Just as with Parsers, Tika provides a special detector
+  {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}}
+  which auto-detects (based on service files) the available detectors at 
+  runtime, and tries these in turn to identify the file type.
+
+  If only <<<Tika Core>>> is available, the Default Detector will work only
+  with Mime Magic and Resource Name detection. However, if <<<Tika Parsers>>>
+  (and its dependencies!) are available, additional detectors which know about
+  containers (such as zip and ole2) will be used as appropriate, provided that
+  detection is being performed with a
+  {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}.
+  Custom detectors can also be used as desired; they simply need to be listed
+  in a service file much as is done for
+  {{{./parser_guide.html#List_the_new_parser}custom parsers}}.
+
+
+* {Ways of triggering Detection}
+
+  The simplest way to detect is through the 
+  {{{./api/org/apache/tika/Tika.html}Tika Facade class}}, which provides methods to
+  detect based on
+  {{{./api/org/apache/tika/Tika.html##detect(java.io.File)}File}},
+  {{{./api/org/apache/tika/Tika.html##detect(java.io.InputStream)}InputStream}},
+  {{{./api/org/apache/tika/Tika.html##detect(java.io.InputStream, java.lang.String)}InputStream and Filename}},
+  {{{./api/org/apache/tika/Tika.html##detect(java.lang.String)}Filename}} or a few others.
+  It works best with a File or 
+  {{{./api/org/apache/tika/io/TikaInputStream.html}TikaInputStream}}.
+
+  Alternately, detection can be performed on a specific Detector, or using
+  <<<DefaultDetector>>> to have all available Detectors used. A typical pattern
+  would be something like:
+
+---
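+import java.io.File;
+import java.io.InputStream;
+import org.apache.tika.config.TikaConfig;
+import org.apache.tika.io.TikaInputStream;
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.mime.MediaType;
+
+TikaConfig tika = new TikaConfig();
+
+for (File f : myListOfFiles) {
+   Metadata metadata = new Metadata();
+   // Pass the file name so name based detection can refine the result
+   metadata.set(Metadata.RESOURCE_NAME_KEY, f.toString());
+   // detect() returns a MediaType, not a String
+   MediaType mimetype = tika.getDetector().detect(
+        TikaInputStream.get(f), metadata);
+   System.out.println("File " + f + " is " + mimetype);
+}
+for (InputStream is : myListOfStreams) {
+   // No name is known here, so only the stream content is used
+   MediaType mimetype = tika.getDetector().detect(
+        TikaInputStream.get(is), new Metadata());
+   System.out.println("Stream " + is + " is " + mimetype);
+}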
+---
+
+* {Language Detection}
+
+  Tika is able to help identify the language of a piece of text, which
+  is useful when extracting text from document formats which do not include
+  language information in their metadata.
+
+  The language detection is provided by
+  {{{./api/org/apache/tika/language/LanguageIdentifier.html}org.apache.tika.language.LanguageIdentifier}}.
+
+* {More Examples}
+
+  For more examples of Detection using Apache Tika, please take a look at
+  the {{{./examples.html}Tika Examples page}}.


Added: tika/site/src/site/apt/1.11/gettingstarted.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/1.11/gettingstarted.apt?rev=1710493&view=auto
==============================================================================
--- tika/site/src/site/apt/1.11/gettingstarted.apt (added)
+++ tika/site/src/site/apt/1.11/gettingstarted.apt Sun Oct 25 22:30:51 2015
@@ -0,0 +1,217 @@
+                     --------------------------------
+                     Getting Started with Apache Tika
+                     --------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Getting Started with Apache Tika
+
+ This document describes how to build Apache Tika from sources and
+ how to start using Tika in an application.
+
+Getting and building the sources
+
+ To build Tika from sources you first need to either
+ {{{../download.html}download}} a source release or
+ {{{../source-repository.html}checkout}} the latest sources from
+ version control.
+
+ Once you have the sources, you can build them using the
+ {{{http://maven.apache.org/}Maven 2}} build system. Executing the
+ following command in the base directory will build the sources
+ and install the resulting artifacts in your local Maven repository.
+
+---
+mvn install
+---
+
+ See the Maven documentation for more information about the available
+ build options.
+
+ Note that you need Java 7 or higher to build Tika.
+
+Build artifacts
+
+ The Tika build consists of a number of components and produces
+ the following main binaries:
+
+ [tika-core/target/tika-core-*.jar]
+  Tika core library. Contains the core interfaces and classes of Tika,
+  but none of the parser implementations. Depends only on Java 7.
+
+ [tika-parsers/target/tika-parsers-*.jar]
+  Tika parsers. Collection of classes that implement the Tika Parser
+  interface based on various external parser libraries.
+
+ [tika-app/target/tika-app-*.jar]
+  Tika application. Combines the above components and all the external
+  parser libraries into a single runnable jar with a GUI and a command
+  line interface.
+
+ [tika-server/target/tika-server-*.jar]
+  Tika JAX-RS REST application. This is a Jetty web server running Tika
+  REST services as described in {{{http://wiki.apache.org/tika/TikaJAXRS}this page}}.
+
+ [tika-bundle/target/tika-bundle-*.jar]
+  Tika bundle. An OSGi bundle that combines tika-parsers with non-OSGified
+  parser libraries to make them easy to deploy in an OSGi environment.
+
+Using Tika as a Maven dependency
+
+ The core library, tika-core, contains the key interfaces and classes of Tika
+ and can be used by itself if you don't need the full set of parsers from
+ the tika-parsers component. The tika-core dependency looks like this:
+
+---
+  <dependency>
+    <groupId>org.apache.tika</groupId>
+    <artifactId>tika-core</artifactId>
+    <version>...</version>
+  </dependency>
+---
+
+ If you want to use Tika to parse documents (instead of simply detecting
+ document types, etc.), you'll want to depend on tika-parsers instead: 
+
+---
+  <dependency>
+    <groupId>org.apache.tika</groupId>
+    <artifactId>tika-parsers</artifactId>
+    <version>...</version>
+  </dependency>
+---
+
+ Note that adding this dependency will introduce a number of
+ transitive dependencies to your project, including one on tika-core.
+ You need to make sure that these dependencies won't conflict with your
+ existing project dependencies. You can use the following command in
+ the tika-parsers directory to get a full listing of all the dependencies.
+
+---
+$ mvn dependency:tree | grep :compile
+---
+
+Using Tika in an Ant project
+
+ Unless you use a dependency manager tool like
+ {{{http://ant.apache.org/ivy/}Apache Ivy}}, the easiest way to use
+ Tika is to include either the tika-core or the tika-app jar in your
+ classpath, depending on whether you want just the core functionality
+ or also all the parser implementations.
+
+---
+<classpath>
+  ... <!-- your other classpath entries -->
+
+  <!-- either: -->
+  <pathelement location="path/to/tika-core-${tika.version}.jar"/>
+  <!-- or: -->
+  <pathelement location="path/to/tika-app-${tika.version}.jar"/>
+
+</classpath>
+---
+
+Using Tika as a command line utility
+
+ The Tika application jar (tika-app-*.jar) can be used as a command
+ line utility for extracting text content and metadata from all sorts of
+ files. This runnable jar contains all the dependencies it needs, so
+ you don't need to worry about classpath settings to run it.
+
+ The usage instructions are shown below.
+
+---
+usage: java -jar tika-app.jar [option...] [file|port...]
+
+Options:
+    -?  or --help          Print this usage message
+    -v  or --verbose       Print debug level messages
+    -V  or --version       Print the Apache Tika version number
+
+    -g  or --gui           Start the Apache Tika GUI
+    -s  or --server        Start the Apache Tika server
+    -f  or --fork          Use Fork Mode for out-of-process extraction
+
+    -x  or --xml           Output XHTML content (default)
+    -h  or --html          Output HTML content
+    -t  or --text          Output plain text content
+    -T  or --text-main     Output plain text content (main content only)
+    -m  or --metadata      Output only metadata
+    -j  or --json          Output metadata in JSON
+    -y  or --xmp           Output metadata in XMP
+    -l  or --language      Output only language
+    -d  or --detect        Detect document type
+    -eX or --encoding=X    Use output encoding X
+    -pX or --password=X    Use document password X
+    -z  or --extract       Extract all attachements into current directory
+    --extract-dir=<dir>    Specify target directory for -z
+    -r  or --pretty-print  For XML and XHTML outputs, adds newlines and
+                           whitespace, for better readability
+
+    --create-profile=X
+         Create NGram profile, where X is a profile name
+    --list-parsers
+         List the available document parsers
+    --list-parser-details
+         List the available document parsers, and their supported mime types
+    --list-detectors
+         List the available document detectors
+    --list-met-models
+         List the available metadata models, and their supported keys
+    --list-supported-types
+         List all known media types and related information
+
+Description:
+    Apache Tika will parse the file(s) specified on the
+    command line and output the extracted text content
+    or metadata to standard output.
+
+    Instead of a file name you can also specify the URL
+    of a document to be parsed.
+
+    If no file name or URL is specified (or the special
+    name "-" is used), then the standard input stream
+    is parsed. If no arguments were given and no input
+    data is available, the GUI is started instead.
+
+- GUI mode
+
+    Use the "--gui" (or "-g") option to start the
+    Apache Tika GUI. You can drag and drop files from
+    a normal file explorer to the GUI window to extract
+    text content and metadata from the files.
+
+- Server mode
+
+    Use the "--server" (or "-s") option to start the
+    Apache Tika server. The server will listen to the
+    ports you specify as one or more arguments.
+---
+
+ You can also use the jar as a component in a Unix pipeline or
+ as an external tool in many scripting languages.
+
+---
+# Check if an Internet resource contains a specific keyword
+curl http://.../document.doc \
+  | java -jar tika-app.jar --text \
+  | grep -q keyword
+---
+
+Wrappers
+
+  Several wrappers are available to use Tika in another programming language, 
+  such as {{{https://github.com/aviks/Taro.jl}Julia}} or 
+  {{{https://github.com/chrismattmann/tika-python}Python}}.

Added: tika/site/src/site/apt/1.11/index.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/1.11/index.apt?rev=1710493&view=auto
==============================================================================
--- tika/site/src/site/apt/1.11/index.apt (added)
+++ tika/site/src/site/apt/1.11/index.apt Sun Oct 25 22:30:51 2015
@@ -0,0 +1,128 @@
+                       ----------------
+                       Apache Tika 1.11
+                       ----------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Apache Tika 1.11
+
+   The most notable changes in Tika 1.11 over the previous release are:
+   
+    * Fix regression with spacing in PPT via Andreas Beeker 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1777}TIKA-1777}}).
+
+    * Java7 API support for allowing java.nio.file.Path as method arguments
+      was added to Tika and to ParsingReader, TikaFileTypeDetector, and to
+      Tika Config ({{{http://issues.apache.org/jira/browse/TIKA-1745}TIKA-1745}}, 
+      {{{http://issues.apache.org/jira/browse/TIKA-1746}TIKA-1746}}, 
+      {{{http://issues.apache.org/jira/browse/TIKA-1751}TIKA-1751}}).
+
+    * MIME support was added for WebVTT: The Web Video Text Tracks Format
+      files ({{{http://issues.apache.org/jira/browse/TIKA-1772}TIKA-1772}}).
+
+    * MIME magic improved to ensure emails are detected as message/rfc822
+      ({{{http://issues.apache.org/jira/browse/TIKA-1771}TIKA-1771}}).
+
+    * Upgrade to Jackcess Encrypt 2.1.1 to avoid binary incompatibility
+      with Bouncy Castle 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1736}TIKA-1736}}).
+  
+    * Make div and other markup more consistent between PPT and 
+      PPTX ({{{http://issues.apache.org/jira/browse/TIKA-1755}TIKA-1755}}).
+
+    * Parse multiple authors from MSOffice's semi-colon delimited
+      author field 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1765}TIKA-1765}}).
+  
+    * Include CTAKESConfig.properties within tika-parsers resources 
+      by default 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1741}TIKA-1741}}).
+  
+    * Prevent infinite recursion when processing inline images
+      in PDF files by limiting extraction of duplicate images
+      within the same page 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1742}TIKA-1742}}).
+
+    * Upgrade to POI 3.13-final (via Andreas Beeker) 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1707}TIKA-1707}}).
+
+    * Upgraded tika-batch to use Path throughout (TIKA-1747 and
+      TIKA-1754).
+
+    * Upgraded to Path in TikaInputStream (via Yaniv Kunda) 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1744}TIKA-1744}}).
+
+    * Changed default content handler type for "/rmeta" in tika-server
+      to "xml" to align with "-J" option in tika-app.  
+      Clients can now specify handler types via PathParam. 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1716}TIKA-1716}}).
+
+    * The fantastic GROBID (or Grobid) GeneRation Of BIbliographic Data
+      for machine learning from PDF files is now integrated as a 
+      Tika parser 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1699}TIKA-1699}}, 
+       {{{http://issues.apache.org/jira/browse/TIKA-1712}TIKA-1712}}).
+
+    * The ability to specify the Tesseract Config Path was added
+      to the OCR Parser 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1703}TIKA-1703}}).
+
+    * Upgraded to ASM 5.0.4 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1705}TIKA-1705}}).
+
+    * Corrected Tika Config XML detector definition explicit loading 
+      of MimeTypes 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1708}TIKA-1708}})
+
+    * In Tika Parsers, Batch, Server, App and Examples, use Apache
+      Commons IO instead of inlined ex-Commons classes, and the Java 7
+      Standard Charset definitions 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1710}TIKA-1710}})
+
+    * Upgraded to Commons Compress 1.10, which enables zlib compressed
+      archives support 
+      ({{{http://issues.apache.org/jira/browse/TIKA-1718}TIKA-1718}})
+
+
+   The following people have contributed to Tika 1.11 by submitting or
+   commenting on the issues resolved in this release:
+
+    * Alexander Widera
+
+    * Bob Paulin
+
+    * Chris A. Mattmann
+
+    * Christian Wolfe
+
+    * Jeremy B. Merrill
+
+    * Jukka Zitting
+
+    * Justin Palmer
+
+    * Konstantin Gribov
+
+    * Lewis John McGibbney
+
+    * Nick Burch
+
+    * Sujen Shah
+
+    * Tim Allison
+
+    * Yaniv Kunda
+
+   See {{http://s.apache.org/fSj}} for more details on these contributions.

Added: tika/site/src/site/apt/1.11/parser.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/1.11/parser.apt?rev=1710493&view=auto
==============================================================================
--- tika/site/src/site/apt/1.11/parser.apt (added)
+++ tika/site/src/site/apt/1.11/parser.apt Sun Oct 25 22:30:51 2015
@@ -0,0 +1,251 @@
+                       --------------------
+                       The Parser interface
+                       --------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+The Parser interface
+
+   The
+   {{{./api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
+   interface is the key concept of Apache Tika. It hides the complexity of
+   different file formats and parsing libraries while providing a simple and
+   powerful mechanism for client applications to extract structured text
+   content and metadata from all sorts of documents. All this is achieved
+   with a single method:
+
+---
+void parse(
+    InputStream stream, ContentHandler handler, Metadata metadata,
+    ParseContext context) throws IOException, SAXException, TikaException;
+---
+
+   The <<<parse>>> method takes the document to be parsed and related metadata
+   as input and outputs the results as XHTML SAX events and extra metadata.
+   The parse context argument is used to specify context information (like
+   the current locale) that is not related to any individual document.
+   The main criteria that lead to this design were:
+
+   [Streamed parsing] The interface should require neither the client
+     application nor the parser implementation to keep the full document
+     content in memory or spooled to disk. This allows even huge documents
+     to be parsed without excessive resource requirements.
+
+   [Structured content] A parser implementation should be able to
+     include structural information (headings, links, etc.) in the extracted
+     content. A client application can use this information for example to
+     better judge the relevance of different parts of the parsed document.
+
+   [Input metadata] A client application should be able to include metadata
+     like the file name or declared content type with the document to be
+     parsed. The parser implementation can use this information to better
+     guide the parsing process.
+
+   [Output metadata] A parser implementation should be able to return
+     document metadata in addition to document content. Many document
+     formats contain metadata like the name of the author that may be useful
+     to client applications.
+
+   [Context sensitivity] While the default settings and behaviour of Tika
+     parsers should work well for most use cases, there are still situations
+     where more fine-grained control over the parsing process is desirable.
+     It should be easy to inject such context-specific information to the
+     parsing process without breaking the layers of abstraction.
+
+   []
+
+   These criteria are reflected in the arguments of the <<<parse>>> method.
+
+* Document input stream
+
+   The first argument is an
+   {{{http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html}InputStream}}
+   for reading the document to be parsed.
+
+   If this document stream can not be read, then parsing stops and the thrown
+   {{{http://docs.oracle.com/javase/6/docs/api/java/io/IOException.html}IOException}}
+   is passed up to the client application. If the stream can be read but
+   not parsed (for example if the document is corrupted), then the parser
+   throws a {{{./api/org/apache/tika/exception/TikaException.html}TikaException}}.
+
+   The parser implementation will consume this stream but <will not close it>.
+   Closing the stream is the responsibility of the client application that
+   opened it in the first place. The recommended pattern for using streams
+   with the <<<parse>>> method is:
+
+---
+InputStream stream = ...;      // open the stream
+try {
+    parser.parse(stream, ...); // parse the stream
+} finally {
+    stream.close();            // close the stream
+}
+---
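+
+   On Java 7 and later, the same ownership rule can also be written with
+   try-with-resources. The sketch below is JDK-only and hypothetical:
+   <<<consume>>> merely stands in for a parser that reads the stream without
+   closing it, and <<<TrackingStream>>> exists only to show that the caller's
+   block performs the close.
+

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical, JDK-only illustration of "the opener closes the stream".
class StreamOwnershipSketch {
    /** Stand-in for a parser: reads to the end and leaves the stream open. */
    static int consume(InputStream stream) throws IOException {
        int count = 0;
        while (stream.read() != -1) {
            count++;
        }
        return count;
    }

    /** Test double that records whether close() was called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Opens a stream, hands it to the stand-in parser, then closes it. */
    static boolean parseThenClose(byte[] data) {
        TrackingStream tracked = new TrackingStream(data);
        try (InputStream stream = tracked) {   // closed when the block exits
            consume(stream);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return tracked.closed;
    }

    public static void main(String[] args) {
        System.out.println("closed by caller: "
                + parseThenClose(new byte[] {1, 2, 3}));
    }
}
```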
+
+   Some document formats like the OLE2 Compound Document Format used by
+   Microsoft Office are best parsed as random access files. In such cases the
+   content of the input stream is automatically spooled to a temporary file
+   that gets removed once parsed. A future version of Tika may make it possible
+   to avoid this extra file if the input document is already a file in the
+   local file system. See
+   {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
+   of this feature request.
+
+* XHTML SAX events
+
+   The parsed content of the document stream is returned to the client
+   application as a sequence of XHTML SAX events. XHTML is used to express
+   structured content of the document and SAX events enable streamed
+   processing. Note that the XHTML format is used here only to convey
+   structural information, not to render the documents for browsing!
+
+   The XHTML SAX events produced by the parser implementation are sent to a
+   {{{http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
+   instance given to the <<<parse>>> method. If the content handler
+   fails to process an event, then parsing stops and the thrown
+   {{{http://docs.oracle.com/javase/6/docs/api/org/xml/sax/SAXException.html}SAXException}}
+   is passed up to the client application.
+
+   The overall structure of the generated event stream is (with indenting
+   added for clarity):
+
+---
+<html xmlns="http://www.w3.org/1999/xhtml">
+  <head>
+    <title>...</title>
+  </head>
+  <body>
+    ...
+  </body>
+</html>
+---
+
+   Parser implementations typically use the
+   {{{./api/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
+   utility class to generate the XHTML output.
+
+   Dealing with the raw SAX events can be a bit complex, so Apache Tika
+   comes with a number of utility classes that can be used to process and
+   convert the event stream to other representations.
+
+   For example, the
+   {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
+   class can be used to extract just the body part of the XHTML output and
+   feed it either as SAX events to another content handler or as characters
+   to an output stream, a writer, or simply a string. The following code
+   snippet parses a document from the standard input stream and outputs the
+   extracted text content to standard output:
+
+---
+ContentHandler handler = new BodyContentHandler(System.out);
+parser.parse(System.in, handler, ...);
+---
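+
+   To make the idea of keeping only the body part concrete, here is a
+   JDK-only sketch of a SAX handler that collects the characters found inside
+   <<<body>>> and ignores the rest. It is not Tika's
+   <<<BodyContentHandler>>>; the class name <<<BodyTextSketch>>> and the
+   <<<extract>>> helper are assumptions made for the example, and the XHTML
+   is fed in from a string rather than from a parser.
+

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, JDK-only illustration of filtering XHTML SAX events.
class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // > 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        // The default JDK factory is not namespace aware, so qName is used.
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    /** Runs the handler over an XHTML string and returns the body text. */
    static String extract(String xhtml) {
        try {
            BodyTextSketch handler = new BodyTextSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>T</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extract(xhtml));
    }
}
```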
+
+   Another useful class is
+   {{{./api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that
+   uses a background thread to parse the document and returns the extracted
+   text content as a character stream:
+
+---
+InputStream stream = ...; // the document to be parsed
+Reader reader = new ParsingReader(parser, stream, ...);
+try {
+    ...;                  // read the document text using the reader
+} finally {
+    reader.close();       // the document stream is closed automatically
+}
+---
+
+* Document metadata
+
+   The third argument to the <<<parse>>> method is used to pass document
+   metadata both in and out of the parser. Document metadata is expressed
+   as an {{{./api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
+
+   The following are some of the more interesting metadata properties:
+
+   [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
+    the document.
+
+    A client application can set this property to allow the parser to use
+    file name heuristics to determine the format of the document.
+
+    The parser implementation may set this property if the file format
+    contains the canonical name of the file (for example the Gzip format
+    has a slot for the file name).
+
+   [Metadata.CONTENT_TYPE] The declared content type of the document.
+
+    A client application can set this property based on, for example, an HTTP
+    Content-Type header. The declared content type may help the parser to
+    correctly interpret the document.
+
+    The parser implementation sets this property to the content type according
+    to which the document was parsed.
+
+   [Metadata.TITLE] The title of the document.
+
+    The parser implementation sets this property if the document format
+    contains an explicit title field.
+
+   [Metadata.AUTHOR] The name of the author of the document.
+
+    The parser implementation sets this property if the document format
+    contains an explicit author field.
+
+   []
+
+   Note that metadata handling is still being discussed by the Tika development
+   team, and it is likely that there will be some (backwards incompatible)
+   changes in metadata handling before Tika 1.0.
+
+* Parse context
+
+
+   The final argument to the <<<parse>>> method is used to inject
+   context-specific information to the parsing process. This is useful
+   for example when dealing with locale-specific date and number formats
+   in Microsoft Excel spreadsheets. Another important use of the parse
+   context is passing in the delegate parser instance to be used by
+   two-phase parsers like the
+   {{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} subclasses.
+   Some parser classes allow customization of the parsing process through
+   strategy objects in the parse context.
+
+* Parser implementations
+
+   Apache Tika comes with a number of parser classes for parsing
+   {{{./formats.html}various document formats}}. You can also extend Tika
+   with your own parsers, and of course any contributions to Tika are
+   warmly welcome.
+
+   The goal of Tika is to reuse existing parser libraries like
+   {{{http://pdfbox.apache.org/}PDFBox}} or
+   {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most
+   of the parser classes in Tika are adapters to such external libraries.
+
+   Tika also contains some general purpose parser implementations that are
+   not targeted at any specific document formats. The most notable of these
+   is the {{{./api/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}}
+   class that encapsulates all Tika functionality into a single parser that
+   can handle any type of document. This parser will automatically determine
+   the type of the incoming document based on various heuristics and will then
+   parse the document accordingly.
+
+* {More Examples}
+
+  For more examples of parsing with Apache Tika, please take a look at
+  the {{{./examples.html}Tika Examples page}}.

Added: tika/site/src/site/apt/1.11/parser_guide.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.11/parser_guide.apt?rev=1710493&view=auto
==============================================================================
--- tika/site/src/site/apt/1.11/parser_guide.apt (added)
+++ tika/site/src/site/apt/1.11/parser_guide.apt Sun Oct 25 22:30:51 2015
@@ -0,0 +1,143 @@
+                       --------------------------------------------
+                       Get Tika parsing up and running in 5 minutes
+                       --------------------------------------------
+                                          Arturo Beltran
+                                          --------------------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Get Tika parsing up and running in 5 minutes
+
+   This page is a quick start guide showing how to add a new parser to Apache Tika.
+   By following the simple steps listed below, your new parser can be running in only 5 minutes.
+
+%{toc|section=1|fromDepth=1}
+
+* {Getting Started}
+
+   The {{{./gettingstarted.html}Getting Started}} document describes how to 
+   build Apache Tika from sources and how to start using Tika in an application. Pay close attention 
+   and follow the instructions in the "Getting and building the sources" section.
+   
+
+* {Add your MIME-Type}
+
+   Tika loads the core, standard MIME-Types from the file 
+   "org/apache/tika/mime/tika-mimetypes.xml", which comes from
+   {{{http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}}.
+   If your new MIME-Type is a standard one which is missing from Tika, 
+   submit a patch for this file!
+
+   If your MIME-Type is not yet known to Tika, create a new file 
+   "org/apache/tika/mime/custom-mimetypes.xml" in your codebase 
+   and add something like this to it:
+   
+---
+ <?xml version="1.0" encoding="UTF-8"?>
+ <mime-info>
+   <mime-type type="application/hello">
+         <glob pattern="*.hi"/>
+   </mime-type>
+ </mime-info>
+---
+
+* {Create your Parser class}
+
+   Now, you need to create your new parser. This is a class that must 
+   implement the Parser interface offered by Tika. Instead of implementing 
+   the Parser interface directly, it is recommended that you extend the
+   abstract class AbstractParser if possible. AbstractParser handles
+   translating between API changes for you.
+
+   A very simple Tika Parser looks like this:
+   
+---
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ * 
+ * @Author: Arturo Beltran
+ */
+package org.apache.tika.parser.hello;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Collections;
+import java.util.Set;
+
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.mime.MediaType;
+import org.apache.tika.parser.ParseContext;
+import org.apache.tika.parser.AbstractParser;
+import org.apache.tika.sax.XHTMLContentHandler;
+import org.xml.sax.ContentHandler;
+import org.xml.sax.SAXException;
+
+public class HelloParser extends AbstractParser {
+
+       private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("hello"));
+       public static final String HELLO_MIME_TYPE = "application/hello";
+       
+       public Set<MediaType> getSupportedTypes(ParseContext context) {
+               return SUPPORTED_TYPES;
+       }
+
+       public void parse(
+                       InputStream stream, ContentHandler handler,
+                       Metadata metadata, ParseContext context)
+                       throws IOException, SAXException, TikaException {
+
+               metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
+               metadata.set("Hello", "World");
+
+               XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
+               xhtml.startDocument();
+               xhtml.endDocument();
+       }
+}
+---
+   
+   Pay special attention to the definition of the SUPPORTED_TYPES static class 
+   field in the parser class that defines what MIME-Types it supports. If
+   your MIME-Types aren't standard ones, ensure you have listed them in a 
+   "custom-mimetypes.xml" file so that Tika knows about them (see above).
+   
+   The "parse" method is where you will do all your work, that is, extract 
+   the information from the resource and then set the metadata.
+
+* {List the new parser}
+
+   Finally, you should explicitly tell the AutoDetectParser to include your new 
+   parser. This step is only needed if you want to use the AutoDetectParser functionality. 
+   If you figure out the correct parser in a different way, it isn't needed. 
+   
+   List your new parser in:
+    {{{http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}}
+   
+

Modified: tika/site/src/site/apt/download.apt.vm
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/download.apt.vm?rev=1710493&r1=1710492&r2=1710493&view=diff
==============================================================================
--- tika/site/src/site/apt/download.apt.vm (original)
+++ tika/site/src/site/apt/download.apt.vm Sun Oct 25 22:30:51 2015
@@ -25,18 +25,18 @@ Download Apache Tika
 
    * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-${project.parent.version}-src.zip}Mirrors for apache-tika-${project.parent.version}-src.zip}}
      (source archive, {{{http://www.apache.org/dist/tika/tika-${project.parent.version}-src.zip.asc}PGP signature}})\
-     SHA1: <<<b1573adcb194e2c09b77eccc3b1edd16bd4ac67d>>>\
-     MD5: <<<092d8bbc51756b180a8d65bbd4620801>>>
+     SHA1: <<<d0dde7b3a4f1a2fb6ccd741552ea180dddab630a>>>\
+     MD5: <<<ccca11a7e5c300e438b2a52012cf4e39>>>
 
    * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-app-${project.parent.version}.jar}Mirrors for tika-app-${project.parent.version}.jar}}
      (runnable jar, {{{http://www.apache.org/dist/tika/tika-app-${project.parent.version}.jar.asc}PGP signature}})\
-     SHA1: <<<8803a37c5c9467058a4e116beaa97668dad192e1>>>\
-     MD5: <<<a899be6467e446031315926c10b8763c>>>\
+     SHA1: <<<59cc7c4c48a6a41899ca282d925b2738d05a45a8>>>\
+     MD5: <<<3e133bcb3cd709fddd1bda3eebc1a0e5>>>\
 
    * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-server-${project.parent.version}.jar}Mirrors for tika-server-${project.parent.version}.jar}}
      (runnable jar, {{{http://www.apache.org/dist/tika/tika-server-${project.parent.version}.jar.asc}PGP signature}})\
-     SHA1: <<<7bbecca884fa014d40d4468967e9bbd74a64a273>>>\
-     MD5: <<<973965a14c73a93315e756e62a18e8a0>>>
+     SHA1: <<<c1ca6453573fb7fa1f6b3d81dc4c9847a9a86a62>>>\
+     MD5: <<<7e28f3288c3bcd0c26ac6f557ddfb977>>>
 
    []
 

Modified: tika/site/src/site/apt/index.apt.vm
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/index.apt.vm?rev=1710493&r1=1710492&r2=1710493&view=diff
==============================================================================
--- tika/site/src/site/apt/index.apt.vm (original)
+++ tika/site/src/site/apt/index.apt.vm Sun Oct 25 22:30:51 2015
@@ -39,6 +39,15 @@ Apache Tika - a content analysis toolkit
 
 Latest News
 
+   [25 October 2015: Apache Tika Release]
+    Apache Tika 1.11 has been released! This release includes several improvements
+    that better utilize Java7 support, that help extract more content using the
+    cTAKES clinical extraction system and GROBID journal parser, and improvements
+    to Tesseract extraction. Please see the 
+    {{{https://dist.apache.org/repos/dist/release/tika/CHANGES-1.11.txt}CHANGES.txt}}
+    file for a full list of changes in this release and have a look at the download
+    page for more information on how to obtain Apache Tika 1.11.
+
    [01 August 2015: Apache Tika Release]
     Apache Tika 1.10 has been released! This release includes several improvements 
     including the ability to parse MS Access Files, composite parser creation via Tika 

Modified: tika/site/src/site/site.xml
URL: http://svn.apache.org/viewvc/tika/site/src/site/site.xml?rev=1710493&r1=1710492&r2=1710493&view=diff
==============================================================================
--- tika/site/src/site/site.xml (original)
+++ tika/site/src/site/site.xml Sun Oct 25 22:30:51 2015
@@ -40,7 +40,17 @@
       <item name="Issue Tracker" href="https://issues.apache.org/jira/browse/TIKA"/>
     </menu>
     <menu name="Documentation">
-      <item name="Apache Tika 1.10" href="1.10/index.html">
+      <item name="Apache Tika 1.11" href="1.11/index.html">
+        <item name="Getting Started" href="1.11/gettingstarted.html"/>
+        <item name="Supported Formats" href="1.11/formats.html"/>
+        <item name="Parser API" href="1.11/parser.html"/>
+        <item name="Parser 5min Quick Start Guide" href="1.11/parser_guide.html"/>
+        <item name="Content and Language Detection" href="1.11/detection.html"/>
+        <item name="Configuring Tika" href="1.11/configuring.html"/>
+        <item name="Usage Examples" href="1.11/examples.html"/>
+        <item name="API Documentation" href="1.11/api/"/>
+      </item>
+      <item name="Apache Tika 1.10" href="1.10/index.html" collapse="true">
         <item name="Getting Started" href="1.10/gettingstarted.html"/>
         <item name="Supported Formats" href="1.10/formats.html"/>
         <item name="Parser API" href="1.10/parser.html"/>
@@ -69,63 +79,6 @@
         <item name="Usage Examples" href="1.8/examples.html"/>
         <item name="API Documentation" href="1.8/api/"/>
       </item>
-      <item name="Apache Tika 1.7" href="1.7/index.html" collapse="true">
-        <item name="Getting Started" href="1.7/gettingstarted.html"/>
-        <item name="Supported Formats" href="1.7/formats.html"/>
-        <item name="Parser API" href="1.7/parser.html"/>
-        <item name="Parser 5min Quick Start Guide" href="1.7/parser_guide.html"/>
-        <item name="Content and Language Detection" href="1.7/detection.html"/>
-        <item name="Usage Examples" href="1.7/examples.html"/>
-        <item name="API Documentation" href="1.7/api/"/>
-      </item>
-      <item name="Apache Tika 1.6" href="1.6/index.html" collapse="true">
-        <item name="Getting Started" href="1.6/gettingstarted.html"/>
-        <item name="Supported Formats" href="1.6/formats.html"/>
-        <item name="Parser API" href="1.6/parser.html"/>
-        <item name="Parser 5min Quick Start Guide" href="1.6/parser_guide.html"/>
-        <item name="Content and Language Detection" href="1.6/detection.html"/>
-        <item name="API Documentation" href="1.6/api/"/>
-      </item>
-      <item name="Apache Tika 1.5" href="1.5/index.html" collapse="true">
-        <item name="Getting Started" href="1.5/gettingstarted.html"/>
-        <item name="Supported Formats" href="1.5/formats.html"/>
-        <item name="Parser API" href="1.5/parser.html"/>
-        <item name="Parser 5min Quick Start Guide" href="1.5/parser_guide.html"/>
-        <item name="Content and Language Detection" href="1.5/detection.html"/>
-        <item name="API Documentation" href="1.5/api/"/>
-      </item>
-      <item name="Apache Tika 1.4" href="1.4/index.html" collapse="true">
-        <item name="Getting Started" href="1.4/gettingstarted.html"/>
-        <item name="Supported Formats" href="1.4/formats.html"/>
-        <item name="Parser API" href="1.4/parser.html"/>
-        <item name="Parser 5min Quick Start Guide" href="1.4/parser_guide.html"/>
-        <item name="Content and Language Detection" href="1.4/detection.html"/>
-        <item name="API Documentation" href="1.4/api/"/>
-      </item>
-      <item name="Apache Tika 1.3" href="1.3/index.html" collapse="true">
-        <item name="Getting Started" href="1.3/gettingstarted.html"/>
-        <item name="Supported Formats" href="1.3/formats.html"/>
-        <item name="Parser API" href="1.3/parser.html"/>
-        <item name="Parser 5min Quick Start Guide" href="1.3/parser_guide.html"/>
-        <item name="Content and Language Detection" href="1.3/detection.html"/>
-        <item name="API Documentation" href="1.3/api/"/>
-      </item>
-      <item name="Apache Tika 1.2" href="1.2/index.html" collapse="true">
-        <item name="Getting Started" href="1.2/gettingstarted.html"/>
-        <item name="Supported Formats" href="1.2/formats.html"/>
-        <item name="Parser API" href="1.2/parser.html"/>
-        <item name="Parser 5min Quick Start Guide" href="1.2/parser_guide.html"/>
-        <item name="Content and Language Detection" href="1.2/detection.html"/>
-        <item name="API Documentation" href="1.2/api/"/>
-      </item>
-      <item name="Apache Tika 1.1" href="1.1/index.html" collapse="true">
-        <item name="Getting Started" href="1.1/gettingstarted.html"/>
-        <item name="Supported Formats" href="1.1/formats.html"/>
-        <item name="Parser API" href="1.1/parser.html"/>
-        <item name="Parser 5min Quick Start Guide" href="1.1/parser_guide.html"/>
-        <item name="Content and Language Detection" href="1.1/detection.html"/>
-        <item name="API Documentation" href="1.1/api/"/>
-      </item>
     </menu>
     <menu name="The Apache Software Foundation">
       <item name="About" href="http://www.apache.org/foundation/"/>
