Nick, I'm working on this. Can you hold off?

> On Sep 5, 2014, at 12:15 PM, "n...@apache.org" <n...@apache.org> wrote:
> 
> Author: nick
> Date: Fri Sep  5 19:14:58 2014
> New Revision: 1622762
> 
> URL: http://svn.apache.org/r1622762
> Log:
> Republish the site
> 
> Added:
>    tika/site/publish/1.6/detection.html
>    tika/site/publish/1.6/gettingstarted.html
>    tika/site/publish/1.6/parser.html
>    tika/site/publish/1.6/parser_guide.html
>    tika/site/publish/1.7/
>    tika/site/publish/1.7/examples.html
>    tika/site/publish/1.7/formats.html
> Modified:
>    tika/site/publish/1.4/gettingstarted.html
>    tika/site/publish/1.5/gettingstarted.html
>    tika/site/publish/1.6/formats.html
>    tika/site/publish/index.html
> 
> Modified: tika/site/publish/1.4/gettingstarted.html
> URL: 
> http://svn.apache.org/viewvc/tika/site/publish/1.4/gettingstarted.html?rev=1622762&r1=1622761&r2=1622762&view=diff
> ==============================================================================
> --- tika/site/publish/1.4/gettingstarted.html (original)
> +++ tika/site/publish/1.4/gettingstarted.html Fri Sep  5 19:14:58 2014
> @@ -94,13 +94,13 @@
> <div>
> <pre>mvn install</pre></div>
> <p>See the Maven documentation for more information about the available build 
> options.</p>
> -<p>Note that you need Java 5 or higher to build Tika.</p></div>
> +<p>Note that you need Java 6 or higher to build Tika.</p></div>
> <div class="section">
> <h2>Build artifacts<a name="Build_artifacts"></a></h2>
> <p>The Tika build consists of a number of components and produces the 
> following main binaries:</p>
> <dl>
> <dt>tika-core/target/tika-core-*.jar</dt>
> -<dd> Tika core library. Contains the core interfaces and classes of Tika, 
> but none of the parser implementations. Depends only on Java 5.</dd>
> +<dd> Tika core library. Contains the core interfaces and classes of Tika, 
> but none of the parser implementations. Depends only on Java 6.</dd>
> <dt>tika-parsers/target/tika-parsers-*.jar</dt>
> <dd> Tika parsers. Collection of classes that implement the Tika Parser 
> interface based on various external parser libraries.</dd>
> <dt>tika-app/target/tika-app-*.jar</dt>
> 
> Modified: tika/site/publish/1.5/gettingstarted.html
> URL: 
> http://svn.apache.org/viewvc/tika/site/publish/1.5/gettingstarted.html?rev=1622762&r1=1622761&r2=1622762&view=diff
> ==============================================================================
> --- tika/site/publish/1.5/gettingstarted.html (original)
> +++ tika/site/publish/1.5/gettingstarted.html Fri Sep  5 19:14:58 2014
> @@ -94,13 +94,13 @@
> <div>
> <pre>mvn install</pre></div>
> <p>See the Maven documentation for more information about the available build 
> options.</p>
> -<p>Note that you need Java 5 or higher to build Tika.</p></div>
> +<p>Note that you need Java 6 or higher to build Tika.</p></div>
> <div class="section">
> <h2>Build artifacts<a name="Build_artifacts"></a></h2>
> <p>The Tika build consists of a number of components and produces the 
> following main binaries:</p>
> <dl>
> <dt>tika-core/target/tika-core-*.jar</dt>
> -<dd> Tika core library. Contains the core interfaces and classes of Tika, 
> but none of the parser implementations. Depends only on Java 5.</dd>
> +<dd> Tika core library. Contains the core interfaces and classes of Tika, 
> but none of the parser implementations. Depends only on Java 6.</dd>
> <dt>tika-parsers/target/tika-parsers-*.jar</dt>
> <dd> Tika parsers. Collection of classes that implement the Tika Parser 
> interface based on various external parser libraries.</dd>
> <dt>tika-app/target/tika-app-*.jar</dt>
> 
> Added: tika/site/publish/1.6/detection.html
> URL: 
> http://svn.apache.org/viewvc/tika/site/publish/1.6/detection.html?rev=1622762&view=auto
> ==============================================================================
> --- tika/site/publish/1.6/detection.html (added)
> +++ tika/site/publish/1.6/detection.html Fri Sep  5 19:14:58 2014
> @@ -0,0 +1,357 @@
> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> +          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> +
> +<!--
> +  Licensed to the Apache Software Foundation (ASF) under one
> +  or more contributor license agreements.  See the NOTICE file
> +  distributed with this work for additional information
> +  regarding copyright ownership.  The ASF licenses this file
> +  to you under the Apache License, Version 2.0 (the
> +  "License"); you may not use this file except in compliance
> +  with the License.  You may obtain a copy of the License at
> +
> +    http://www.apache.org/licenses/LICENSE-2.0
> + 
> +  Unless required by applicable law or agreed to in writing,
> +  software distributed under the License is distributed on an
> +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> +  KIND, either express or implied.  See the License for the
> +  specific language governing permissions and limitations
> +  under the License.
> +-->
> +
> +
> +
> +
> +
> +
> +
> +<html xmlns="http://www.w3.org/1999/xhtml">
> +  <head>
> +    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
> +    <title>Apache Tika - Content Detection</title>
> +    <style type="text/css" media="all">
> +      @import url("../css/site.css");
> +    </style>
> +    <link rel="icon" type="image/png" href="../tikaNoText16.png" />
> +    <script type="text/javascript">
> +      function selectProvider(form) {
> +        provider = form.elements['searchProvider'].value;
> +        if (provider == "any") {
> +          if (Math.random() > 0.5) {
> +            provider = "lucid";
> +          } else {
> +            provider = "sl";
> +          }
> +        }
> +        if (provider == "lucid") {
> +          form.action = "http://find.searchhub.org/p:tika";
> +        } else if (provider == "sl") {
> +          form.action = "http://search-lucene.com/tika";
> +        }
> +        days = 90;
> +        date = new Date();
> +        date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
> +        expires = "; expires=" + date.toGMTString();
> +        document.cookie = "searchProvider=" + provider + expires + "; path=/";
> +      }
> +      function initProvider() {
> +        if (document.cookie.length>0) {
> +          cStart=document.cookie.indexOf("searchProvider=");
> +          if (cStart!=-1) {
> +            cStart=cStart + "searchProvider=".length;
> +            cEnd=document.cookie.indexOf(";", cStart);
> +            if (cEnd==-1) {
> +              cEnd=document.cookie.length;
> +            }
> +            provider = unescape(document.cookie.substring(cStart,cEnd));
> +            document.forms['searchform'].elements['searchProvider'].value = provider;
> +          }
> +        }
> +        document.forms['searchform'].elements['q'].focus();
> +      }
> +    </script>
> +  </head>
> +  <body onLoad="initProvider();">
> +    <div id="body">
> +      <div id="banner">
> +        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
> +          ><img src="http://tika.apache.org/tika.png" alt="Apache Tika"
> +                width="292" height="100"/></a>
> +        <a href="http://www.apache.org/" id="bannerRight"
> +           title="The Apache Software Foundation"
> +          ><img src="http://tika.apache.org/asf-logo.gif" alt="The Apache Software Foundation"
> +                width="387" height="100"/></a>
> +      </div>
> +      <div id="content">
> +        <!-- Licensed to the Apache Software Foundation (ASF) under one or 
> more --><!-- contributor license agreements.  See the NOTICE file distributed 
> with --><!-- this work for additional information regarding copyright 
> ownership. --><!-- The ASF licenses this file to You under the Apache 
> License, Version 2.0 --><!-- (the "License"); you may not use this file 
> except in compliance with --><!-- the License.  You may obtain a copy of the 
> License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 
> --><!--  --><!-- Unless required by applicable law or agreed to in writing, 
> software --><!-- distributed under the License is distributed on an "AS IS" 
> BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express 
> or implied. --><!-- See the License for the specific language governing 
> permissions and --><!-- limitations under the License. --><div 
> class="section">
> +<h2>Content Detection<a name="Content_Detection"></a></h2>
> +<p>This page gives you information on how content and language detection 
> works with Apache Tika, and how to tune the behaviour of Tika.</p>
> +<ul>
> +<li><a href="#Content_Detection">Content Detection</a>
> +<ul>
> +<li><a href="#The_Detector_Interface">The Detector Interface</a></li>
> +<li><a href="#Mime_Magic_Detction">Mime Magic Detection</a></li>
> +<li><a href="#Resource_Name_Based_Detection">Resource Name Based 
> Detection</a></li>
> +<li><a href="#Known_Content_Type_Detection">Known Content Type 
> Detection</a></li>
> +<li><a href="#The_default_Mime_Types_Detector">The default Mime Types 
> Detector</a></li>
> +<li><a href="#Container_Aware_Detection">Container Aware Detection</a></li>
> +<li><a href="#The_default_Tika_Detector">The default Tika Detector</a></li>
> +<li><a href="#Ways_of_triggering_Detection">Ways of triggering 
> Detection</a></li>
> +<li><a href="#Language_Detection">Language Detection</a></li></ul></li></ul>
> +<div class="section">
> +<h3><a name="The_Detector_Interface">The Detector Interface</a></h3>
> +<p>The <a 
> href="./api/org/apache/tika/detect/Detector.html">org.apache.tika.detect.Detector</a>
>  interface is the basis for most of the content type detection in Apache 
> Tika. The different ways of detecting content all implement the same 
> common method:</p>
> +<div>
> +<pre>MediaType detect(java.io.InputStream input,
> +                 Metadata metadata) throws java.io.IOException</pre></div>
> +<p>The <tt>detect</tt> method takes the stream to inspect, and a 
> <tt>Metadata</tt> object that holds any additional information on the 
> content. The detector will return a <a 
> href="./api/org/apache/tika/mime/MediaType.html">MediaType</a> object 
> describing its best guess as to the type of the file.</p>
> +<p>In general, only two keys on the Metadata object are used by Detectors. 
> These are <tt>Metadata.RESOURCE_NAME_KEY</tt> which should hold the name of 
> the file (where known), and <tt>Metadata.CONTENT_TYPE</tt> which should hold 
> the advertised content type of the file (eg from a webserver or a content 
> repository).</p></div>
> +<div class="section">
> +<h3><a name="Mime_Magic_Detction">Mime Magic Detection</a></h3>
> +<p>By looking for special (&quot;magic&quot;) patterns of bytes near the 
> start of the file, it is often possible to detect the type of the file. For 
> some file types, this is a simple process. For others, typically container 
> based formats, the magic detection may not be enough. (More detail on 
> detecting container formats below)</p>
> +<p>Tika is able to make use of a mime magic info file, in the <a 
> class="externalLink" 
> href="http://www.freedesktop.org/standards/shared-mime-info">Freedesktop 
> MIME-info</a> format, to perform mime magic detection. (Note that Tika supports 
> a few more match types than Freedesktop does.)</p>
> +<p>This is provided within Tika by <a 
> href="./api/org/apache/tika/detect/MagicDetector.html">org.apache.tika.detect.MagicDetector</a>.
>  It is most commonly accessed via <a 
> href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.MimeTypes</a>,
>  normally sourced from the <tt>tika-mimetypes.xml</tt> and 
> <tt>custom-mimetypes.xml</tt> files. For more information on defining your 
> own custom mimetypes, see <a 
> href="./parser_guide.html#Add_your_MIME-Type">the new parser 
> guide</a>.</p></div>
> +<div class="section">
> +<h3><a name="Resource_Name_Based_Detection">Resource Name Based 
> Detection</a></h3>
> +<p>Where the name of the file is known, it is sometimes possible to guess 
> the file type from the name or extension. Within the 
> <tt>tika-mimetypes.xml</tt> file is a list of patterns which are used to 
> identify the type from the filename.</p>
> +<p>However, because files may be renamed, this method of detection is quick 
> but not always as accurate.</p>
> +<p>This is provided within Tika by <a 
> href="./api/org/apache/tika/detect/NameDetector.html">org.apache.tika.detect.NameDetector</a>.</p></div>
> +<div class="section">
> +<h3><a name="Known_Content_Type_Detection">Known Content Type 
> Detection</a></h3>
> +<p>Sometimes, the mime type for a file is already known, such as when 
> downloading from a webserver, or when retrieving from a content store. This 
> information can be used by detectors, such as <a 
> href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.MimeTypes</a>,</p></div>
> +<div class="section">
> +<h3><a name="The_default_Mime_Types_Detector">The default Mime Types 
> Detector</a></h3>
> +<p>By default, the mime type detection in Tika is provided by <a 
> href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.MimeTypes</a>.
>  This detector makes use of <tt>tika-mimetypes.xml</tt> to power magic based 
> and filename based detection.</p>
> +<p>Firstly, magic based detection is used on the start of the file. If the 
> file is an XML file, then the start of the XML is processed to look for root 
> elements. Next, if available, the filename (from 
> <tt>Metadata.RESOURCE_NAME_KEY</tt>) is then used to improve the detail of 
> the detection, such as when magic detects a text file, and the filename hints 
> it's really a CSV. Finally, if available, the supplied content type (from 
> <tt>Metadata.CONTENT_TYPE</tt>) is used to further refine the type.</p></div>
> +<div class="section">
> +<h3><a name="Container_Aware_Detection">Container Aware Detection</a></h3>
> +<p>Several common file formats are actually held within a common container 
> format. One example is the PowerPoint .ppt and Word .doc formats, which are 
> both held within an OLE2 container. Another is Apple iWork formats, which are 
> actually a series of XML files within a Zip file.</p>
> +<p>Using magic detection, it is easy to spot that a given file is an OLE2 
> document, or a Zip file. Using magic detection alone, it is very difficult 
> (and often impossible) to tell what kind of file lives inside the 
> container.</p>
> +<p>For some use cases, speed is important, so having a quick way to know the 
> container type is sufficient. For other cases however, you don't mind 
> spending a bit of time (and memory!) processing the container to get a more 
> accurate answer on its contents. For these cases, the additional container 
> aware detectors contained in the <tt>Tika Parsers</tt> jar should be used.</p>
> +<p>Tika provides a wrapping detector in the form of <a 
> href="./api/org/apache/tika/detect/DefaultDetector.html">org.apache.tika.detect.DefaultDetector</a>.
>  This uses the service loader to discover all available detectors, including 
> any available container aware ones, and tries them in turn. For container 
> aware detection, include the <tt>Tika Parsers</tt> jar and its dependencies 
> in your project, then use DefaultDetector along with a 
> <tt>TikaInputStream</tt>.</p>
> +<p>Because these container detectors need to read the whole file to open 
> and inspect the container, they must be used with a <a 
> href="./api/org/apache/tika/io/TikaInputStream.html">org.apache.tika.io.TikaInputStream</a>.
>  If called with a regular <tt>InputStream</tt>, then all work will be done by 
> the default Mime Magic detection only.</p>
> +<p>For more information on container formats and Tika, see <a 
> class="externalLink" 
> href="http://wiki.apache.org/tika/MetadataDiscussion"></a></p></div>
> +<div class="section">
> +<h3><a name="The_default_Tika_Detector">The default Tika Detector</a></h3>
> +<p>Just as with Parsers, Tika provides a special detector <a 
> href="./api/org/apache/tika/detect/DefaultDetector.html">org.apache.tika.detect.DefaultDetector</a>
>  which auto-detects (based on service files) the available detectors at 
> runtime, and tries these in turn to identify the file type.</p>
> +<p>If only <tt>Tika Core</tt> is available, the Default Detector will work 
> only with Mime Magic and Resource Name detection. However, if <tt>Tika 
> Parsers</tt> (and its dependencies!) are available, additional detectors 
> which know about containers (such as zip and ole2) will be used as 
> appropriate, provided that detection is being performed with a <a 
> href="./api/org/apache/tika/io/TikaInputStream.html">org.apache.tika.io.TikaInputStream</a>.
>  Custom detectors can also be used as desired, they simply need to be listed 
> in a service file much as is done for <a 
> href="./parser_guide.html#List_the_new_parser">custom parsers</a>.</p></div>
> +<div class="section">
> +<h3><a name="Ways_of_triggering_Detection">Ways of triggering 
> Detection</a></h3>
> +<p>The simplest way to detect is through the <a 
> href="./api/org/apache/tika/Tika.html">Tika Facade class</a>, which provides 
> methods to detect based on <a 
> href="./api/org/apache/tika/Tika.html#detect(java.io.File)">File</a>, <a 
> href="./api/org/apache/tika/Tika.html#detect(java.io.InputStream)">InputStream</a>,
>  <a href="./api/org/apache/tika/Tika.html#detect(java.io.InputStream, 
> java.lang.String)">InputStream and Filename</a>, <a 
> href="./api/org/apache/tika/Tika.html#detect(java.lang.String)">Filename</a> 
> or a few others. It works best with a File or <a 
> href="./api/org/apache/tika/io/TikaInputStream.html">TikaInputStream</a>.</p>
> +<p>Alternately, detection can be performed on a specific Detector, or using 
> <tt>DefaultDetector</tt> to have all available Detectors used. A typical 
> pattern would be something like:</p>
> +<div>
> +<pre>TikaConfig tika = new TikaConfig();
> +
> +for (File f : myListOfFiles) {
> +   Metadata metadata = new Metadata();
> +   metadata.set(Metadata.RESOURCE_NAME_KEY, f.toString());
> +   String mimetype = tika.getDetector().detect(
> +        TikaInputStream.get(f), metadata);
> +   System.out.println(&quot;File &quot; + f + &quot; is &quot; + mimetype);
> +}
> +for (InputStream is : myListOfStreams) {
> +   String mimetype = tika.getDetector().detect(
> +        TikaInputStream.get(is), new Metadata());
> +   System.out.println(&quot;Stream &quot; + is + &quot; is &quot; + 
> mimetype);
> +}</pre></div></div>
> +<div class="section">
> +<h3><a name="Language_Detection">Language Detection</a></h3>
> +<p>Tika is able to help identify the language of a piece of text, which is 
> useful when extracting text from document formats which do not include 
> language information in their metadata.</p>
> +<p>The language detection is provided by <a 
> href="./api/org/apache/tika/language/LanguageIdentifier.html">org.apache.tika.language.LanguageIdentifier</a></p></div></div>
> +      </div>
> +      <div id="sidebar">
> +        <div id="navigation">
> +                    <h5>Apache Tika</h5>
> +            <ul>
> +              
> +    <li class="none">
> +                    <a href="../index.html">Introduction</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../download.html">Download</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../contribute.html">Contribute</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../mail-lists.html">Mailing Lists</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://wiki.apache.org/tika/" 
> class="externalLink">Tika Wiki</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="https://issues.apache.org/jira/browse/TIKA" 
> class="externalLink">Issue Tracker</a>
> +          </li>
> +          </ul>
> +              <h5>Documentation</h5>
> +            <ul>
> +              
> +          
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="expanded">
> +                    <a href="../1.5/index.html">Apache Tika 1.5</a>
> +                  <ul>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/gettingstarted.html">Getting Started</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/formats.html">Supported Formats</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/parser.html">Parser API</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/parser_guide.html">Parser 5min Quick 
> Start Guide</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/detection.html">Content and Language 
> Detection</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/api/">API Documentation</a>
> +          </li>
> +              </ul>
> +        </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.4/index.html">Apache Tika 1.4</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.3/index.html">Apache Tika 1.3</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.2/index.html">Apache Tika 1.2</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.1/index.html">Apache Tika 1.1</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.0/index.html">Apache Tika 1.0</a>
> +                </li>
> +          </ul>
> +              <h5>The Apache Software Foundation</h5>
> +            <ul>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/foundation/" 
> class="externalLink">About</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/licenses/" 
> class="externalLink">License</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/security/" 
> class="externalLink">Security</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a 
> href="http://www.apache.org/foundation/sponsorship.html" 
> class="externalLink">Sponsorship</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/foundation/thanks.html" 
> class="externalLink">Thanks</a>
> +          </li>
> +          </ul>
> +      
> +          <div id="search">
> +            <h5>Search with Apache Solr</h5>
> +            <form action="http://search.lucidimagination.com/p:tika"
> +                  method="get" id="searchform">
> +              <input type="text" id="query" name="q"/>
> +              <select name="searchProvider" id="searchProvider">
> +                <option value="any">provider</option>
> +                <option value="lucid">Lucid Find</option>
> +                <option value="sl">Search-Lucene</option>
> +              </select>
> +              <input type="submit" id="submit" value="Search" name="Search"
> +                     onclick="selectProvider(this.form)"/>
> +            </form>
> +          </div>
> +
> +          <div id="bookpromo">
> +            <h5>Books about Tika</h5>
> +            <p>
> +              <a href="http://manning.com/mattmann/" title="Tika in Action"
> +                ><img src="../mattmann_cover150.jpg"
> +                      width="150" height="186"/></a>
> +            </p>
> +          </div>
> +        </div>
> +      </div>
> +      <div id="footer">
> +        <p>
> +          Copyright &#169; 2014
> +          <a href="http://www.apache.org/">The Apache Software 
> Foundation</a>.
> +          Site powered by <a href="http://maven.apache.org/">Apache 
> Maven</a>. 
> +          Search powered by
> +          <a href="http://www.lucidimagination.com">Lucid Imagination</a>
> +          and <a href="http://sematext.com">Sematext</a>.
> +          <br/>
> +          Apache Tika, Tika, Apache, the Apache feather logo, and the Apache
> +          Tika project logo are trademarks of The Apache Software Foundation.
> +        </p>
> +      </div>
> +    </div>
> +  </body>
> +</html>
> 
> Modified: tika/site/publish/1.6/formats.html
> URL: 
> http://svn.apache.org/viewvc/tika/site/publish/1.6/formats.html?rev=1622762&r1=1622761&r2=1622762&view=diff
> ==============================================================================
> --- tika/site/publish/1.6/formats.html (original)
> +++ tika/site/publish/1.6/formats.html Fri Sep  5 19:14:58 2014
> @@ -110,7 +110,9 @@
> <li><a href="#Mail_formats">Mail formats</a></li>
> <li><a href="#CAD_formats">CAD formats</a></li>
> <li><a href="#Font_formats">Font formats</a></li>
> -<li><a href="#Executable_programs_and_libraries">Executable programs and 
> libraries</a></li></ul></li></ul>
> +<li><a href="#Scientific_formats">Scientific formats</a></li>
> +<li><a href="#Executable_programs_and_libraries">Executable programs and 
> libraries</a></li>
> +<li><a href="#Crypto_formats">Crypto formats</a></li></ul></li></ul>
> <div class="section">
> <h3><a name="HyperText_Markup_Language">HyperText Markup Language</a></h3>
> <p>The HyperText Markup Language (HTML) is the lingua franca of the web. Tika 
> uses the <a class="externalLink" 
> href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> library to 
> support virtually any kind of HTML found on the web. The output from the <a 
> href="./api/org/apache/tika/parser/html/HtmlParser.html">HtmlParser</a> class 
> is guaranteed to be well-formed and valid XHTML, and various heuristics are 
> used to prevent things like inline scripts from cluttering the extracted text 
> content.</p></div>
> @@ -131,7 +133,8 @@
> <p>The <a 
> href="./api/org/apache/tika/parser/pdf/PDFParser.html">PDFParser</a> class 
> parses Portable Document Format (PDF) documents using the <a 
> class="externalLink" href="http://pdfbox.apache.org/">Apache PDFBox</a> 
> library.</p></div>
> <div class="section">
> <h3><a name="Electronic_Publication_Format">Electronic Publication 
> Format</a></h3>
> -<p>The <a 
> href="./api/org/apache/tika/parser/epub/EpubParser.html">EpubParser</a> class 
> supports the Electronic Publication Format (EPUB) used for many digital 
> books.</p></div>
> +<p>The <a 
> href="./api/org/apache/tika/parser/epub/EpubParser.html">EpubParser</a> class 
> supports the Electronic Publication Format (EPUB) used for many digital 
> books.</p>
> +<p>The <a 
> href="./api/org/apache/tika/parser/xml/FictionBookParser.html">FictionBookParser</a>
>  class supports the xml-based Fiction Book publishing format.</p></div>
> <div class="section">
> <h3><a name="Rich_Text_Format">Rich Text Format</a></h3>
> <p>The <a 
> href="./api/org/apache/tika/parser/rtf/RTFParser.html">RTFParser</a> class 
> uses the standard javax.swing.text.rtf feature to extract text content from 
> Rich Text Format (RTF) documents.</p></div>
> @@ -143,7 +146,8 @@
> <p>Extracting text content from plain text files seems like a simple task 
> until you start thinking of all the possible character encodings. The <a 
> href="./api/org/apache/tika/parser/txt/TXTParser.html">TXTParser</a> class 
> uses encoding detection code from the <a class="externalLink" 
> href="http://site.icu-project.org/">ICU</a> project to automatically detect 
> the character encoding of a text document.</p></div>
> <div class="section">
> <h3><a name="Feed_and_Syndication_formats">Feed and Syndication 
> formats</a></h3>
> -<p>The <a 
> href="./api/org/apache/tika/parser/feed/FeedParser.html">FeedParser</a> class 
> supports the RSS and Atom feed syndication formats.</p></div>
> +<p>The <a 
> href="./api/org/apache/tika/parser/feed/FeedParser.html">FeedParser</a> class 
> supports the RSS and Atom feed syndication formats.</p>
> +<p>The <a 
> href="./api/org/apache/tika/parser/iptc/IptcAnpaParser.html">IptcAnpaParser</a>
>  class supports the IPTC ANPA News Wire feed format.</p></div>
> <div class="section">
> <h3><a name="Help_formats">Help formats</a></h3>
> <p>The <a 
> href="./api/org/apache/tika/parser/chm/ChmParser.html">ChmParser</a> class 
> supports the CHM Help format.</p></div>
> @@ -167,6 +171,7 @@
> <div class="section">
> <h3><a name="Mail_formats">Mail formats</a></h3>
> <p>The <a 
> href="./api/org/apache/tika/parser/mbox/MboxParser.html">MboxParser</a> can 
> extract email messages from the mbox format used by many email archives and 
> Unix-style mailboxes.</p>
> +<p>The <a 
> href="./api/org/apache/tika/parser/mail/RFC822Parser.html">RFC822Parser</a> 
> can process single email messages in the RFC 822 format used by many email 
> clients in their archives / exports.</p>
> <p>The <a 
> href="./api/org/apache/tika/parser/mbox/PSTParser.html">PSTParser</a> can 
> extract email messages from the Microsoft Outlook PST email format.</p></div>
> <div class="section">
> <h3><a name="CAD_formats">CAD formats</a></h3>
> @@ -175,8 +180,16 @@
> <h3><a name="Font_formats">Font formats</a></h3>
> <p>The <a 
> href="./api/org/apache/tika/parser/font/TrueTypeParser.html">TrueTypeParser</a>
>  class can extract simple metadata from the TrueType font format. The <a 
> href="./api/org/apache/tika/parser/font/AdobeFontMetricParser.html">AdobeFontMetricParser</a>
>  class does something similar for Adobe Font Metrics files.</p></div>
> <div class="section">
> +<h3><a name="Scientific_formats">Scientific formats</a></h3>
> +<p>The <a 
> href="./api/org/apache/tika/parser/hdf/HDFParser.html">HDFParser</a> is able 
> to extract attribute metadata from the HDF scientific file format.</p>
> +<p>The <a 
> href="./api/org/apache/tika/parser/netcdf/NetCDFParser.html">NetCDFParser</a> 
> is able to extract attribute metadata from the NetCDF scientific file 
> format.</p>
> +<p>The <a 
> href="./api/org/apache/tika/parser/mat/MatParser.html">MatParser</a> is able 
> to extract attribute metadata from the Matlab scientific file 
> format.</p></div>
> +<div class="section">
> <h3><a name="Executable_programs_and_libraries">Executable programs and 
> libraries</a></h3>
> -<p>The <a 
> href="./api/org/apache/tika/parser/executable/ExecutableParser.html">ExecutableParser</a>
>  can extract metadata information on platforms, architectures and types from 
> a range of executable formats and libraries, such as Windows Executables and 
> Linux / BSD programs and libraries.</p></div></div>
> +<p>The <a 
> href="./api/org/apache/tika/parser/executable/ExecutableParser.html">ExecutableParser</a>
>  can extract metadata information on platforms, architectures and types from 
> a range of executable formats and libraries, such as Windows Executables and 
> Linux / BSD programs and libraries.</p></div>
> +<div class="section">
> +<h3><a name="Crypto_formats">Crypto formats</a></h3>
> +<p>The <a 
> href="./api/org/apache/tika/parser/crypto/Pkcs7Parser.html">Pkcs7Parser</a> 
> is able to parse the contents of PKCS7 signed messages, but doesn't include 
> any information from the outer PKCS7 wrapper.</p></div></div>
> <div class="section">
> <h2>Full list of supported formats:<a 
> name="Full_list_of_supported_formats:"></a></h2>
> <ul>
> @@ -270,6 +283,9 @@
> <li>org.apache.tika.parser.mail.<a 
> href="./api/org/apache/tika/parser/mail/RFC822Parser">RFC822Parser</a>
> <ul>
> <li>message/rfc822</li></ul></li>
> +<li>org.apache.tika.parser.mat.<a 
> href="./api/org/apache/tika/parser/mat/MatParser">MatParser</a>
> +<ul>
> +<li>application/x-matlab-data</li></ul></li>
> <li>org.apache.tika.parser.mbox.<a 
> href="./api/org/apache/tika/parser/mbox/MboxParser">MboxParser</a>
> <ul>
> <li>application/mbox</li></ul></li>
> 
> Added: tika/site/publish/1.6/gettingstarted.html
> URL: 
> http://svn.apache.org/viewvc/tika/site/publish/1.6/gettingstarted.html?rev=1622762&view=auto
> ==============================================================================
> --- tika/site/publish/1.6/gettingstarted.html (added)
> +++ tika/site/publish/1.6/gettingstarted.html Fri Sep  5 19:14:58 2014
> @@ -0,0 +1,413 @@
> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> +          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> +
> +<!--
> +  Licensed to the Apache Software Foundation (ASF) under one
> +  or more contributor license agreements.  See the NOTICE file
> +  distributed with this work for additional information
> +  regarding copyright ownership.  The ASF licenses this file
> +  to you under the Apache License, Version 2.0 (the
> +  "License"); you may not use this file except in compliance
> +  with the License.  You may obtain a copy of the License at
> +
> +    http://www.apache.org/licenses/LICENSE-2.0
> + 
> +  Unless required by applicable law or agreed to in writing,
> +  software distributed under the License is distributed on an
> +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> +  KIND, either express or implied.  See the License for the
> +  specific language governing permissions and limitations
> +  under the License.
> +-->
> +
> +
> +
> +
> +
> +
> +
> +<html xmlns="http://www.w3.org/1999/xhtml">
> +  <head>
> +    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
> +    <title>Apache Tika - Getting Started with Apache Tika</title>
> +    <style type="text/css" media="all">
> +      @import url("../css/site.css");
> +    </style>
> +    <link rel="icon" type="image/png" href="../tikaNoText16.png" />
> +    <script type="text/javascript">
> +      function selectProvider(form) {
> +        provider = form.elements['searchProvider'].value;
> +        if (provider == "any") {
> +          if (Math.random() > 0.5) {
> +            provider = "lucid";
> +          } else {
> +            provider = "sl";
> +          }
> +        }
> +        if (provider == "lucid") {
> +          form.action = "http://find.searchhub.org/p:tika";
> +        } else if (provider == "sl") {
> +          form.action = "http://search-lucene.com/tika";
> +        }
> +        days = 90;
> +        date = new Date();
> +        date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
> +        expires = "; expires=" + date.toGMTString();
> +        document.cookie = "searchProvider=" + provider + expires + "; 
> path=/";
> +      }
> +      function initProvider() {
> +        if (document.cookie.length>0) {
> +          cStart=document.cookie.indexOf("searchProvider=");
> +          if (cStart!=-1) {
> +            cStart=cStart + "searchProvider=".length;
> +            cEnd=document.cookie.indexOf(";", cStart);
> +            if (cEnd==-1) {
> +              cEnd=document.cookie.length;
> +            }
> +            provider = unescape(document.cookie.substring(cStart,cEnd));
> +            document.forms['searchform'].elements['searchProvider'].value = 
> provider;
> +          }
> +        }
> +        document.forms['searchform'].elements['q'].focus();
> +      }
> +    </script>
> +  </head>
> +  <body onLoad="initProvider();">
> +    <div id="body">
> +      <div id="banner">
> +        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
> +          ><img src="http://tika.apache.org/tika.png" alt="Apache Tika"
> +                width="292" height="100"/></a>
> +        <a href="http://www.apache.org/" id="bannerRight"
> +           title="The Apache Software Foundation"
> +          ><img src="http://tika.apache.org/asf-logo.gif" alt="The Apache 
> Software Foundation"
> +                width="387" height="100"/></a>
> +      </div>
> +      <div id="content">
> +        <!-- Licensed to the Apache Software Foundation (ASF) under one or 
> more --><!-- contributor license agreements.  See the NOTICE file distributed 
> with --><!-- this work for additional information regarding copyright 
> ownership. --><!-- The ASF licenses this file to You under the Apache 
> License, Version 2.0 --><!-- (the "License"); you may not use this file 
> except in compliance with --><!-- the License.  You may obtain a copy of the 
> License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 
> --><!--  --><!-- Unless required by applicable law or agreed to in writing, 
> software --><!-- distributed under the License is distributed on an "AS IS" 
> BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express 
> or implied. --><!-- See the License for the specific language governing 
> permissions and --><!-- limitations under the License. --><div 
> class="section">
> +<h2>Getting Started with Apache Tika<a 
> name="Getting_Started_with_Apache_Tika"></a></h2>
> +<p>This document describes how to build Apache Tika from sources and how to 
> start using Tika in an application.</p></div>
> +<div class="section">
> +<h2>Getting and building the sources<a 
> name="Getting_and_building_the_sources"></a></h2>
> +<p>To build Tika from sources you first need to either <a 
> href="../download.html">download</a> a source release or <a 
> href="../source-repository.html">checkout</a> the latest sources from version 
> control.</p>
> +<p>Once you have the sources, you can build them using the <a 
> class="externalLink" href="http://maven.apache.org/">Maven 2</a> build 
> system. Executing the following command in the base directory will build the 
> sources and install the resulting artifacts in your local Maven 
> repository.</p>
> +<div>
> +<pre>mvn install</pre></div>
> +<p>See the Maven documentation for more information about the available 
> build options.</p>
> +<p>Note that you need Java 6 or higher to build Tika.</p></div>
> +<div class="section">
> +<h2>Build artifacts<a name="Build_artifacts"></a></h2>
> +<p>The Tika build consists of a number of components and produces the 
> following main binaries:</p>
> +<dl>
> +<dt>tika-core/target/tika-core-*.jar</dt>
> +<dd> Tika core library. Contains the core interfaces and classes of Tika, 
> but none of the parser implementations. Depends only on Java 6.</dd>
> +<dt>tika-parsers/target/tika-parsers-*.jar</dt>
> +<dd> Tika parsers. Collection of classes that implement the Tika Parser 
> interface based on various external parser libraries.</dd>
> +<dt>tika-app/target/tika-app-*.jar</dt>
> +<dd> Tika application. Combines the above components and all the external 
> parser libraries into a single runnable jar with a GUI and a command line 
> interface.</dd>
> +<dt>tika-bundle/target/tika-bundle-*.jar</dt>
> +<dd> Tika bundle. An OSGi bundle that combines tika-parsers with 
> non-OSGified parser libraries to make them easy to deploy in an OSGi 
> environment.</dd></dl></div>
> +<div class="section">
> +<h2>Using Tika as a Maven dependency<a 
> name="Using_Tika_as_a_Maven_dependency"></a></h2>
> +<p>The core library, tika-core, contains the key interfaces and classes of 
> Tika and can be used by itself if you don't need the full set of parsers from 
> the tika-parsers component. The tika-core dependency looks like this:</p>
> +<div>
> +<pre>  &lt;dependency&gt;
> +    &lt;groupId&gt;org.apache.tika&lt;/groupId&gt;
> +    &lt;artifactId&gt;tika-core&lt;/artifactId&gt;
> +    &lt;version&gt;...&lt;/version&gt;
> +  &lt;/dependency&gt;</pre></div>
> +<p>If you want to use Tika to parse documents (instead of simply detecting 
> document types, etc.), you'll want to depend on tika-parsers instead: </p>
> +<div>
> +<pre>  &lt;dependency&gt;
> +    &lt;groupId&gt;org.apache.tika&lt;/groupId&gt;
> +    &lt;artifactId&gt;tika-parsers&lt;/artifactId&gt;
> +    &lt;version&gt;...&lt;/version&gt;
> +  &lt;/dependency&gt;</pre></div>
> +<p>Note that adding this dependency will introduce a number of transitive 
> dependencies to your project, including one on tika-core. You need to make 
> sure that these dependencies won't conflict with your existing project 
> dependencies. You can use the following command in the tika-parsers directory 
> to get a full listing of all the dependencies.</p>
> +<div>
> +<pre>$ mvn dependency:tree | grep :compile</pre></div></div>
> +<div class="section">
> +<h2>Using Tika in an Ant project<a 
> name="Using_Tika_in_an_Ant_project"></a></h2>
> +<p>Unless you use a dependency manager tool like <a class="externalLink" 
> href="http://ant.apache.org/ivy/">Apache Ivy</a>, the easiest way to use Tika 
> is to include either the tika-core or the tika-app jar in your classpath, 
> depending on whether you want just the core functionality or also all the 
> parser implementations.</p>
> +<div>
> +<pre>&lt;classpath&gt;
> +  ... &lt;!-- your other classpath entries --&gt;
> +
> +  &lt;!-- either: --&gt;
> +  &lt;pathelement 
> location=&quot;path/to/tika-core-${tika.version}.jar&quot;/&gt;
> +  &lt;!-- or: --&gt;
> +  &lt;pathelement 
> location=&quot;path/to/tika-app-${tika.version}.jar&quot;/&gt;
> +
> +&lt;/classpath&gt;</pre></div></div>
> +<div class="section">
> +<h2>Using Tika as a command line utility<a 
> name="Using_Tika_as_a_command_line_utility"></a></h2>
> +<p>The Tika application jar (tika-app-*.jar) can be used as a command line 
> utility for extracting text content and metadata from all sorts of files. 
> This runnable jar contains all the dependencies it needs, so you don't need 
> to worry about classpath settings to run it.</p>
> +<p>The usage instructions are shown below.</p>
> +<div>
> +<pre>usage: java -jar tika-app.jar [option...] [file|port...]
> +
> +Options:
> +    -?  or --help          Print this usage message
> +    -v  or --verbose       Print debug level messages
> +    -V  or --version       Print the Apache Tika version number
> +
> +    -g  or --gui           Start the Apache Tika GUI
> +    -s  or --server        Start the Apache Tika server
> +    -f  or --fork          Use Fork Mode for out-of-process extraction
> +
> +    -x  or --xml           Output XHTML content (default)
> +    -h  or --html          Output HTML content
> +    -t  or --text          Output plain text content
> +    -T  or --text-main     Output plain text content (main content only)
> +    -m  or --metadata      Output only metadata
> +    -j  or --json          Output metadata in JSON
> +    -y  or --xmp           Output metadata in XMP
> +    -l  or --language      Output only language
> +    -d  or --detect        Detect document type
> +    -eX or --encoding=X    Use output encoding X
> +    -pX or --password=X    Use document password X
> +    -z  or --extract       Extract all attachements into current directory
> +    --extract-dir=&lt;dir&gt;    Specify target directory for -z
> +    -r  or --pretty-print  For XML and XHTML outputs, adds newlines and
> +                           whitespace, for better readability
> +
> +    --create-profile=X
> +         Create NGram profile, where X is a profile name
> +    --list-parsers
> +         List the available document parsers
> +    --list-parser-details
> +         List the available document parsers, and their supported mime types
> +    --list-detectors
> +         List the available document detectors
> +    --list-met-models
> +         List the available metadata models, and their supported keys
> +    --list-supported-types
> +         List all known media types and related information
> +
> +Description:
> +    Apache Tika will parse the file(s) specified on the
> +    command line and output the extracted text content
> +    or metadata to standard output.
> +
> +    Instead of a file name you can also specify the URL
> +    of a document to be parsed.
> +
> +    If no file name or URL is specified (or the special
> +    name &quot;-&quot; is used), then the standard input stream
> +    is parsed. If no arguments were given and no input
> +    data is available, the GUI is started instead.
> +
> +- GUI mode
> +
> +    Use the &quot;--gui&quot; (or &quot;-g&quot;) option to start the
> +    Apache Tika GUI. You can drag and drop files from
> +    a normal file explorer to the GUI window to extract
> +    text content and metadata from the files.
> +
> +- Server mode
> +
> +    Use the &quot;--server&quot; (or &quot;-s&quot;) option to start the
> +    Apache Tika server. The server will listen to the
> +    ports you specify as one or more arguments.</pre></div>
> +<p>You can also use the jar as a component in a Unix pipeline or as an 
> external tool in many scripting languages.</p>
> +<div>
> +<pre># Check if an Internet resource contains a specific keyword
> +curl http://.../document.doc \
> +  | java -jar tika-app.jar --text \
> +  | grep -q keyword</pre></div></div>
> +      </div>
> +      <div id="sidebar">
> +        <div id="navigation">
> +                    <h5>Apache Tika</h5>
> +            <ul>
> +              
> +    <li class="none">
> +                    <a href="../index.html">Introduction</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../download.html">Download</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../contribute.html">Contribute</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../mail-lists.html">Mailing Lists</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://wiki.apache.org/tika/" 
> class="externalLink">Tika Wiki</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="https://issues.apache.org/jira/browse/TIKA" 
> class="externalLink">Issue Tracker</a>
> +          </li>
> +          </ul>
> +              <h5>Documentation</h5>
> +            <ul>
> +              
> +          
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="expanded">
> +                    <a href="../1.5/index.html">Apache Tika 1.5</a>
> +                  <ul>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/gettingstarted.html">Getting Started</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/formats.html">Supported Formats</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/parser.html">Parser API</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/parser_guide.html">Parser 5min Quick 
> Start Guide</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/detection.html">Content and Language 
> Detection</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/api/">API Documentation</a>
> +          </li>
> +              </ul>
> +        </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.4/index.html">Apache Tika 1.4</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.3/index.html">Apache Tika 1.3</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.2/index.html">Apache Tika 1.2</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.1/index.html">Apache Tika 1.1</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.0/index.html">Apache Tika 1.0</a>
> +                </li>
> +          </ul>
> +              <h5>The Apache Software Foundation</h5>
> +            <ul>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/foundation/" 
> class="externalLink">About</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/licenses/" 
> class="externalLink">License</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/security/" 
> class="externalLink">Security</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a 
> href="http://www.apache.org/foundation/sponsorship.html" 
> class="externalLink">Sponsorship</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/foundation/thanks.html" 
> class="externalLink">Thanks</a>
> +          </li>
> +          </ul>
> +      
> +          <div id="search">
> +            <h5>Search with Apache Solr</h5>
> +            <form action="http://search.lucidimagination.com/p:tika"
> +                  method="get" id="searchform">
> +              <input type="text" id="query" name="q"/>
> +              <select name="searchProvider" id="searchProvider">
> +                <option value="any">provider</option>
> +                <option value="lucid">Lucid Find</option>
> +                <option value="sl">Search-Lucene</option>
> +              </select>
> +              <input type="submit" id="submit" value="Search" name="Search"
> +                     onclick="selectProvider(this.form)"/>
> +            </form>
> +          </div>
> +
> +          <div id="bookpromo">
> +            <h5>Books about Tika</h5>
> +            <p>
> +              <a href="http://manning.com/mattmann/" title="Tika in Action"
> +                ><img src="../mattmann_cover150.jpg"
> +                      width="150" height="186"/></a>
> +            </p>
> +          </div>
> +        </div>
> +      </div>
> +      <div id="footer">
> +        <p>
> +          Copyright &#169; 2014
> +          <a href="http://www.apache.org/">The Apache Software 
> Foundation</a>.
> +          Site powered by <a href="http://maven.apache.org/">Apache 
> Maven</a>. 
> +          Search powered by
> +          <a href="http://www.lucidimagination.com">Lucid Imagination</a>
> +          and <a href="http://sematext.com">Sematext</a>.
> +          <br/>
> +          Apache Tika, Tika, Apache, the Apache feather logo, and the Apache
> +          Tika project logo are trademarks of The Apache Software Foundation.
> +        </p>
> +      </div>
> +    </div>
> +  </body>
> +</html>
> 
> Added: tika/site/publish/1.6/parser.html
> URL: 
> http://svn.apache.org/viewvc/tika/site/publish/1.6/parser.html?rev=1622762&view=auto
> ==============================================================================
> --- tika/site/publish/1.6/parser.html (added)
> +++ tika/site/publish/1.6/parser.html Fri Sep  5 19:14:58 2014
> @@ -0,0 +1,372 @@
> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> +          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> +
> +<!--
> +  Licensed to the Apache Software Foundation (ASF) under one
> +  or more contributor license agreements.  See the NOTICE file
> +  distributed with this work for additional information
> +  regarding copyright ownership.  The ASF licenses this file
> +  to you under the Apache License, Version 2.0 (the
> +  "License"); you may not use this file except in compliance
> +  with the License.  You may obtain a copy of the License at
> +
> +    http://www.apache.org/licenses/LICENSE-2.0
> + 
> +  Unless required by applicable law or agreed to in writing,
> +  software distributed under the License is distributed on an
> +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> +  KIND, either express or implied.  See the License for the
> +  specific language governing permissions and limitations
> +  under the License.
> +-->
> +
> +
> +
> +
> +
> +
> +
> +<html xmlns="http://www.w3.org/1999/xhtml">
> +  <head>
> +    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
> +    <title>Apache Tika - The Parser interface</title>
> +    <style type="text/css" media="all">
> +      @import url("../css/site.css");
> +    </style>
> +    <link rel="icon" type="image/png" href="../tikaNoText16.png" />
> +    <script type="text/javascript">
> +      function selectProvider(form) {
> +        provider = form.elements['searchProvider'].value;
> +        if (provider == "any") {
> +          if (Math.random() > 0.5) {
> +            provider = "lucid";
> +          } else {
> +            provider = "sl";
> +          }
> +        }
> +        if (provider == "lucid") {
> +          form.action = "http://find.searchhub.org/p:tika";
> +        } else if (provider == "sl") {
> +          form.action = "http://search-lucene.com/tika";
> +        }
> +        days = 90;
> +        date = new Date();
> +        date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
> +        expires = "; expires=" + date.toGMTString();
> +        document.cookie = "searchProvider=" + provider + expires + "; 
> path=/";
> +      }
> +      function initProvider() {
> +        if (document.cookie.length>0) {
> +          cStart=document.cookie.indexOf("searchProvider=");
> +          if (cStart!=-1) {
> +            cStart=cStart + "searchProvider=".length;
> +            cEnd=document.cookie.indexOf(";", cStart);
> +            if (cEnd==-1) {
> +              cEnd=document.cookie.length;
> +            }
> +            provider = unescape(document.cookie.substring(cStart,cEnd));
> +            document.forms['searchform'].elements['searchProvider'].value = 
> provider;
> +          }
> +        }
> +        document.forms['searchform'].elements['q'].focus();
> +      }
> +    </script>
> +  </head>
> +  <body onLoad="initProvider();">
> +    <div id="body">
> +      <div id="banner">
> +        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
> +          ><img src="http://tika.apache.org/tika.png" alt="Apache Tika"
> +                width="292" height="100"/></a>
> +        <a href="http://www.apache.org/" id="bannerRight"
> +           title="The Apache Software Foundation"
> +          ><img src="http://tika.apache.org/asf-logo.gif" alt="The Apache 
> Software Foundation"
> +                width="387" height="100"/></a>
> +      </div>
> +      <div id="content">
> +        <!-- Licensed to the Apache Software Foundation (ASF) under one or 
> more --><!-- contributor license agreements.  See the NOTICE file distributed 
> with --><!-- this work for additional information regarding copyright 
> ownership. --><!-- The ASF licenses this file to You under the Apache 
> License, Version 2.0 --><!-- (the "License"); you may not use this file 
> except in compliance with --><!-- the License.  You may obtain a copy of the 
> License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 
> --><!--  --><!-- Unless required by applicable law or agreed to in writing, 
> software --><!-- distributed under the License is distributed on an "AS IS" 
> BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express 
> or implied. --><!-- See the License for the specific language governing 
> permissions and --><!-- limitations under the License. --><div 
> class="section">
> +<h2>The Parser interface<a name="The_Parser_interface"></a></h2>
> +<p>The <a 
> href="./api/org/apache/tika/parser/Parser.html">org.apache.tika.parser.Parser</a>
>  interface is the key concept of Apache Tika. It hides the complexity of 
> different file formats and parsing libraries while providing a simple and 
> powerful mechanism for client applications to extract structured text content 
> and metadata from all sorts of documents. All this is achieved with a single 
> method:</p>
> +<div>
> +<pre>void parse(
> +    InputStream stream, ContentHandler handler, Metadata metadata,
> +    ParseContext context) throws IOException, SAXException, 
> TikaException;</pre></div>
> +<p>The <tt>parse</tt> method takes the document to be parsed and related 
> metadata as input and outputs the results as XHTML SAX events and extra 
> metadata. The parse context argument is used to specify context information 
> (like the current locale) that is not related to any individual document. The 
> main criteria that lead to this design were:</p>
> +<dl>
> +<dt>Streamed parsing</dt>
> +<dd>The interface should require neither the client application nor the 
> parser implementation to keep the full document content in memory or spooled 
> to disk. This allows even huge documents to be parsed without excessive 
> resource requirements.</dd>
> +<dt>Structured content</dt>
> +<dd>A parser implementation should be able to include structural information 
> (headings, links, etc.) in the extracted content. A client application can 
> use this information for example to better judge the relevance of different 
> parts of the parsed document.</dd>
> +<dt>Input metadata</dt>
> +<dd>A client application should be able to include metadata like the file 
> name or declared content type with the document to be parsed. The parser 
> implementation can use this information to better guide the parsing 
> process.</dd>
> +<dt>Output metadata</dt>
> +<dd>A parser implementation should be able to return document metadata in 
> addition to document content. Many document formats contain metadata like the 
> name of the author that may be useful to client applications.</dd>
> +<dt>Context sensitivity</dt>
> +<dd>While the default settings and behaviour of Tika parsers should work 
> well for most use cases, there are still situations where more fine-grained 
> control over the parsing process is desirable. It should be easy to inject 
> such context-specific information to the parsing process without breaking the 
> layers of abstraction.</dd></dl>
> +<p>These criteria are reflected in the arguments of the <tt>parse</tt> 
> method.</p>
> +<div class="section">
> +<h3>Document input stream<a name="Document_input_stream"></a></h3>
> +<p>The first argument is an <a class="externalLink" 
> href="http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html">InputStream</a>
>  for reading the document to be parsed.</p>
> +<p>If this document stream cannot be read, then parsing stops and the 
> thrown <a class="externalLink" 
> href="http://docs.oracle.com/javase/6/docs/api/java/io/IOException.html">IOException</a>
>  is passed up to the client application. If the stream can be read but not 
> parsed (for example if the document is corrupted), then the parser throws a 
> <a 
> href="./api/org/apache/tika/exception/TikaException.html">TikaException</a>.</p>
> +<p>The parser implementation will consume this stream but <i>will not close 
> it</i>. Closing the stream is the responsibility of the client application 
> that opened it in the first place. The recommended pattern for using streams 
> with the <tt>parse</tt> method is:</p>
> +<div>
> +<pre>InputStream stream = ...;      // open the stream
> +try {
> +    parser.parse(stream, ...); // parse the stream
> +} finally {
> +    stream.close();            // close the stream
> +}</pre></div>
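
Inline note on the stream pattern quoted just above: a minimal sketch of the same ownership contract, using only JDK classes. The consume method is a hypothetical stand-in for a parser and is not part of Tika; TrackingStream is likewise only an illustration helper. It assumes nothing about Tika beyond what the page says: the consumer reads the stream but leaves closing to the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamOwnershipSketch {

    /** Hypothetical stand-in for a parser: reads to the end, never closes the stream. */
    static int consume(InputStream stream) throws IOException {
        int count = 0;
        while (stream.read() != -1) {
            count++;
        }
        return count;
    }

    /** Illustration helper: a stream that records whether close() was called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    public static void main(String[] args) throws IOException {
        TrackingStream stream = new TrackingStream(new byte[] {1, 2, 3}); // open the stream
        try {
            int n = consume(stream);                                      // consume the stream
            System.out.println("bytes read: " + n + ", closed by consumer: " + stream.closed);
        } finally {
            stream.close();                                               // caller closes it
        }
        System.out.println("closed by caller: " + stream.closed);
    }
}
```

On Java 7 or later the try/finally could be replaced by try-with-resources; the explicit form is kept here because the page targets Java 6.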
> +<p>Some document formats like the OLE2 Compound Document Format used by 
> Microsoft Office are best parsed as random access files. In such cases the 
> content of the input stream is automatically spooled to a temporary file that 
> gets removed once parsed. A future version of Tika may make it possible to 
> avoid this extra file if the input document is already a file in the local 
> file system. See <a class="externalLink" 
> href="https://issues.apache.org/jira/browse/TIKA-153">TIKA-153</a> for the 
> status of this feature request.</p></div>
> +<div class="section">
> +<h3>XHTML SAX events<a name="XHTML_SAX_events"></a></h3>
> +<p>The parsed content of the document stream is returned to the client 
> application as a sequence of XHTML SAX events. XHTML is used to express 
> structured content of the document and SAX events enable streamed processing. 
> Note that the XHTML format is used here only to convey structural 
> information, not to render the documents for browsing!</p>
> +<p>The XHTML SAX events produced by the parser implementation are sent to a 
> <a class="externalLink" 
> href="http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html">ContentHandler</a>
>  instance given to the <tt>parse</tt> method. If the content handler 
> fails to process an event, then parsing stops and the thrown <a 
> class="externalLink" 
> href="http://docs.oracle.com/javase/6/docs/api/org/xml/sax/SAXException.html">SAXException</a>
>  is passed up to the client application.</p>
> +<p>The overall structure of the generated event stream is (with indenting 
> added for clarity):</p>
> +<div>
> +<pre>&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;
> +  &lt;head&gt;
> +    &lt;title&gt;...&lt;/title&gt;
> +  &lt;/head&gt;
> +  &lt;body&gt;
> +    ...
> +  &lt;/body&gt;
> +&lt;/html&gt;</pre></div>
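
Inline note on the event stream shape quoted just above: a rough sketch of how a client-side handler might pick the body text out of such a stream. It deliberately uses only JDK classes. The JDK SAX parser stands in for a Tika parser as the source of events, and BodyTextHandler and extractBodyText are hypothetical names for illustration, not Tika API. Tika's own utility classes would normally do this job.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextSketch {

    /** Collects character data that occurs inside the XHTML body element. */
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // greater than zero while inside <body>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Feeds an XHTML string through the JDK SAX parser and returns the body text. */
    static String extractBodyText(String xhtml)
            throws IOException, SAXException, ParserConfigurationException {
        InputStream stream = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
        try {
            BodyTextHandler handler = new BodyTextHandler();
            // The default factory is not namespace-aware, so qName carries the element name.
            SAXParserFactory.newInstance().newSAXParser().parse(stream, handler);
            return handler.getText();
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Example</title></head>"
                + "<body><h1>Heading</h1><p>Some text.</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

The handler ignores the title in the head and keeps only text nested under body, which is the same split the page describes between document metadata and document content.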
> +<p>Parser implementations typically use the <a 
> href="./apidocs/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a>
>  utility class to generate the XHTML output.</p>
> +<p>Dealing with the raw SAX events can be a bit complex, so Apache Tika 
> comes with a number of utility classes that can be used to process and 
> convert the event stream to other representations.</p>
> +<p>For example, the <a 
> href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a>
>  class can be used to extract just the body part of the XHTML output and feed 
> it either as SAX events to another content handler or as characters to an 
> output stream, a writer, or simply a string. The following code snippet 
> parses a document from the standard input stream and outputs the extracted 
> text content to standard output:</p>
> +<div>
> +<pre>ContentHandler handler = new BodyContentHandler(System.out);
> +parser.parse(System.in, handler, ...);</pre></div>
> +<p>Another useful class is <a 
> href="./api/org/apache/tika/parser/ParsingReader.html">ParsingReader</a> that 
> uses a background thread to parse the document and returns the extracted text 
> content as a character stream:</p>
> +<div>
> +<pre>InputStream stream = ...; // the document to be parsed
> +Reader reader = new ParsingReader(parser, stream, ...);
> +try {
> +    ...;                  // read the document text using the reader
> +} finally {
> +    reader.close();       // the document stream is closed automatically
> +}</pre></div></div>
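
The body-extraction idea described above for BodyContentHandler can be sketched with JDK SAX classes alone. This is a minimal illustration, not Tika's implementation: the class names BodyTextSketch and BodyTextHandler and the method extractBodyText are hypothetical, and the XHTML input is parsed from a string instead of coming from a Tika parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextSketch {

    /** Hypothetical handler: keeps only character data found inside <body>. */
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // greater than zero while inside <body>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(localName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Parses an XHTML string and returns the text content of its body. */
    public static String extractBodyText(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName is populated
            BodyTextHandler handler = new BodyTextHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Ignored</title></head>"
                + "<body><p>Hello, Tika</p></body></html>";
        System.out.println(extractBodyText(xhtml)); // prints "Hello, Tika"
    }
}
```

The title text is dropped because it arrives before the body element opens, which mirrors the "just the body part" behaviour the page describes.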
> +<div class="section">
> +<h3>Document metadata<a name="Document_metadata"></a></h3>
> +<p>The third argument to the <tt>parse</tt> method is used to pass document 
> metadata both in and out of the parser. Document metadata is expressed as an 
> <a href="./api/org/apache/tika/metadata/Metadata.html">Metadata</a> 
> object.</p>
> +<p>The following are some of the more interesting metadata properties:</p>
> +<dl>
> +<dt>Metadata.RESOURCE_NAME_KEY</dt>
> +<dd>The name of the file or resource that contains the document.
> +<p>A client application can set this property to allow the parser to use 
> file name heuristics to determine the format of the document.</p>
> +<p>The parser implementation may set this property if the file format 
> contains the canonical name of the file (for example the Gzip format has a 
> slot for the file name).</p></dd>
> +<dt>Metadata.CONTENT_TYPE</dt>
> +<dd>The declared content type of the document.
> +<p>A client application can set this property based on, for example, an HTTP 
> Content-Type header. The declared content type may help the parser to 
> correctly interpret the document.</p>
> +<p>The parser implementation sets this property to the content type 
> according to which the document was parsed.</p></dd>
> +<dt>Metadata.TITLE</dt>
> +<dd>The title of the document.
> +<p>The parser implementation sets this property if the document format 
> contains an explicit title field.</p></dd>
> +<dt>Metadata.AUTHOR</dt>
> +<dd>The name of the author of the document.
> +<p>The parser implementation sets this property if the document format 
> contains an explicit author field.</p></dd></dl>
> +<p>Note that metadata handling is still being discussed by the Tika 
> development team, and it is likely that there will be some (backwards 
> incompatible) changes in metadata handling before Tika 1.0.</p></div>
> +<div class="section">
> +<h3>Parse context<a name="Parse_context"></a></h3>
> +<p>The final argument to the <tt>parse</tt> method is used to inject 
> context-specific information into the parsing process. This is useful, for 
> example, when dealing with locale-specific date and number formats in 
> Microsoft Excel spreadsheets. Another important use of the parse context is 
> passing in the delegate parser instance to be used by two-phase parsers like 
> the <a 
> href="./api/org/apache/tika/parser/pkg/PackageParser.html">PackageParser</a> 
> subclasses. Some parser classes allow customization of the parsing process 
> through strategy objects in the parse context.</p></div>
> +<div class="section">
> +<h3>Parser implementations<a name="Parser_implementations"></a></h3>
> +<p>Apache Tika comes with a number of parser classes for parsing <a 
> href="./formats.html">various document formats</a>. You can also extend Tika 
> with your own parsers, and of course any contributions to Tika are warmly 
> welcome.</p>
> +<p>The goal of Tika is to reuse existing parser libraries like <a 
> class="externalLink" href="http://www.pdfbox.org/">PDFBox</a> or <a 
> class="externalLink" href="http://poi.apache.org/">Apache POI</a> as much as 
> possible, and so most of the parser classes in Tika are adapters to such 
> external libraries.</p>
> +<p>Tika also contains some general purpose parser implementations that are 
> not targeted at any specific document formats. The most notable of these is 
> the <a 
> href="./apidocs/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a>
>  class that encapsulates all Tika functionality into a single parser that can 
> handle any type of document. This parser will automatically determine the 
> type of the incoming document based on various heuristics and will then parse 
> the document accordingly.</p></div></div>
> +      </div>
> +      <div id="sidebar">
> +        <div id="navigation">
> +                    <h5>Apache Tika</h5>
> +            <ul>
> +              
> +    <li class="none">
> +                    <a href="../index.html">Introduction</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../download.html">Download</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../contribute.html">Contribute</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../mail-lists.html">Mailing Lists</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://wiki.apache.org/tika/" 
> class="externalLink">Tika Wiki</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="https://issues.apache.org/jira/browse/TIKA" 
> class="externalLink">Issue Tracker</a>
> +          </li>
> +          </ul>
> +              <h5>Documentation</h5>
> +            <ul>
> +              
> +          
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="expanded">
> +                    <a href="../1.5/index.html">Apache Tika 1.5</a>
> +                  <ul>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/gettingstarted.html">Getting Started</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/formats.html">Supported Formats</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/parser.html">Parser API</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/parser_guide.html">Parser 5min Quick 
> Start Guide</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/detection.html">Content and Language 
> Detection</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/api/">API Documentation</a>
> +          </li>
> +              </ul>
> +        </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.4/index.html">Apache Tika 1.4</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.3/index.html">Apache Tika 1.3</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.2/index.html">Apache Tika 1.2</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.1/index.html">Apache Tika 1.1</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.0/index.html">Apache Tika 1.0</a>
> +                </li>
> +          </ul>
> +              <h5>The Apache Software Foundation</h5>
> +            <ul>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/foundation/" 
> class="externalLink">About</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/licenses/" 
> class="externalLink">License</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/security/" 
> class="externalLink">Security</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a 
> href="http://www.apache.org/foundation/sponsorship.html" 
> class="externalLink">Sponsorship</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/foundation/thanks.html" 
> class="externalLink">Thanks</a>
> +          </li>
> +          </ul>
> +      
> +          <div id="search">
> +            <h5>Search with Apache Solr</h5>
> +            <form action="http://search.lucidimagination.com/p:tika"
> +                  method="get" id="searchform">
> +              <input type="text" id="query" name="q"/>
> +              <select name="searchProvider" id="searchProvider">
> +                <option value="any">provider</option>
> +                <option value="lucid">Lucid Find</option>
> +                <option value="sl">Search-Lucene</option>
> +              </select>
> +              <input type="submit" id="submit" value="Search" name="Search"
> +                     onclick="selectProvider(this.form)"/>
> +            </form>
> +          </div>
> +
> +          <div id="bookpromo">
> +            <h5>Books about Tika</h5>
> +            <p>
> +              <a href="http://manning.com/mattmann/" title="Tika in Action"
> +                ><img src="../mattmann_cover150.jpg"
> +                      width="150" height="186"/></a>
> +            </p>
> +          </div>
> +        </div>
> +      </div>
> +      <div id="footer">
> +        <p>
> +          Copyright &#169; 2014
> +          <a href="http://www.apache.org/">The Apache Software 
> Foundation</a>.
> +          Site powered by <a href="http://maven.apache.org/">Apache 
> Maven</a>. 
> +          Search powered by
> +          <a href="http://www.lucidimagination.com">Lucid Imagination</a>
> +          and <a href="http://sematext.com">Sematext</a>.
> +          <br/>
> +          Apache Tika, Tika, Apache, the Apache feather logo, and the Apache
> +          Tika project logo are trademarks of The Apache Software Foundation.
> +        </p>
> +      </div>
> +    </div>
> +  </body>
> +</html>
> 
> Added: tika/site/publish/1.6/parser_guide.html
> URL: 
> http://svn.apache.org/viewvc/tika/site/publish/1.6/parser_guide.html?rev=1622762&view=auto
> ==============================================================================
> --- tika/site/publish/1.6/parser_guide.html (added)
> +++ tika/site/publish/1.6/parser_guide.html Fri Sep  5 19:14:58 2014
> @@ -0,0 +1,373 @@
> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> +          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> +
> +<!--
> +  Licensed to the Apache Software Foundation (ASF) under one
> +  or more contributor license agreements.  See the NOTICE file
> +  distributed with this work for additional information
> +  regarding copyright ownership.  The ASF licenses this file
> +  to you under the Apache License, Version 2.0 (the
> +  "License"); you may not use this file except in compliance
> +  with the License.  You may obtain a copy of the License at
> +
> +    http://www.apache.org/licenses/LICENSE-2.0
> + 
> +  Unless required by applicable law or agreed to in writing,
> +  software distributed under the License is distributed on an
> +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> +  KIND, either express or implied.  See the License for the
> +  specific language governing permissions and limitations
> +  under the License.
> +-->
> +
> +
> +
> +
> +
> +
> +
> +<html xmlns="http://www.w3.org/1999/xhtml">
> +  <head>
> +    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
> +    <title>Apache Tika - Get Tika parsing up and running in 5 minutes</title>
> +    <style type="text/css" media="all">
> +      @import url("../css/site.css");
> +    </style>
> +    <link rel="icon" type="image/png" href="../tikaNoText16.png" />
> +    <script type="text/javascript">
> +      function selectProvider(form) {
> +        provider = form.elements['searchProvider'].value;
> +        if (provider == "any") {
> +          if (Math.random() > 0.5) {
> +            provider = "lucid";
> +          } else {
> +            provider = "sl";
> +          }
> +        }
> +        if (provider == "lucid") {
> +          form.action = "http://find.searchhub.org/p:tika";
> +        } else if (provider == "sl") {
> +          form.action = "http://search-lucene.com/tika";
> +        }
> +        days = 90;
> +        date = new Date();
> +        date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
> +        expires = "; expires=" + date.toGMTString();
> +        document.cookie = "searchProvider=" + provider + expires + "; 
> path=/";
> +      }
> +      function initProvider() {
> +        if (document.cookie.length>0) {
> +          cStart=document.cookie.indexOf("searchProvider=");
> +          if (cStart!=-1) {
> +            cStart=cStart + "searchProvider=".length;
> +            cEnd=document.cookie.indexOf(";", cStart);
> +            if (cEnd==-1) {
> +              cEnd=document.cookie.length;
> +            }
> +            provider = unescape(document.cookie.substring(cStart,cEnd));
> +            document.forms['searchform'].elements['searchProvider'].value = 
> provider;
> +          }
> +        }
> +        document.forms['searchform'].elements['q'].focus();
> +      }
> +    </script>
> +  </head>
> +  <body onLoad="initProvider();">
> +    <div id="body">
> +      <div id="banner">
> +        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
> +          ><img src="http://tika.apache.org/tika.png" alt="Apache Tika"
> +                width="292" height="100"/></a>
> +        <a href="http://www.apache.org/" id="bannerRight"
> +           title="The Apache Software Foundation"
> +          ><img src="http://tika.apache.org/asf-logo.gif" alt="The Apache 
> Software Foundation"
> +                width="387" height="100"/></a>
> +      </div>
> +      <div id="content">
> +        <!-- Licensed to the Apache Software Foundation (ASF) under one or 
> more --><!-- contributor license agreements.  See the NOTICE file distributed 
> with --><!-- this work for additional information regarding copyright 
> ownership. --><!-- The ASF licenses this file to You under the Apache 
> License, Version 2.0 --><!-- (the "License"); you may not use this file 
> except in compliance with --><!-- the License.  You may obtain a copy of the 
> License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 
> --><!--  --><!-- Unless required by applicable law or agreed to in writing, 
> software --><!-- distributed under the License is distributed on an "AS IS" 
> BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express 
> or implied. --><!-- See the License for the specific language governing 
> permissions and --><!-- limitations under the License. --><div 
> class="section">
> +<h2>Get Tika parsing up and running in 5 minutes<a 
> name="Get_Tika_parsing_up_and_running_in_5_minutes"></a></h2>
> +<p>This page is a quick start guide showing how to add a new parser to 
> Apache Tika. By following the simple steps listed below, you can have your 
> new parser running in only 5 minutes.</p>
> +<ul>
> +<li><a href="#Get_Tika_parsing_up_and_running_in_5_minutes">Get Tika parsing 
> up and running in 5 minutes</a>
> +<ul>
> +<li><a href="#Getting_Started">Getting Started</a></li>
> +<li><a href="#Add_your_MIME-Type">Add your MIME-Type</a></li>
> +<li><a href="#Create_your_Parser_class">Create your Parser class</a></li>
> +<li><a href="#List_the_new_parser">List the new 
> parser</a></li></ul></li></ul>
> +<div class="section">
> +<h3><a name="Getting_Started">Getting Started</a></h3>
> +<p>The <a href="./gettingstarted.html">Getting Started</a> document 
> describes how to build Apache Tika from sources and how to start using Tika 
> in an application. Pay close attention and follow the instructions in the 
> &quot;Getting and building the sources&quot; section.</p></div>
> +<div class="section">
> +<h3><a name="Add_your_MIME-Type">Add your MIME-Type</a></h3>
> +<p>Tika loads the core, standard MIME-Types from the file 
> &quot;org/apache/tika/mime/tika-mimetypes.xml&quot;, which comes from <a 
> class="externalLink" 
> href="http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml">tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml</a>.
>  If your new MIME-Type is a standard one which is missing from Tika, submit 
> a patch for this file!</p>
> +<p>If your MIME-Type is not a standard one, create a new file 
> &quot;org/apache/tika/mime/custom-mimetypes.xml&quot; in your codebase and 
> add something like this to it:</p>
> +<div>
> +<pre> &lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
> + &lt;mime-info&gt;
> +   &lt;mime-type type=&quot;application/hello&quot;&gt;
> +          &lt;glob pattern=&quot;*.hi&quot;/&gt;
> +   &lt;/mime-type&gt;
> + &lt;/mime-info&gt;</pre></div></div>
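
To get a feel for what a glob pattern such as "*.hi" selects, here is a small sketch using the JDK's own glob matcher. It only illustrates glob semantics on file names; Tika's own glob handling may differ in detail, and the class name GlobSketch and method matchesGlob are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobSketch {

    /** Returns true if the last path segment of fileName matches the glob. */
    public static boolean matchesGlob(String glob, String fileName) {
        // JDK glob matcher, used here as a stand-in for illustration only
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName).getFileName());
    }

    public static void main(String[] args) {
        System.out.println(matchesGlob("*.hi", "greeting.hi"));  // prints "true"
        System.out.println(matchesGlob("*.hi", "greeting.txt")); // prints "false"
    }
}
```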
> +<div class="section">
> +<h3><a name="Create_your_Parser_class">Create your Parser class</a></h3>
> +<p>Now, you need to create your new parser. This is a class that must 
> implement the Parser interface offered by Tika. Instead of implementing the 
> Parser interface directly, it is recommended that you extend the abstract 
> class AbstractParser if possible. AbstractParser handles translating between 
> API changes for you.</p>
> +<p>A very simple Tika Parser looks like this:</p>
> +<div>
> +<pre>/*
> + * Licensed to the Apache Software Foundation (ASF) under one or more
> + * contributor license agreements.  See the NOTICE file distributed with
> + * this work for additional information regarding copyright ownership.
> + * The ASF licenses this file to You under the Apache License, Version 2.0
> + * (the &quot;License&quot;); you may not use this file except in compliance 
> with
> + * the License.  You may obtain a copy of the License at
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an &quot;AS IS&quot; 
> BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + * 
> + * @Author: Arturo Beltran
> + */
> +package org.apache.tika.parser.hello;
> +
> +import java.io.IOException;
> +import java.io.InputStream;
> +import java.util.Collections;
> +import java.util.Set;
> +
> +import org.apache.tika.exception.TikaException;
> +import org.apache.tika.metadata.Metadata;
> +import org.apache.tika.mime.MediaType;
> +import org.apache.tika.parser.ParseContext;
> +import org.apache.tika.parser.AbstractParser;
> +import org.apache.tika.sax.XHTMLContentHandler;
> +import org.xml.sax.ContentHandler;
> +import org.xml.sax.SAXException;
> +
> +public class HelloParser extends AbstractParser {
> +
> +        private static final Set&lt;MediaType&gt; SUPPORTED_TYPES = 
> Collections.singleton(MediaType.application(&quot;hello&quot;));
> +        public static final String HELLO_MIME_TYPE = 
> &quot;application/hello&quot;;
> +        
> +        public Set&lt;MediaType&gt; getSupportedTypes(ParseContext context) {
> +                return SUPPORTED_TYPES;
> +        }
> +
> +        public void parse(
> +                        InputStream stream, ContentHandler handler,
> +                        Metadata metadata, ParseContext context)
> +                        throws IOException, SAXException, TikaException {
> +
> +                metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
> +                metadata.set(&quot;Hello&quot;, &quot;World&quot;);
> +
> +                XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, 
> metadata);
> +                xhtml.startDocument();
> +                xhtml.endDocument();
> +        }
> +}</pre></div>
> +<p>Pay special attention to the definition of the SUPPORTED_TYPES static 
> class field in the parser class, which defines what MIME-Types it supports. 
> If your MIME-Types aren't standard ones, ensure you have listed them in a 
> &quot;custom-mimetypes.xml&quot; file so that Tika knows about them (see 
> above).</p>
> +<p>The &quot;parse&quot; method is where you will do all your work: extract 
> the information from the resource and then set the 
> metadata.</p></div>
> +<div class="section">
> +<h3><a name="List_the_new_parser">List the new parser</a></h3>
> +<p>Finally, you should explicitly tell the AutoDetectParser to include your 
> new parser. This step is only needed if you want to use the AutoDetectParser 
> functionality. If you figure out the correct parser in a different way, it 
> isn't needed. </p>
> +<p>List your new parser in: <a class="externalLink" 
> href="http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser">tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser</a></p></div></div>
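
For the HelloParser example above, the listing step would presumably amount to one extra line in that services file. The fragment below is a sketch that assumes the file follows the usual Java service provider convention (one fully qualified class name per line, with "#" starting a comment) and that the parser lives in the package shown in the example.

```text
# META-INF/services/org.apache.tika.parser.Parser
# (existing parser entries stay as they are)
org.apache.tika.parser.hello.HelloParser
```

If the parser is kept in a separate jar instead of the Tika source tree, a file with the same name under that jar's own META-INF/services directory should serve the same purpose under the same convention.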
> +      </div>
> +      <div id="sidebar">
> +        <div id="navigation">
> +                    <h5>Apache Tika</h5>
> +            <ul>
> +              
> +    <li class="none">
> +                    <a href="../index.html">Introduction</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../download.html">Download</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../contribute.html">Contribute</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="../mail-lists.html">Mailing Lists</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://wiki.apache.org/tika/" 
> class="externalLink">Tika Wiki</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="https://issues.apache.org/jira/browse/TIKA" 
> class="externalLink">Issue Tracker</a>
> +          </li>
> +          </ul>
> +              <h5>Documentation</h5>
> +            <ul>
> +              
> +          
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="expanded">
> +                    <a href="../1.5/index.html">Apache Tika 1.5</a>
> +                  <ul>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/gettingstarted.html">Getting Started</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/formats.html">Supported Formats</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/parser.html">Parser API</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/parser_guide.html">Parser 5min Quick 
> Start Guide</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/detection.html">Content and Language 
> Detection</a>
> +          </li>
> +                  
> +    <li class="none">
> +                    <a href="../1.5/api/">API Documentation</a>
> +          </li>
> +              </ul>
> +        </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.4/index.html">Apache Tika 1.4</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.3/index.html">Apache Tika 1.3</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.2/index.html">Apache Tika 1.2</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.1/index.html">Apache Tika 1.1</a>
> +                </li>
> +              
> +                
> +                    
> +                  
> +                  
> +                  
> +                  
> +                  
> +              
> +        <li class="collapsed">
> +                    <a href="../1.0/index.html">Apache Tika 1.0</a>
> +                </li>
> +          </ul>
> +              <h5>The Apache Software Foundation</h5>
> +            <ul>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/foundation/" 
> class="externalLink">About</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/licenses/" 
> class="externalLink">License</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/security/" 
> class="externalLink">Security</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a 
> href="http://www.apache.org/foundation/sponsorship.html" 
> class="externalLink">Sponsorship</a>
> +          </li>
> +              
> +    <li class="none">
> +                    <a href="http://www.apache.org/foundation/thanks.html" 
> class="externalLink">Thanks</a>
> +          </li>
> +          </ul>
> +      
> +          <div id="search">
> +            <h5>Search with Apache Solr</h5>
> +            <form action="http://search.lucidimagination.com/p:tika"
> +                  method="get" id="searchform">
> +              <input type="text" id="query" name="q"/>
> +              <select name="searchProvider" id="searchProvider">
> +                <option value="any">provider</option>
> +                <option value="lucid">Lucid Find</option>
> +                <option value="sl">Search-Lucene</option>
> +              </select>
> +              <input type="submit" id="submit" value="Search" name="Search"
> +                     onclick="selectProvider(this.form)"/>
> +            </form>
> +          </div>
> +
> +          <div id="bookpromo">
> +            <h5>Books about Tika</h5>
> +            <p>
> +              <a href="http://manning.com/mattmann/" title="Tika in Action"
> +                ><img src="../mattmann_cover150.jpg"
> +                      width="150" height="186"/></a>
> +            </p>
> +          </div>
> +        </div>
> +      </div>
> +      <div id="footer">
> +        <p>
> +          Copyright &#169; 2014
> +          <a href="http://www.apache.org/">The Apache Software 
> Foundation</a>.
> +          Site powered by <a href="http://maven.apache.org/">Apache 
> Maven</a>. 
> +          Search powered by
> +          <a href="http://www.lucidimagination.com">Lucid Imagination</a>
> +          and <a href="http://sematext.com">Sematext</a>.
> +          <br/>
> +          Apache Tika, Tika, Apache, the Apache feather logo, and the Apache
> +          Tika project logo are trademarks of The Apache Software Foundation.
> +        </p>
> +      </div>
> +    </div>
> +  </body>
> +</html>
> 
> 
