OK, was able to merge these in - didn't stomp over what I was doing.
Thanks Nick, no worries!

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: <Mattmann>, Chris Mattmann <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, September 5, 2014 12:19 PM
To: "<devtika.apache.org>" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: svn commit: r1622762 [1/2] - in /tika/site/publish:
1.4/gettingstarted.html 1.5/gettingstarted.html 1.6/detection.html
1.6/formats.html 1.6/gettingstarted.html 1.6/parser.html
1.6/parser_guide.html 1.7/ 1.7/examples.html 1.7/formats.html index.html

>Nick I'm working on this, can u hold off?
>
>Sent from my iPhone
>
>> On Sep 5, 2014, at 12:15 PM, "[email protected]" <[email protected]> wrote:
>> 
>> Author: nick
>> Date: Fri Sep  5 19:14:58 2014
>> New Revision: 1622762
>> 
>> URL: http://svn.apache.org/r1622762
>> Log:
>> Republish the site
>> 
>> Added:
>>    tika/site/publish/1.6/detection.html
>>    tika/site/publish/1.6/gettingstarted.html
>>    tika/site/publish/1.6/parser.html
>>    tika/site/publish/1.6/parser_guide.html
>>    tika/site/publish/1.7/
>>    tika/site/publish/1.7/examples.html
>>    tika/site/publish/1.7/formats.html
>> Modified:
>>    tika/site/publish/1.4/gettingstarted.html
>>    tika/site/publish/1.5/gettingstarted.html
>>    tika/site/publish/1.6/formats.html
>>    tika/site/publish/index.html
>> 
>> Modified: tika/site/publish/1.4/gettingstarted.html
>> URL: 
>>http://svn.apache.org/viewvc/tika/site/publish/1.4/gettingstarted.html?re
>>v=1622762&r1=1622761&r2=1622762&view=diff
>> 
>>=========================================================================
>>=====
>> --- tika/site/publish/1.4/gettingstarted.html (original)
>> +++ tika/site/publish/1.4/gettingstarted.html Fri Sep  5 19:14:58 2014
>> @@ -94,13 +94,13 @@
>> <div>
>> <pre>mvn install</pre></div>
>> <p>See the Maven documentation for more information about the available
>>build options.</p>
>> -<p>Note that you need Java 5 or higher to build Tika.</p></div>
>> +<p>Note that you need Java 6 or higher to build Tika.</p></div>
>> <div class="section">
>> <h2>Build artifacts<a name="Build_artifacts"></a></h2>
>> <p>The Tika build consists of a number of components and produces the
>>following main binaries:</p>
>> <dl>
>> <dt>tika-core/target/tika-core-*.jar</dt>
>> -<dd> Tika core library. Contains the core interfaces and classes of
>>Tika, but none of the parser implementations. Depends only on Java
>>5.</dd>
>> +<dd> Tika core library. Contains the core interfaces and classes of
>>Tika, but none of the parser implementations. Depends only on Java
>>6.</dd>
>> <dt>tika-parsers/target/tika-parsers-*.jar</dt>
>> <dd> Tika parsers. Collection of classes that implement the Tika Parser
>>interface based on various external parser libraries.</dd>
>> <dt>tika-app/target/tika-app-*.jar</dt>
>> 
>> Modified: tika/site/publish/1.5/gettingstarted.html
>> URL: 
>>http://svn.apache.org/viewvc/tika/site/publish/1.5/gettingstarted.html?re
>>v=1622762&r1=1622761&r2=1622762&view=diff
>> 
>>========================================================================
>>====
>> --- tika/site/publish/1.5/gettingstarted.html (original)
>> +++ tika/site/publish/1.5/gettingstarted.html Fri Sep  5 19:14:58 2014
>> @@ -94,13 +94,13 @@
>> <div>
>> <pre>mvn install</pre></div>
>> <p>See the Maven documentation for more information about the available
>>build options.</p>
>> -<p>Note that you need Java 5 or higher to build Tika.</p></div>
>> +<p>Note that you need Java 6 or higher to build Tika.</p></div>
>> <div class="section">
>> <h2>Build artifacts<a name="Build_artifacts"></a></h2>
>> <p>The Tika build consists of a number of components and produces the
>>following main binaries:</p>
>> <dl>
>> <dt>tika-core/target/tika-core-*.jar</dt>
>> -<dd> Tika core library. Contains the core interfaces and classes of
>>Tika, but none of the parser implementations. Depends only on Java
>>5.</dd>
>> +<dd> Tika core library. Contains the core interfaces and classes of
>>Tika, but none of the parser implementations. Depends only on Java
>>6.</dd>
>> <dt>tika-parsers/target/tika-parsers-*.jar</dt>
>> <dd> Tika parsers. Collection of classes that implement the Tika Parser
>>interface based on various external parser libraries.</dd>
>> <dt>tika-app/target/tika-app-*.jar</dt>
>> 
>> Added: tika/site/publish/1.6/detection.html
>> URL: 
>>http://svn.apache.org/viewvc/tika/site/publish/1.6/detection.html?rev=162
>>2762&view=auto
>> 
>>=========================================================================
>>=====
>> --- tika/site/publish/1.6/detection.html (added)
>> +++ tika/site/publish/1.6/detection.html Fri Sep  5 19:14:58 2014
>> @@ -0,0 +1,357 @@
>> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> +          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> +
>> +<!--
>> +  Licensed to the Apache Software Foundation (ASF) under one
>> +  or more contributor license agreements.  See the NOTICE file
>> +  distributed with this work for additional information
>> +  regarding copyright ownership.  The ASF licenses this file
>> +  to you under the Apache License, Version 2.0 (the
>> +  "License"); you may not use this file except in compliance
>> +  with the License.  You may obtain a copy of the License at
>> +
>> +    http://www.apache.org/licenses/LICENSE-2.0
>> + 
>> +  Unless required by applicable law or agreed to in writing,
>> +  software distributed under the License is distributed on an
>> +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
>> +  KIND, either express or implied.  See the License for the
>> +  specific language governing permissions and limitations
>> +  under the License.
>> +-->
>> +
>> +
>> +
>> +
>> +
>> +
>> +<html xmlns="http://www.w3.org/1999/xhtml">
>> +  <head>
>> +    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"
>>/>
>> +    <title>Apache Tika Content Detection</title>
>> +    <style type="text/css" media="all">
>> +      @import url("../css/site.css");
>> +    </style>
>> +    <link rel="icon" type="image/png" href="../tikaNoText16.png" />
>> +    <script type="text/javascript">
>> +      function selectProvider(form) {
>> +        provider = form.elements['searchProvider'].value;
>> +        if (provider == "any") {
>> +          if (Math.random() > 0.5) {
>> +            provider = "lucid";
>> +          } else {
>> +            provider = "sl";
>> +          }
>> +        }
>> +        if (provider == "lucid") {
>> +          form.action = "http://find.searchhub.org/p:tika";
>> +        } else if (provider == "sl") {
>> +          form.action = "http://search-lucene.com/tika";
>> +        }
>> +        days = 90;
>> +        date = new Date();
>> +        date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
>> +        expires = "; expires=" + date.toGMTString();
>> +        document.cookie = "searchProvider=" + provider + expires + ";
>>path=/";
>> +      }
>> +      function initProvider() {
>> +        if (document.cookie.length>0) {
>> +          cStart=document.cookie.indexOf("searchProvider=");
>> +          if (cStart!=-1) {
>> +            cStart=cStart + "searchProvider=".length;
>> +            cEnd=document.cookie.indexOf(";", cStart);
>> +            if (cEnd==-1) {
>> +              cEnd=document.cookie.length;
>> +            }
>> +            provider =
>>unescape(document.cookie.substring(cStart,cEnd));
>> +            
>>document.forms['searchform'].elements['searchProvider'].value = provider;
>> +          }
>> +        }
>> +        document.forms['searchform'].elements['q'].focus();
>> +      }
>> +    </script>
>> +  </head>
>> +  <body onload="initProvider();">
>> +    <div id="body">
>> +      <div id="banner">
>> +        <a href="http://tika.apache.org" id="bannerLeft" title="Apache
>>Tika"
>> +          ><img src="http://tika.apache.org/tika.png" alt="Apache Tika"
>> +             width="" height="100"/></a>
>> +        <a href="http://www.apache.org/" id="bannerRight"
>> +           title="The Apache Software Foundation"
>> +          ><img src="http://tika.apache.org/asf-logo.gif" alt="The
>>Apache Software Foundation"
>> +                width="387" height="100"/></a>
>> +      </div>
>> +      <div id="content">
>> +        <!-- Licensed to the Apache Software Foundation (ASF) under
>>one or more --><!-- contributor license agreements.  See the NOTICE file
>>distributed with --><!-- this work for additional information regarding
>>copyright ownership. --><!-- The ASF licenses this file to You under the
>>Apache License, Version 2.0 --><!-- (the "License"); you may not use
>>this file except in compliance with --><!-- the License.  You may obtain
>>a copy of the License at --><!--  --><!--
>>http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless
>>required by applicable law or agreed to in writing, software --><!--
>>distributed under the License is distributed on an "AS IS" BASIS,
>>--><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>implied. --><!-- See the License for the specific language governing
>>permissions and --><!-- limitations under the License. --><div
>>class="section">
>> +<h2>Content Detection<a name="Content_Detection"></a></h2>
>> +<p>This page gives you information on how content and language
>>detection works with Apache Tika, and how to tune the behaviour of
>>Tika.</p>
>> +<ul>
>> +<li><a href="#Content_Detection">Content Detection</a>
>> +<ul>
>> +<li><a href="#The_Detector_Interface">The Detector Interface</a></li>
>> +<li><a href="#Mime_Magic_Detction">Mime Magic Detection</a></li>
>> +<li><a href="#Resource_Name_Based_Detection">Resource Name Based
>>Detection</a></li>
>> +<li><a href="#Known_Content_Type_Detection">Known Content Type
>>Detection</a></li>
>> +<li><a href="#The_default_Mime_Types_Detector">The default Mime Types
>>Detector</a></li>
>> +<li><a href="#Container_Aware_Detection">Container Aware
>>Detection</a></li>
>> +<li><a href="#The_default_Tika_Detector">The default Tika
>>Detector</a></li>
>> +<li><a href="#Ways_of_triggering_Detection">Ways of triggering
>>Detection</a></li>
>> +<li><a href="#Language_Detection">Language
>>Detection</a></li></ul></li></ul>
>> +<div class="section">
>> +<h3><a name="The_Detector_Interface">The Detector Interface</a></h3>
>> +<p>The <a 
>>href="./api/org/apache/tika/detect/Detector.html">org.apache.tika.detect.
>>Detector</a> interface is the basis for most of the content type
>>detection in Apache Tika. All the different ways of detecting content
>>all implement the same common method:</p>
>> +<div>
>> +<pre>MediaType detect(java.io.InputStream input,
>> +                 Metadata metadata) throws
>>java.io.IOException</pre></div>
>> +<p>The <tt>detect</tt> method takes the stream to inspect, and a
>><tt>Metadata</tt> object that holds any additional information on the
>>content. The detector will return a <a
>>href="./api/org/apache/tika/mime/MediaType.html">MediaType</a> object
>>describing its best guess as to the type of the file.</p>
>> +<p>In general, only two keys on the Metadata object are used by
>>Detectors. These are <tt>Metadata.RESOURCE_NAME_KEY</tt> which should
>>hold the name of the file (where known), and
>><tt>Metadata.CONTENT_TYPE</tt> which should hold the advertised content
>>type of the file (eg from a webserver or a content repository).</p></div>
>> +<div class="section">
>> +<h3><a name="Mime_Magic_Detction">Mime Magic Detection</a></h3>
>> +<p>By looking for special (&quot;magic&quot;) patterns of bytes near
>>the start of the file, it is often possible to detect the type of the
>>file. For some file types, this is a simple process. For others,
>>typically container based formats, the magic detection may not be
>>enough. (More detail on detecting container formats below)</p>
>> +<p>Tika is able to make use of a mime magic info file, in the <a
>>class="externalLink"
>>href="http://www.freedesktop.org/standards/shared-mime-info">Freedesktop
>>MIME-info</a> format to perform mime magic detection. (Note that Tika
>>supports a few more match types than Freedesktop does)</p>
>> +<p>This is provided within Tika by <a
>>href="./api/org/apache/tika/detect/MagicDetector.html">org.apache.tika.de
>>tect.MagicDetector</a>. It is most commonly accessed via <a
>>href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.Mim
>>eTypes</a>, normally sourced from the <tt>tika-mimetypes.xml</tt> and
>><tt>custom-mimetypes.xml</tt> files. For more information on defining
>>your own custom mimetypes, see <a
>>href="./parser_guide.html#Add_your_MIME-Type">the new parser
>>guide</a>.</p></div>
>> +<div class="section">
>> +<h3><a name="Resource_Name_Based_Detection">Resource Name Based
>>Detection</a></h3>
>> +<p>Where the name of the file is known, it is sometimes possible to
>>guess the file type from the name or extension. Within the
>><tt>tika-mimetypes.xml</tt> file is a list of patterns which are used to
>>identify the type from the filename.</p>
>> +<p>However, because files may be renamed, this method of detection is
>>quick but not always as accurate.</p>
>> +<p>This is provided within Tika by <a
>>href="./api/org/apache/tika/detect/NameDetector.html">org.apache.tika.det
>>ect.NameDetector</a>.</p></div>
>> +<div class="section">
>> +<h3><a name="Known_Content_Type_Detection">Known Content Type
>>Detection</a></h3>
>> +<p>Sometimes, the mime type for a file is already known, such as when
>>downloading from a webserver, or when retrieving from a content store.
>>This information can be used by detectors, such as <a
>>href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.Mim
>>eTypes</a>.</p></div>
>> +<div class="section">
>> +<h3><a name="The_default_Mime_Types_Detector">The default Mime Types
>>Detector</a></h3>
>> +<p>By default, the mime type detection in Tika is provided by <a
>>href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.Mim
>>eTypes</a>. This detector makes use of <tt>tika-mimetypes.xml</tt> to
>>power magic based and filename based detection.</p>
>> +<p>Firstly, magic based detection is used on the start of the file. If
>>the file is an XML file, then the start of the XML is processed to look
>>for root elements. Next, if available, the filename (from
>><tt>Metadata.RESOURCE_NAME_KEY</tt>) is then used to improve the detail
>>of the detection, such as when magic detects a text file, and the
>>filename hints it's really a CSV. Finally, if available, the supplied
>>content type (from <tt>Metadata.CONTENT_TYPE</tt>) is used to further
>>refine the type.</p></div>
>> +<div class="section">
>> +<h3><a name="Container_Aware_Detection">Container Aware
>>Detection</a></h3>
>> +<p>Several common file formats are actually held within a common
>>container format. One example is the PowerPoint .ppt and Word .doc
>>formats, which are both held within an OLE2 container. Another is Apple
>>iWork formats, which are actually a series of XML files within a Zip
>>file.</p>
>> +<p>Using magic detection, it is easy to spot that a given file is an
>>OLE2 document, or a Zip file. Using magic detection alone, it is very
>>difficult (and often impossible) to tell what kind of file lives inside
>>the container.</p>
>> +<p>For some use cases, speed is important, so having a quick way to
>>know the container type is sufficient. For other cases however, you
>>don't mind spending a bit of time (and memory!) processing the container
>>to get a more accurate answer on its contents. For these cases, the
>>additional container aware detectors contained in the <tt>Tika
>>Parsers</tt> jar should be used.</p>
>> +<p>Tika provides a wrapping detector in the form of <a
>>href="./api/org/apache/tika/detect/DefaultDetector.html">org.apache.tika.
>>detect.DefaultDetector</a>. This uses the service loader to discover all
>>available detectors, including any available container aware ones, and
>>tries them in turn. For container aware detection, include the <tt>Tika
>>Parsers</tt> jar and its dependencies in your project, then use
>>DefaultDetector along with a <tt>TikaInputStream</tt>.</p>
>> +<p>Because these container detectors need to read the whole file to
>>open and inspect the container, they must be used with a <a
>>href="./api/org/apache/tika/io/TikaInputStream.html">org.apache.tika.io.T
>>ikaInputStream</a>. If called with a regular <tt>InputStream</tt>, then
>>all work will be done by the default Mime Magic detection only.</p>
>> +<p>For more information on container formats and Tika, see <a
>>class="externalLink"
>>href="http://wiki.apache.org/tika/MetadataDiscussion">http://wiki.apache.
>>org/tika/MetadataDiscussion</a></p></div>
>> +<div class="section">
>> +<h3><a name="The_default_Tika_Detector">The default Tika
>>Detector</a></h3>
>> +<p>Just as with Parsers, Tika provides a special detector <a
>>href="./api/org/apache/tika/detect/DefaultDetector.html">org.apache.tika.
>>detect.DefaultDetector</a> which auto-detects (based on service files)
>>the available detectors at runtime, and tries these in turn to identify
>>the file type.</p>
>> +<p>If only <tt>Tika Core</tt> is available, the Default Detector will
>>work only with Mime Magic and Resource Name detection. However, if
>><tt>Tika Parsers</tt> (and its dependencies!) are available, additional
>>detectors which know about containers (such as zip and ole2) will be
>>used as appropriate, provided that detection is being performed with a
>><a 
>>href="./api/org/apache/tika/io/TikaInputStream.html">org.apache.tika.io.T
>>ikaInputStream</a>. Custom detectors can also be used as desired; they
>>simply need to be listed in a service file much as is done for <a
>>href="./parser_guide.html#List_the_new_parser">custom
>>parsers</a>.</p></div>
>> +<div class="section">
>> +<h3><a name="Ways_of_triggering_Detection">Ways of triggering
>>Detection</a></h3>
>> +<p>The simplest way to detect is through the <a
>>href="./api/org/apache/tika/Tika.html">Tika Facade class</a>, which
>>provides methods to detect based on <a
>>href="./api/org/apache/tika/Tika.html#detect(java.io.File)">File</a>, <a
>>href="./api/org/apache/tika/Tika.html#detect(java.io.InputStream)">InputS
>>tream</a>, <a 
>>href="./api/org/apache/tika/Tika.html#detect(java.io.InputStream,
>>java.lang.String)">InputStream and Filename</a>, <a
>>href="./api/org/apache/tika/Tika.html#detect(java.lang.String)">Filename<
>>/a> or a few others. It works best with a File or <a
>>href="./api/org/apache/tika/io/TikaInputStream.html">TikaInputStream</a>.
>></p>
>> +<p>Alternately, detection can be performed on a specific Detector, or
>>using <tt>DefaultDetector</tt> to have all available Detectors used. A
>>typical pattern would be something like:</p>
>> +<div>
>> +<pre>TikaConfig tika = new TikaConfig();
>> +
>> +for(File f : myListOfFiles) {
>> +   Metadata metadata = new Metadata();
>> +   metadata.set(Metadata.RESOURCE_NAME_KEY, f.toString());
>> +   String mimetype = tika.getDetector().detect(
>> +        TikaInputStream.get(f), metadata);
>> +   System.out.println(&quot;File &quot; + f + &quot; is &quot; +
>>mimetype);
>> +}
>> +for (InputStream is : myListOfStreams) {
>> +   String mimetype = tika.getDetector().detect(
>> +       TikaInputStream.get(is), new Metadata());
>> +   System.out.println(&quot;Stream &quot; + is + &quot; is &quot; +
>>mimetype);
>> +}</pre></div></div>
>> +<div class="section">
>> +<h3><a name="Language_Detection">Language Detection</a></h3>
>> +<p>Tika is able to help identify the language of a piece of text,
>>which is useful when extracting text from document formats which do not
>>include language information in their metadata.</p>
>> +<p>The language detection is provided by <a
>>href="./api/org/apache/tika/language/LanguageIdentifier.html">org.apache.
>>tika.language.LanguageIdentifier</a></p></div></div>
>> +      </div>
>> +      <div id="sidebar">
>> +        <div id="navigation">
>> +                    <h5>Apache Tika</h5>
>> +            <ul>
>> +              
>> +    <li class="none">
>> +                    <a href="../index.html">Introduction</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../download.html">Download</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../contribute.html">Contribute</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../mail-lists.html">Mailing Lists</a>
>> +          </li>
>> +             
>> +    <li class="none">
>> +                    <a href="http://wiki.apache.org/tika/"
>>class="externalLink">Tika Wiki</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a
>>href="https://issues.apache.org/jira/browse/TIKA"
>>class="externalLink">Issue Tracker</a>
>> +          </li>
>> +          </ul>
>> +              <h5>Documentation</h5>
>> +            <ul>
>> +              
>> +          
>> +               
>> +               
>> +               
>> +               
>> +               
>> +               
>> +              
>> +        <li class="expanded">
>> +                    <a href="../1.5/index.html">Apache Tika 1.5</a>
>> +                  <ul>
>> +               
>> +    <li class="none">
>> +                   <a href="../1.5/gettingstarted.html">Getting
>>Started</a>
>> +          </li>
>> +               
>> +    <li class="none">
>> +                    <a href="../1.5/formats.html">Supported Formats</a>
>> +          </li>
>> +               
>> +    <li class="none">
>> +                   <a href="../1.5/parser.html">Parser API</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../1.5/parser_guide.html">Parser 5min
>>Quick Start Guide</a>
>> +          </li>
>> +               
>> +    <li class="none">
>> +                    <a href="../1.5/detection.html">Content and
>>Language Detection</a>
>> +          </li>
>> +               
>> +    <li class="none">
>> +                    <a href="../1.5/api/">API Documentation</a>
>> +         </li>
>> +              </ul>
>> +        </li>
>> +             
>> +               
>> +               
>> +               
>> +              
>> +               
>> +               
>> +               
>> +              
>> +       <li class="collapsed">
>> +                    <a href="../1.4/index.html">Apache Tika 1.4</a>
>> +                </li>
>> +              
>> +               
>> +               
>> +               
>> +               
>> +               
>> +               
>> +               
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.3/index.html">Apache Tika 1.3</a>
>> +                </li>
>> +              
>> +               
>> +               
>> +               
>> +               
>> +               
>> +               
>> +               
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.2/index.html">Apache Tika 1.2</a>
>> +                </li>
>> +              
>> +               
>> +               
>> +               
>> +               
>> +               
>> +               
>> +               
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.1/index.html">Apache Tika 1.1</a>
>> +                </li>
>> +              
>> +               
>> +               
>> +               
>> +              
>> +               
>> +               
>> +              
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.0/index.html">Apache Tika 1.0</a>
>> +                </li>
>> +          </ul>
>> +              <h5>The Apache Software Foundation</h5>
>> +            <ul>
>> +              
>> +    <li class="none">
>> +                 <a href="http://www.apache.org/foundation/"
>>class="externalLink">About</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                 <a href="http://www.apache.org/licenses/"
>>class="externalLink">License</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                 <a href="http://www.apache.org/security/"
>>class="externalLink">Security</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a
>>href="http://www.apache.org/foundation/sponsorship.html"
>>class="externalLink">Sponsorship</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a
>>href="http://www.apache.org/foundation/thanks.html"
>>class="externalLink">Thanks</a>
>> +          </li>
>> +          </ul>
>> +      
>> +          <div id="search">
>> +            <h5>Search with Apache Solr</h5>
>> +            <form action="http://search.lucidimagination.com/p:tika"
>> +                  method="get" id="searchform">
>> +              <input type="text" id="query" name="q"/>
>> +              <select name="searchProvider" id="searchProvider">
>> +                <option value="any">provider</option>
>> +                <option value="lucid">Lucid Find</option>
>> +                <option value="sl">Search-Lucene</option>
>> +              </select>
>> +              <input type="submit" id="submit" value="Search"
>>name="Search"
>> +                onclick="selectProvider(this.form)"/>
>> +            </form>
>> +          </div>
>> +
>> +          <div id="bookpromo">
>> +            <h5>Books about Tika</h5>
>> +            <p>
>> +              <a href="http://manning.com/mattmann/" title="Tika in
>>Action"
>> +                ><img src="../mattmann_cover150.jpg"
>> +                    width="150" height="186"/></a>
>> +            </p>
>> +          </div>
>> +        </div>
>> +      </div>
>> +      <div id="footer">
>> +        <p>
>> +          Copyright &#169; 2014
>> +          <a href="http://www.apache.org/">The Apache Software
>>Foundation</a>.
>> +          Site powered by <a href="http://maven.apache.org/">Apache
>>Maven</a>.
>> +          Search powered by
>> +          <a href="http://www.lucidimagination.com">Lucid
>>Imagination</a>
>> +          and <a href="http://sematext.com">Sematext</a>.
>> +          <br/>
>> +          Apache Tika, Tika, Apache, the Apache feather logo and the
>>Apache
>> +          Tika project logo are trademarks of The Apache Software
>>Foundation.
>> +        </p>
>> +      </div>
>> +    </div>
>> +  </body>
>> +</html>
>> 
>> Modified: tika/site/publish/1.6/formats.html
>> URL: 
>>http://svn.apache.org/viewvc/tika/site/publish/1.6/formats.html?rev=16227
>>62&r1=1622761&r2=1622762&view=diff
>> 
>>=========================================================================
>>=====
>> --- tika/site/publish/1.6/formats.html (original)
>> +++ tika/site/publish/1.6/formats.html Fri Sep  5 19:14:58 2014
>> @@ -110,7 +110,9 @@
>> <li><a href="#Mail_formats">Mail formats</a></li>
>> <li><a href="#CAD_formats">CAD formats</a></li>
>> <li><a href="#Font_formats">Font formats</a></li>
>> -<li><a href="#Executable_programs_and_libraries">Executable programs
>>and libraries</a></li></ul></li></ul>
>> +<li><a href="#Scientific_formats">Scientific formats</a></li>
>> +<li><a href="#Executable_programs_and_libraries">Executable programs
>>and libraries</a></li>
>> +<li><a href="#Crypto_formats">Crypto formats</a></li></ul></li></ul>
>> <div class="section">
>> <h3><a name="HyperText_Markup_Language">HyperText Markup
>>Language</a></h3>
>> <p>The HyperText Markup Language (HTML) is the lingua franca of the
>>web. Tika uses the <a class="externalLink"
>>href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> library to
>>support virtually any kind of HTML found on the web. The output from the
>><a 
>>href="./api/org/apache/tika/parser/html/HtmlParser.html">HtmlParser</a>
>>class is guaranteed to be well-formed and valid XHTML, and various
>>heuristics are used to prevent things like inline scripts from
>>cluttering the extracted text content.</p></div>
>> @@ -131,7 +133,8 @@
>> <p>The <a 
>>href="./api/org/apache/tika/parser/pdf/PDFParser.html">PDFParser</a>
>>class parses Portable Document Format (PDF) documents using the <a
>>class="externalLink" href="http://pdfbox.apache.org/">Apache PDFBox</a>
>>library.</p></div>
>> <div class="section">
>> <h3><a name="Electronic_Publication_Format">Electronic Publication
>>Format</a></h3>
>> -<p>The <a 
>>href="./api/org/apache/tika/parser/epub/EpubParser.html">EpubParser</a>
>>class supports the Electronic Publication Format (EPUB) used for many
>>digital books.</p></div>
>> +<p>The <a 
>>href="./api/org/apache/tika/parser/epub/EpubParser.html">EpubParser</a>
>>class supports the Electronic Publication Format (EPUB) used for many
>>digital books.</p>
>> +<p>The <a 
>>href="./api/org/apache/tika/parser/xml/FictionBookParser.html">FictionBoo
>>kParser</a> class supports the xml-based Fiction Book publishing
>>format.</p></div>
>> <div class="section">
>> <h3><a name="Rich_Text_Format">Rich Text Format</a></h3>
>> <p>The <a 
>>href="./api/org/apache/tika/parser/rtf/RTFParser.html">RTFParser</a>
>>class uses the standard javax.swing.text.rtf feature to extract text
>>content from Rich Text Format (RTF) documents.</p></div>
>> @@ -143,7 +146,8 @@
>> <p>Extracting text content from plain text files seems like a simple
>>task until you start thinking of all the possible character encodings.
>>The <a 
>>href="./api/org/apache/tika/parser/txt/TXTParser.html">TXTParser</a>
>>class uses encoding detection code from the <a class="externalLink"
>>href="http://site.icu-project.org/">ICU</a> project to automatically
>>detect the character encoding of a text document.</p></div>
>> <div class="section">
>> <h3><a name="Feed_and_Syndication_formats">Feed and Syndication
>>formats</a></h3>
>> -<p>The <a 
>>href="./api/org/apache/tika/parser/feed/FeedParser.html">FeedParser</a>
>>class supports the RSS and Atom feed syndication formats.</p></div>
>> +<p>The <a 
>>href="./api/org/apache/tika/parser/feed/FeedParser.html">FeedParser</a>
>>class supports the RSS and Atom feed syndication formats.</p>
>> +<p>The <a 
>>href="./api/org/apache/tika/parser/iptc/IptcAnpaParser.html">IptcAnpaPars
>>er</a> class supports the IPTC ANPA News Wire feed format.</p></div>
>> <div class="section">
>> <h3><a name="Help_formats">Help formats</a></h3>
>> <p>The <a 
>>href="./api/org/apache/tika/parser/chm/ChmParser.html">ChmParser</a>
>>class supports the CHM Help format.</p></div>
>> @@ -167,6 +171,7 @@
>> <div class="section">
>> <h3><a name="Mail_formats">Mail formats</a></h3>
>> <p>The <a 
>>href="./api/org/apache/tika/parser/mbox/MboxParser.html">MboxParser</a>
>>can extract email messages from the mbox format used by many email
>>archives and Unix-style mailboxes.</p>
>> +<p>The <a 
>>href="./api/org/apache/tika/parser/mail/RFC822Parser.html">RFC822Parser</
>>a> can process single email messages in the RFC 822 format used by many
>>email clients in their archives / exports.</p>
>> <p>The <a 
>>href="./api/org/apache/tika/parser/mbox/PSTParser.html">PSTParser</a>
>>can extract email messages from the Microsoft Outlook PST email
>>format.</p></div>
>> <div class="section">
>> <h3><a name="CAD_formats">CAD formats</a></h3>
>> @@ -175,8 +180,16 @@
>> <h3><a name="Font_formats">Font formats</a></h3>
>> <p>The <a 
>>href="./api/org/apache/tika/parser/font/TrueTypeParser.html">TrueTypePars
>>er</a> class can extract simple metadata from the TrueType font format.
>>The <a 
>>href="./api/org/apache/tika/parser/font/AdobeFontMetricParser.html">Adobe
>>FontMetricParser</a> class does something similar for Adobe Font Metrics
>>files.</p></div>
>> <div class="section">
>> +<h3><a name="Scientific_formats">Scientific formats</a></h3>
>> +<p>The <a 
>>href="./api/org/apache/tika/parser/hdf/HDFParser.html">HDFParser</a> is
>>able to extract attribute metadata from the HDF scientific file
>>format.</p>
>> +<p>The <a 
>>href="./api/org/apache/tika/parser/netcdf/NetCDFParser.html">NetCDFParser
>></a> is able to extract attribute metadata from the NetCDF scientific 
>>file format.</p>
>> +<p>The <a 
>>href="./api/org/apache/tika/parser/mat/MatParser.html">MatParser</a> is 
>>able to extract attribute metadata from the Matlab scientific file 
>>format.</p></div>
>> +<div class="section">
>> <h3><a name="Executable_programs_and_libraries">Executable programs and 
>>libraries</a></h3>
>> -<p>The <a 
>>href="./api/org/apache/tika/parser/executable/ExecutableParser.html">Exec
>>utableParser</a> can extract metadata information on platforms, 
>>architectures and types from a range of executable formats and 
>>libraries, such as Windows Executables and Linux / BSD programs and 
>>libraries.</p></div></div>
>> +<p>The <a 
>>href="./api/org/apache/tika/parser/executable/ExecutableParser.html">Exec
>>utableParser</a> can extract metadata information on platforms, 
>>architectures and types from a range of executable formats and 
>>libraries, such as Windows Executables and Linux / BSD programs and 
>>libraries.</p></div>
>> +<div class="section">
>> +<h3><a name="Crypto_formats">Crypto formats</a></h3>
>> +<p>The <a 
>>href="./api/org/apache/tika/parser/crypto/Pkcs7Parser.html">Pkcs7Parser</
>>a> is able to parse the contents of PKCS7 signed messages, but doesn't 
>>include any information from the outer PKCS7 wrapper.</p></div></div>
>> <div class="section">
>> <h2>Full list of supported formats:<a 
>>name="Full_list_of_supported_formats:"></a></h2>
>> <ul>
>> @@ -270,6 +283,9 @@
>> <li>org.apache.tika.parser.mail.<a 
>>href="./api/org/apache/tika/parser/mail/RFC822Parser">RFC822Parser</a>
>> <ul>
>> <li>message/rfc822</li></ul></li>
>> +<li>org.apache.tika.parser.mat.<a 
>>href="./api/org/apache/tika/parser/mat/MatParser">MatParser</a>
>> +<ul>
>> +<li>application/x-matlab-data</li></ul></li>
>> <li>org.apache.tika.parser.mbox.<a 
>>href="./api/org/apache/tika/parser/mbox/MboxParser">MboxParser</a>
>> <ul>
>> <li>application/mbox</li></ul></li>
>> 
>> Added: tika/site/publish/1.6/gettingstarted.html
>> URL: 
>>http://svn.apache.org/viewvc/tika/site/publish/1.6/gettingstarted.html?re
>>v=1622762&view=auto
>> 
>>=========================================================================
>>=====
>> --- tika/site/publish/1.6/gettingstarted.html (added)
>> +++ tika/site/publish/1.6/gettingstarted.html Fri Sep  5 19:14:58 2014
>> @@ -0,0 +1,413 @@
>> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> +          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> +
>> +<!--
>> +  Licensed to the Apache Software Foundation (ASF) under one
>> +  or more contributor license agreements.  See the NOTICE file
>> +  distributed with this work for additional information
>> +  regarding copyright ownership.  The ASF licenses this file
>> +  to you under the Apache License, Version 2.0 (the
>> +  "License"); you may not use this file except in compliance
>> +  with the License.  You may obtain a copy of the License at
>> +
>> +    http://www.apache.org/licenses/LICENSE-2.0
>> + 
>> +  Unless required by applicable law or agreed to in writing,
>> +  software distributed under the License is distributed on an
>> +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
>> +  KIND, either express or implied.  See the License for the
>> +  specific language governing permissions and limitations
>> +  under the License.
>> +-->
>> +
>> +
>> +
>> +
>> +
>> +
>> +
>> +<html xmlns="http://www.w3.org/1999/xhtml">
>> +  <head>
>> +    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" 
>>/>
>> +    <title>Apache Tika - Getting Started with Apache Tika</title>
>> +    <style type="text/css" media="all">
>> +      @import url("../css/site.css");
>> +    </style>
>> +    <link rel="icon" type="image/png" href="../tikaNoText16.png" />
>> +    <script type="text/javascript">
>> +      function selectProvider(form) {
>> +        provider = form.elements['searchProvider'].value;
>> +        if (provider == "any") {
>> +          if (Math.random() > 0.5) {
>> +            provider = "lucid";
>> +          } else {
>> +            provider = "sl";
>> +          }
>> +        }
>> +        if (provider == "lucid") {
>> +          form.action = "http://find.searchhub.org/p:tika";
>> +        } else if (provider == "sl") {
>> +          form.action = "http://search-lucene.com/tika";
>> +        }
>> +        days = 90;
>> +        date = new Date();
>> +        date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
>> +        expires = "; expires=" + date.toGMTString();
>> +        document.cookie = "searchProvider=" + provider + expires + "; 
>>path=/";
>> +      }
>> +      function initProvider() {
>> +        if (document.cookie.length>0) {
>> +          cStart=document.cookie.indexOf("searchProvider=");
>> +          if (cStart!=-1) {
>> +            cStart=cStart + "searchProvider=".length;
>> +            cEnd=document.cookie.indexOf(";", cStart);
>> +            if (cEnd==-1) {
>> +              cEnd=document.cookie.length;
>> +            }
>> +            provider = 
>>unescape(document.cookie.substring(cStart,cEnd));
>> +            
>>document.forms['searchform'].elements['searchProvider'].value = provider;
>> +          }
>> +        }
>> +        document.forms['searchform'].elements['q'].focus();
>> +      }
>> +    </script>
>> +  </head>
>> +  <body onLoad="initProvider();">
>> +    <div id="body">
>> +      <div id="banner">
>> +        <a href="http://tika.apache.org" id="bannerLeft" title="Apache 
>>Tika"
>> +          ><img src="http://tika.apache.org/tika.png" alt="Apache Tika"
>> +                width="292" height="100"/></a>
>> +        <a href="http://www.apache.org/" id="bannerRight"
>> +           title="The Apache Software Foundation"
>> +          ><img src="http://tika.apache.org/asf-logo.gif" alt="The 
>>Apache Software Foundation"
>> +                width="387" height="100"/></a>
>> +      </div>
>> +      <div id="content">
>> +        <!-- Licensed to the Apache Software Foundation (ASF) under 
>>one or more --><!-- contributor license agreements.  See the NOTICE file 
>>distributed with --><!-- this work for additional information regarding 
>>copyright ownership. --><!-- The ASF licenses this file to You under the 
>>Apache License, Version 2.0 --><!-- (the "License"); you may not use 
>>this file except in compliance with --><!-- the License.  You may obtain 
>>a copy of the License at --><!--  --><!-- 
>>http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless 
>>required by applicable law or agreed to in writing, software --><!-- 
>>distributed under the License is distributed on an "AS IS" BASIS, 
>>--><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>>implied. --><!-- See the License for the specific language governing 
>>permissions and --><!-- limitations under the License. --><div 
>>class="section">
>> +<h2>Getting Started with Apache Tika<a 
>>name="Getting_Started_with_Apache_Tika"></a></h2>
>> +<p>This document describes how to build Apache Tika from sources and 
>>how to start using Tika in an application.</p></div>
>> +<div class="section">
>> +<h2>Getting and building the sources<a 
>>name="Getting_and_building_the_sources"></a></h2>
>> +<p>To build Tika from sources you first need to either <a 
>>href="../download.html">download</a> a source release or <a 
>>href="../source-repository.html">checkout</a> the latest sources from 
>>version control.</p>
>> +<p>Once you have the sources, you can build them using the <a 
>>class="externalLink" href="http://maven.apache.org/">Maven 2</a> build 
>>system. Executing the following command in the base directory will build 
>>the sources and install the resulting artifacts in your local Maven 
>>repository.</p>
>> +<div>
>> +<pre>mvn install</pre></div>
>> +<p>See the Maven documentation for more information about the 
>>available build options.</p>
>> +<p>Note that you need Java 6 or higher to build Tika.</p></div>
>> +<div class="section">
>> +<h2>Build artifacts<a name="Build_artifacts"></a></h2>
>> +<p>The Tika build consists of a number of components and produces the 
>>following main binaries:</p>
>> +<dl>
>> +<dt>tika-core/target/tika-core-*.jar</dt>
>> +<dd> Tika core library. Contains the core interfaces and classes of 
>>Tika, but none of the parser implementations. Depends only on Java 
>>6.</dd>
>> +<dt>tika-parsers/target/tika-parsers-*.jar</dt>
>> +<dd> Tika parsers. Collection of classes that implement the Tika 
>>Parser interface based on various external parser libraries.</dd>
>> +<dt>tika-app/target/tika-app-*.jar</dt>
>> +<dd> Tika application. Combines the above components and all the 
>>external parser libraries into a single runnable jar with a GUI and a 
>>command line interface.</dd>
>> +<dt>tika-bundle/target/tika-bundle-*.jar</dt>
>> +<dd> Tika bundle. An OSGi bundle that combines tika-parsers with 
>>non-OSGified parser libraries to make them easy to deploy in an OSGi 
>>environment.</dd></dl></div>
>> +<div class="section">
>> +<h2>Using Tika as a Maven dependency<a 
>>name="Using_Tika_as_a_Maven_dependency"></a></h2>
>> +<p>The core library, tika-core, contains the key interfaces and 
>>classes of Tika and can be used by itself if you don't need the full set 
>>of parsers from the tika-parsers component. The tika-core dependency 
>>looks like this:</p>
>> +<div>
>> +<pre>  &lt;dependency&gt;
>> +    &lt;groupId&gt;org.apache.tika&lt;/groupId&gt;
>> +    &lt;artifactId&gt;tika-core&lt;/artifactId&gt;
>> +    &lt;version&gt;...&lt;/version&gt;
>> +  &lt;/dependency&gt;</pre></div>
>> +<p>If you want to use Tika to parse documents (instead of simply 
>>detecting document types, etc.), you'll want to depend on tika-parsers 
>>instead: </p>
>> +<div>
>> +<pre>  &lt;dependency&gt;
>> +    &lt;groupId&gt;org.apache.tika&lt;/groupId&gt;
>> +    &lt;artifactId&gt;tika-parsers&lt;/artifactId&gt;
>> +    &lt;version&gt;...&lt;/version&gt;
>> +  &lt;/dependency&gt;</pre></div>
>> +<p>Note that adding this dependency will introduce a number of 
>>transitive dependencies to your project, including one on tika-core. You 
>>need to make sure that these dependencies won't conflict with your 
>>existing project dependencies. You can use the following command in the 
>>tika-parsers directory to get a full listing of all the dependencies.</p>
>> +<div>
>> +<pre>$ mvn dependency:tree | grep :compile</pre></div></div>
>> +<div class="section">
>> +<h2>Using Tika in an Ant project<a 
>>name="Using_Tika_in_an_Ant_project"></a></h2>
>> +<p>Unless you use a dependency manager tool like <a 
>>class="externalLink" href="http://ant.apache.org/ivy/">Apache Ivy</a>, 
>>the easiest way to use Tika is to include either the tika-core or the 
>>tika-app jar in your classpath, depending on whether you want just the 
>>core functionality or also all the parser implementations.</p>
>> +<div>
>> +<pre>&lt;classpath&gt;
>> +  ... &lt;!-- your other classpath entries --&gt;
>> +
>> +  &lt;!-- either: --&gt;
>> +  &lt;pathelement 
>>location=&quot;path/to/tika-core-${tika.version}.jar&quot;/&gt;
>> +  &lt;!-- or: --&gt;
>> +  &lt;pathelement 
>>location=&quot;path/to/tika-app-${tika.version}.jar&quot;/&gt;
>> +
>> +&lt;/classpath&gt;</pre></div></div>
>> +<div class="section">
>> +<h2>Using Tika as a command line utility<a 
>>name="Using_Tika_as_a_command_line_utility"></a></h2>
>> +<p>The Tika application jar (tika-app-*.jar) can be used as a command 
>>line utility for extracting text content and metadata from all sorts of 
>>files. This runnable jar contains all the dependencies it needs, so you 
>>don't need to worry about classpath settings to run it.</p>
>> +<p>The usage instructions are shown below.</p>
>> +<div>
>> +<pre>usage: java -jar tika-app.jar [option...] [file|port...]
>> +
>> +Options:
>> +    -?  or --help          Print this usage message
>> +    -v  or --verbose       Print debug level messages
>> +    -V  or --version       Print the Apache Tika version number
>> +
>> +    -g  or --gui           Start the Apache Tika GUI
>> +    -s  or --server        Start the Apache Tika server
>> +    -f  or --fork          Use Fork Mode for out-of-process extraction
>> +
>> +    -x  or --xml           Output XHTML content (default)
>> +    -h  or --html          Output HTML content
>> +    -t  or --text          Output plain text content
>> +    -T  or --text-main     Output plain text content (main content 
>>only)
>> +    -m  or --metadata      Output only metadata
>> +    -j  or --json          Output metadata in JSON
>> +    -y  or --xmp           Output metadata in XMP
>> +    -l  or --language      Output only language
>> +    -d  or --detect        Detect document type
>> +    -eX or --encoding=X    Use output encoding X
>> +    -pX or --password=X    Use document password X
>> +    -z  or --extract       Extract all attachements into current 
>>directory
>> +    --extract-dir=&lt;dir&gt;    Specify target directory for -z
>> +    -r  or --pretty-print  For XML and XHTML outputs, adds newlines and
>> +                           whitespace, for better readability
>> +
>> +    --create-profile=X
>> +         Create NGram profile, where X is a profile name
>> +    --list-parsers
>> +         List the available document parsers
>> +    --list-parser-details
>> +         List the available document parsers, and their supported mime 
>>types
>> +    --list-detectors
>> +         List the available document detectors
>> +    --list-met-models
>> +         List the available metadata models, and their supported keys
>> +    --list-supported-types
>> +         List all known media types and related information
>> +
>> +Description:
>> +    Apache Tika will parse the file(s) specified on the
>> +    command line and output the extracted text content
>> +    or metadata to standard output.
>> +
>> +    Instead of a file name you can also specify the URL
>> +    of a document to be parsed.
>> +
>> +    If no file name or URL is specified (or the special
>> +    name &quot;-&quot; is used), then the standard input stream
>> +    is parsed. If no arguments were given and no input
>> +    data is available, the GUI is started instead.
>> +
>> +- GUI mode
>> +
>> +    Use the &quot;--gui&quot; (or &quot;-g&quot;) option to start the
>> +    Apache Tika GUI. You can drag and drop files from
>> +    a normal file explorer to the GUI window to extract
>> +    text content and metadata from the files.
>> +
>> +- Server mode
>> +
>> +    Use the &quot;--server&quot; (or &quot;-s&quot;) option to start 
>>the
>> +    Apache Tika server. The server will listen to the
>> +    ports you specify as one or more arguments.</pre></div>
>> +<p>You can also use the jar as a component in a Unix pipeline or as an 
>>external tool in many scripting languages.</p>
>> +<div>
>> +<pre># Check if an Internet resource contains a specific keyword
>> +curl http://.../document.doc \
>> +  | java -jar tika-app.jar --text \
>> +  | grep -q keyword</pre></div></div>
>> +      </div>
>> +      <div id="sidebar">
>> +        <div id="navigation">
>> +                    <h5>Apache Tika</h5>
>> +            <ul>
>> +              
>> +    <li class="none">
>> +                    <a href="../index.html">Introduction</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../download.html">Download</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../contribute.html">Contribute</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../mail-lists.html">Mailing Lists</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="http://wiki.apache.org/tika/" 
>>class="externalLink">Tika Wiki</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a 
>>href="https://issues.apache.org/jira/browse/TIKA" 
>>class="externalLink">Issue Tracker</a>
>> +          </li>
>> +          </ul>
>> +              <h5>Documentation</h5>
>> +            <ul>
>> +              
>> +          
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="expanded">
>> +                    <a href="../1.5/index.html">Apache Tika 1.5</a>
>> +                  <ul>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/gettingstarted.html">Getting 
>>Started</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/formats.html">Supported Formats</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/parser.html">Parser API</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/parser_guide.html">Parser 5min 
>>Quick Start Guide</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/detection.html">Content and 
>>Language Detection</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/api/">API Documentation</a>
>> +          </li>
>> +              </ul>
>> +        </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.4/index.html">Apache Tika 1.4</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.3/index.html">Apache Tika 1.3</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.2/index.html">Apache Tika 1.2</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.1/index.html">Apache Tika 1.1</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.0/index.html">Apache Tika 1.0</a>
>> +                </li>
>> +          </ul>
>> +              <h5>The Apache Software Foundation</h5>
>> +            <ul>
>> +              
>> +    <li class="none">
>> +                    <a href="http://www.apache.org/foundation/" 
>>class="externalLink">About</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="http://www.apache.org/licenses/" 
>>class="externalLink">License</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="http://www.apache.org/security/" 
>>class="externalLink">Security</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a 
>>href="http://www.apache.org/foundation/sponsorship.html" 
>>class="externalLink">Sponsorship</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a 
>>href="http://www.apache.org/foundation/thanks.html" 
>>class="externalLink">Thanks</a>
>> +          </li>
>> +          </ul>
>> +      
>> +          <div id="search">
>> +            <h5>Search with Apache Solr</h5>
>> +            <form action="http://search.lucidimagination.com/p:tika"
>> +                  method="get" id="searchform">
>> +              <input type="text" id="query" name="q"/>
>> +              <select name="searchProvider" id="searchProvider">
>> +                <option value="any">provider</option>
>> +                <option value="lucid">Lucid Find</option>
>> +                <option value="sl">Search-Lucene</option>
>> +              </select>
>> +              <input type="submit" id="submit" value="Search" 
>>name="Search"
>> +                     onclick="selectProvider(this.form)"/>
>> +            </form>
>> +          </div>
>> +
>> +          <div id="bookpromo">
>> +            <h5>Books about Tika</h5>
>> +            <p>
>> +              <a href="http://manning.com/mattmann/" title="Tika in 
>>Action"
>> +                ><img src="../mattmann_cover150.jpg"
>> +                      width="150" height="186"/></a>
>> +            </p>
>> +          </div>
>> +        </div>
>> +      </div>
>> +      <div id="footer">
>> +        <p>
>> +          Copyright &#169; 2014
>> +          <a href="http://www.apache.org/">The Apache Software 
>>Foundation</a>.
>> +          Site powered by <a href="http://maven.apache.org/">Apache 
>>Maven</a>. 
>> +          Search powered by
>> +          <a href="http://www.lucidimagination.com">Lucid 
>>Imagination</a>
>> +          and <a href="http://sematext.com">Sematext</a>.
>> +          <br/>
>> +          Apache Tika, Tika, Apache, the Apache feather logo, and the 
>>Apache
>> +          Tika project logo are trademarks of The Apache Software 
>>Foundation.
>> +        </p>
>> +      </div>
>> +    </div>
>> +  </body>
>> +</html>
>> 
>> Added: tika/site/publish/1.6/parser.html
>> URL: 
>>http://svn.apache.org/viewvc/tika/site/publish/1.6/parser.html?rev=162276
>>2&view=auto
>> 
>>=========================================================================
>>=====
>> --- tika/site/publish/1.6/parser.html (added)
>> +++ tika/site/publish/1.6/parser.html Fri Sep  5 19:14:58 2014
>> @@ -0,0 +1,372 @@
>> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> +          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> +
>> +<!--
>> +  Licensed to the Apache Software Foundation (ASF) under one
>> +  or more contributor license agreements.  See the NOTICE file
>> +  distributed with this work for additional information
>> +  regarding copyright ownership.  The ASF licenses this file
>> +  to you under the Apache License, Version 2.0 (the
>> +  "License"); you may not use this file except in compliance
>> +  with the License.  You may obtain a copy of the License at
>> +
>> +    http://www.apache.org/licenses/LICENSE-2.0
>> + 
>> +  Unless required by applicable law or agreed to in writing,
>> +  software distributed under the License is distributed on an
>> +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
>> +  KIND, either express or implied.  See the License for the
>> +  specific language governing permissions and limitations
>> +  under the License.
>> +-->
>> +
>> +
>> +
>> +
>> +
>> +
>> +
>> +<html xmlns="http://www.w3.org/1999/xhtml">
>> +  <head>
>> +    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" 
>>/>
>> +    <title>Apache Tika - The Parser interface</title>
>> +    <style type="text/css" media="all">
>> +      @import url("../css/site.css");
>> +    </style>
>> +    <link rel="icon" type="image/png" href="../tikaNoText16.png" />
>> +    <script type="text/javascript">
>> +      function selectProvider(form) {
>> +        provider = form.elements['searchProvider'].value;
>> +        if (provider == "any") {
>> +          if (Math.random() > 0.5) {
>> +            provider = "lucid";
>> +          } else {
>> +            provider = "sl";
>> +          }
>> +        }
>> +        if (provider == "lucid") {
>> +          form.action = "http://find.searchhub.org/p:tika";
>> +        } else if (provider == "sl") {
>> +          form.action = "http://search-lucene.com/tika";
>> +        }
>> +        days = 90;
>> +        date = new Date();
>> +        date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
>> +        expires = "; expires=" + date.toGMTString();
>> +        document.cookie = "searchProvider=" + provider + expires + "; 
>>path=/";
>> +      }
>> +      function initProvider() {
>> +        if (document.cookie.length>0) {
>> +          cStart=document.cookie.indexOf("searchProvider=");
>> +          if (cStart!=-1) {
>> +            cStart=cStart + "searchProvider=".length;
>> +            cEnd=document.cookie.indexOf(";", cStart);
>> +            if (cEnd==-1) {
>> +              cEnd=document.cookie.length;
>> +            }
>> +            provider = 
>>unescape(document.cookie.substring(cStart,cEnd));
>> +            
>>document.forms['searchform'].elements['searchProvider'].value = provider;
>> +          }
>> +        }
>> +        document.forms['searchform'].elements['q'].focus();
>> +      }
>> +    </script>
>> +  </head>
>> +  <body onLoad="initProvider();">
>> +    <div id="body">
>> +      <div id="banner">
>> +        <a href="http://tika.apache.org" id="bannerLeft" title="Apache 
>>Tika"
>> +          ><img src="http://tika.apache.org/tika.png" alt="Apache Tika"
>> +                width="292" height="100"/></a>
>> +        <a href="http://www.apache.org/" id="bannerRight"
>> +           title="The Apache Software Foundation"
>> +          ><img src="http://tika.apache.org/asf-logo.gif" alt="The 
>>Apache Software Foundation"
>> +                width="387" height="100"/></a>
>> +      </div>
>> +      <div id="content">
>> +        <!-- Licensed to the Apache Software Foundation (ASF) under 
>>one or more --><!-- contributor license agreements.  See the NOTICE file 
>>distributed with --><!-- this work for additional information regarding 
>>copyright ownership. --><!-- The ASF licenses this file to You under the 
>>Apache License, Version 2.0 --><!-- (the "License"); you may not use 
>>this file except in compliance with --><!-- the License.  You may obtain 
>>a copy of the License at --><!--  --><!-- 
>>http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless 
>>required by applicable law or agreed to in writing, software --><!-- 
>>distributed under the License is distributed on an "AS IS" BASIS, 
>>--><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>>implied. --><!-- See the License for the specific language governing 
>>permissions and --><!-- limitations under the License. --><div 
>>class="section">
>> +<h2>The Parser interface<a name="The_Parser_interface"></a></h2>
>> +<p>The <a 
>>href="./api/org/apache/tika/parser/Parser.html">org.apache.tika.parser.Pa
>>rser</a> interface is the key concept of Apache Tika. It hides the 
>>complexity of different file formats and parsing libraries while 
>>providing a simple and powerful mechanism for client applications to 
>>extract structured text content and metadata from all sorts of 
>>documents. All this is achieved with a single method:</p>
>> +<div>
>> +<pre>void parse(
>> +    InputStream stream, ContentHandler handler, Metadata metadata,
>> +    ParseContext context) throws IOException, SAXException, 
>>TikaException;</pre></div>
>> +<p>The <tt>parse</tt> method takes the document to be parsed and 
>>related metadata as input and outputs the results as XHTML SAX events 
>>and extra metadata. The parse context argument is used to specify 
>>context information (like the current locale) that is not related to any 
>>individual document. The main criteria that lead to this design were:</p>
>> +<dl>
>> +<dt>Streamed parsing</dt>
>> +<dd>The interface should require neither the client application nor 
>>the parser implementation to keep the full document content in memory or 
>>spooled to disk. This allows even huge documents to be parsed without 
>>excessive resource requirements.</dd>
>> +<dt>Structured content</dt>
>> +<dd>A parser implementation should be able to include structural 
>>information (headings, links, etc.) in the extracted content. A client 
>>application can use this information for example to better judge the 
>>relevance of different parts of the parsed document.</dd>
>> +<dt>Input metadata</dt>
>> +<dd>A client application should be able to include metadata like the 
>>file name or declared content type with the document to be parsed. The 
>>parser implementation can use this information to better guide the 
>>parsing process.</dd>
>> +<dt>Output metadata</dt>
>> +<dd>A parser implementation should be able to return document metadata 
>>in addition to document content. Many document formats contain metadata 
>>like the name of the author that may be useful to client 
>>applications.</dd>
>> +<dt>Context sensitivity</dt>
>> +<dd>While the default settings and behaviour of Tika parsers should 
>>work well for most use cases, there are still situations where more 
>>fine-grained control over the parsing process is desirable. It should be 
>>easy to inject such context-specific information to the parsing process 
>>without breaking the layers of abstraction.</dd></dl>
>> +<p>These criteria are reflected in the arguments of the <tt>parse</tt> 
>>method.</p>
>> +<div class="section">
>> +<h3>Document input stream<a name="Document_input_stream"></a></h3>
>> +<p>The first argument is an <a class="externalLink" 
>>href="http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html">
>>InputStream</a> for reading the document to be parsed.</p>
>> +<p>If this document stream cannot be read, then parsing stops and the 
>>thrown <a class="externalLink" 
>>href="http://docs.oracle.com/javase/6/docs/api/java/io/IOException.html">
>>IOException</a> is passed up to the client application. If the stream 
>>can be read but not parsed (for example if the document is corrupted), 
>>then the parser throws a <a 
>>href="./api/org/apache/tika/exception/TikaException.html">TikaException</
>>a>.</p>
>> +<p>The parser implementation will consume this stream but <i>will not 
>>close it</i>. Closing the stream is the responsibility of the client 
>>application that opened it in the first place. The recommended pattern 
>>for using streams with the <tt>parse</tt> method is:</p>
>> +<div>
>> +<pre>InputStream stream = ...;      // open the stream
>> +try {
>> +    parser.parse(stream, ...); // parse the stream
>> +} finally {
>> +    stream.close();            // close the stream
>> +}</pre></div>
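
Inline note on the stream pattern above: a minimal, JDK-only sketch of the same ownership rule, in case it helps anyone reviewing the page. Everything named here is hypothetical and is not Tika API: consume() merely stands in for parser.parse(stream, ...), and TrackingStream exists only so the close can be observed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class StreamOwnership {
    // Hypothetical helper: a stream that records whether close() was called.
    public static final class TrackingStream extends ByteArrayInputStream {
        public boolean closed = false;

        public TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Stand-in for parser.parse(stream, ...): reads to end of stream
    // and deliberately does NOT close it.
    public static int consume(InputStream stream) throws IOException {
        int count = 0;
        while (stream.read() != -1) {
            count++;
        }
        return count;
    }

    // Caller side: whoever opened the stream closes it, even on failure.
    public static int parseAndClose(TrackingStream stream) {
        try {
            try {
                return consume(stream);
            } finally {
                stream.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackingStream stream = new TrackingStream(new byte[] {1, 2, 3});
        System.out.println(parseAndClose(stream) + " bytes read, closed=" + stream.closed);
    }
}
```

The finally block is what makes this safe: the stream gets closed whether the consumer returns normally or throws.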
>> +<p>Some document formats like the OLE2 Compound Document Format used 
>>by Microsoft Office are best parsed as random access files. In such 
>>cases the content of the input stream is automatically spooled to a 
>>temporary file that gets removed once parsed. A future version of Tika 
>>may make it possible to avoid this extra file if the input document is 
>>already a file in the local file system. See <a class="externalLink" 
>>href="https://issues.apache.org/jira/browse/TIKA-153">TIKA-153</a> for 
>>the status of this feature request.</p></div>
>> +<div class="section">
>> +<h3>XHTML SAX events<a name="XHTML_SAX_events"></a></h3>
>> +<p>The parsed content of the document stream is returned to the client 
>>application as a sequence of XHTML SAX events. XHTML is used to express 
>>structured content of the document and SAX events enable streamed 
>>processing. Note that the XHTML format is used here only to convey 
>>structural information, not to render the documents for browsing!</p>
>> +<p>The XHTML SAX events produced by the parser implementation are sent 
>>to a <a class="externalLink" 
>>href="http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler
>>.html">ContentHandler</a> instance given to the <tt>parse</tt> method. 
>>If the content handler fails to process an event, then parsing 
>>stops and the thrown <a class="externalLink" 
>>href="http://docs.oracle.com/javase/6/docs/api/org/xml/sax/SAXException.h
>>tml">SAXException</a> is passed up to the client application.</p>
>> +<p>The overall structure of the generated event stream is (with 
>>indenting added for clarity):</p>
>> +<div>
>> +<pre>&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;
>> +  &lt;head&gt;
>> +    &lt;title&gt;...&lt;/title&gt;
>> +  &lt;/head&gt;
>> +  &lt;body&gt;
>> +    ...
>> +  &lt;/body&gt;
>> +&lt;/html&gt;</pre></div>
>> +<p>Parser implementations typically use the <a 
>>href="./apidocs/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLConten
>>tHandler</a> utility class to generate the XHTML output.</p>
>> +<p>Dealing with the raw SAX events can be a bit complex, so Apache 
>>Tika comes with a number of utility classes that can be used to process 
>>and convert the event stream to other representations.</p>
>> +<p>For example, the <a 
>>href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandl
>>er</a> class can be used to extract just the body part of the XHTML 
>>output and feed it either as SAX events to another content handler or as 
>>characters to an output stream, a writer, or simply a string. The 
>>following code snippet parses a document from the standard input stream 
>>and outputs the extracted text content to standard output:</p>
>> +<div>
>> +<pre>ContentHandler handler = new BodyContentHandler(System.out);
>> +parser.parse(System.in, handler, ...);</pre></div>
>> +<p>Another useful class is <a 
>>href="./api/org/apache/tika/parser/ParsingReader.html">ParsingReader</a> 
>>that uses a background thread to parse the document and returns the 
>>extracted text content as a character stream:</p>
>> +<div>
>> +<pre>InputStream stream = ...; // the document to be parsed
>> +Reader reader = new ParsingReader(parser, stream, ...);
>> +try {
>> +    ...;                  // read the document text using the reader
>> +} finally {
>> +    reader.close();       // the document stream is closed 
>>automatically
>> +}</pre></div></div>
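
To make the event flow concrete, here is a small JDK-only sketch of a content handler that keeps only the text inside the body element, roughly the idea the page attributes to BodyContentHandler. It is not Tika's implementation: BodyTextSketch and extract are hypothetical names, and the events come from the JDK's own SAX parser reading an XHTML string rather than from a Tika parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextSketch extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // > 0 while inside <body> or one of its descendants

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(localName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    /** Feeds the XHTML string through a namespace-aware SAX parser and returns the body text. */
    public static String extract(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName is populated
        BodyTextSketch handler = new BodyTextSketch();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }
}
```

Text inside head (the title, for instance) arrives as character events too, but is skipped because no body element is open at that point.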
>> +<div class="section">
>> +<h3>Document metadata<a name="Document_metadata"></a></h3>
>> +<p>The third argument to the <tt>parse</tt> method is used to pass 
>>document metadata both in and out of the parser. Document metadata is 
>>expressed as an <a 
>>href="./api/org/apache/tika/metadata/Metadata.html">Metadata</a> 
>>object.</p>
>> +<p>The following are some of the more interesting metadata 
>>properties:</p>
>> +<dl>
>> +<dt>Metadata.RESOURCE_NAME_KEY</dt>
>> +<dd>The name of the file or resource that contains the document.
>> +<p>A client application can set this property to allow the parser to 
>>use file name heuristics to determine the format of the document.</p>
>> +<p>The parser implementation may set this property if the file format 
>>contains the canonical name of the file (for example the Gzip format has 
>>a slot for the file name).</p></dd>
>> +<dt>Metadata.CONTENT_TYPE</dt>
>> +<dd>The declared content type of the document.
>> +<p>A client application can set this property based on, for example, 
>>an HTTP Content-Type header. The declared content type may help the 
>>parser to correctly interpret the document.</p>
>> +<p>The parser implementation sets this property to the content type 
>>according to which the document was parsed.</p></dd>
>> +<dt>Metadata.TITLE</dt>
>> +<dd>The title of the document.
>> +<p>The parser implementation sets this property if the document format 
>>contains an explicit title field.</p></dd>
>> +<dt>Metadata.AUTHOR</dt>
>> +<dd>The name of the author of the document.
>> +<p>The parser implementation sets this property if the document format 
>>contains an explicit author field.</p></dd></dl>
>> +<p>Note that metadata handling is still being discussed by the Tika 
>>development team, and it is likely that there will be some (backwards 
>>incompatible) changes in metadata handling before Tika 1.0.</p></div>
>> +<div class="section">
>> +<h3>Parse context<a name="Parse_context"></a></h3>
>> +<p>The final argument to the <tt>parse</tt> method is used to inject 
>>context-specific information to the parsing process. This is useful for 
>>example when dealing with locale-specific date and number formats in 
>>Microsoft Excel spreadsheets. Another important use of the parse context 
>>is passing in the delegate parser instance to be used by two-phase 
>>parsers like the <a 
>>href="./api/org/apache/parser/pkg/PackageParser.html">PackageParser</a> 
>>subclasses. Some parser classes allow customization of the parsing 
>>process through strategy objects in the parse context.</p></div>
>> +<div class="section">
>> +<h3>Parser implementations<a name="Parser_implementations"></a></h3>
>> +<p>Apache Tika comes with a number of parser classes for parsing <a 
>>href="./formats.html">various document formats</a>. You can also extend 
>>Tika with your own parsers, and of course any contributions to Tika are 
>>warmly welcome.</p>
>> +<p>The goal of Tika is to reuse existing parser libraries like <a 
>>class="externalLink" href="http://www.pdfbox.org/";>PDFBox</a> or <a 
>>class="externalLink" href="http://poi.apache.org/";>Apache POI</a> as 
>>much as possible, and so most of the parser classes in Tika are adapters 
>>to such external libraries.</p>
>> +<p>Tika also contains some general purpose parser implementations that 
>>are not targeted at any specific document formats. The most notable of 
>>these is the <a 
>>href="./apidocs/org/apache/tika/parser/AutoDetectParser.html">AutoDetectP
>>arser</a> class that encapsulates all Tika functionality into a single 
>>parser that can handle any type of document. This parser will 
>>automatically determine the type of the incoming document based on 
>>various heuristics and will then parse the document 
>>accordingly.</p></div></div>
>> +      </div>
>> +      <div id="sidebar">
>> +        <div id="navigation">
>> +                    <h5>Apache Tika</h5>
>> +            <ul>
>> +              
>> +    <li class="none">
>> +                    <a href="../index.html">Introduction</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../download.html">Download</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../contribute.html">Contribute</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../mail-lists.html">Mailing Lists</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="http://wiki.apache.org/tika/" 
>>class="externalLink">Tika Wiki</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a 
>>href="https://issues.apache.org/jira/browse/TIKA"; 
>>class="externalLink">Issue Tracker</a>
>> +          </li>
>> +          </ul>
>> +              <h5>Documentation</h5>
>> +            <ul>
>> +              
>> +          
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="expanded">
>> +                    <a href="../1.5/index.html">Apache Tika 1.5</a>
>> +                  <ul>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/gettingstarted.html">Getting 
>>Started</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/formats.html">Supported Formats</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/parser.html">Parser API</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/parser_guide.html">Parser 5min 
>>Quick Start Guide</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/detection.html">Content and 
>>Language Detection</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/api/">API Documentation</a>
>> +          </li>
>> +              </ul>
>> +        </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.4/index.html">Apache Tika 1.4</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.3/index.html">Apache Tika 1.3</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.2/index.html">Apache Tika 1.2</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.1/index.html">Apache Tika 1.1</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.0/index.html">Apache Tika 1.0</a>
>> +                </li>
>> +          </ul>
>> +              <h5>The Apache Software Foundation</h5>
>> +            <ul>
>> +              
>> +    <li class="none">
>> +                    <a href="http://www.apache.org/foundation/" 
>>class="externalLink">About</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="http://www.apache.org/licenses/" 
>>class="externalLink">License</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="http://www.apache.org/security/" 
>>class="externalLink">Security</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a 
>>href="http://www.apache.org/foundation/sponsorship.html"; 
>>class="externalLink">Sponsorship</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a 
>>href="http://www.apache.org/foundation/thanks.html"; 
>>class="externalLink">Thanks</a>
>> +          </li>
>> +          </ul>
>> +      
>> +          <div id="search">
>> +            <h5>Search with Apache Solr</h5>
>> +            <form action="http://search.lucidimagination.com/p:tika"
>> +                  method="get" id="searchform">
>> +              <input type="text" id="query" name="q"/>
>> +              <select name="searchProvider" id="searchProvider">
>> +                <option value="any">provider</option>
>> +                <option value="lucid">Lucid Find</option>
>> +                <option value="sl">Search-Lucene</option>
>> +              </select>
>> +              <input type="submit" id="submit" value="Search" 
>>name="Search"
>> +                     onclick="selectProvider(this.form)"/>
>> +            </form>
>> +          </div>
>> +
>> +          <div id="bookpromo">
>> +            <h5>Books about Tika</h5>
>> +            <p>
>> +              <a href="http://manning.com/mattmann/" title="Tika in 
>>Action"
>> +                ><img src="../mattmann_cover150.jpg"
>> +                      width="150" height="186"/></a>
>> +            </p>
>> +          </div>
>> +        </div>
>> +      </div>
>> +      <div id="footer">
>> +        <p>
>> +          Copyright &#169; 2014
>> +          <a href="http://www.apache.org/">The Apache Software 
>>Foundation</a>.
>> +          Site powered by <a href="http://maven.apache.org/">Apache 
>>Maven</a>. 
>> +          Search powered by
>> +          <a href="http://www.lucidimagination.com">Lucid 
>>Imagination</a>
>> +          and <a href="http://sematext.com">Sematext</a>.
>> +          <br/>
>> +          Apache Tika, Tika, Apache, the Apache feather logo, and the 
>>Apache
>> +          Tika project logo are trademarks of The Apache Software 
>>Foundation.
>> +        </p>
>> +      </div>
>> +    </div>
>> +  </body>
>> +</html>
>> 
>> Added: tika/site/publish/1.6/parser_guide.html
>> URL: 
>>http://svn.apache.org/viewvc/tika/site/publish/1.6/parser_guide.html?rev=
>>1622762&view=auto
>> 
>>=========================================================================
>>=====
>> --- tika/site/publish/1.6/parser_guide.html (added)
>> +++ tika/site/publish/1.6/parser_guide.html Fri Sep  5 19:14:58 2014
>> @@ -0,0 +1,373 @@
>> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> +          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> +
>> +<!--
>> +  Licensed to the Apache Software Foundation (ASF) under one
>> +  or more contributor license agreements.  See the NOTICE file
>> +  distributed with this work for additional information
>> +  regarding copyright ownership.  The ASF licenses this file
>> +  to you under the Apache License, Version 2.0 (the
>> +  "License"); you may not use this file except in compliance
>> +  with the License.  You may obtain a copy of the License at
>> +
>> +    http://www.apache.org/licenses/LICENSE-2.0
>> + 
>> +  Unless required by applicable law or agreed to in writing,
>> +  software distributed under the License is distributed on an
>> +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
>> +  KIND, either express or implied.  See the License for the
>> +  specific language governing permissions and limitations
>> +  under the License.
>> +-->
>> +
>> +
>> +
>> +
>> +
>> +
>> +
>> +<html xmlns="http://www.w3.org/1999/xhtml";>
>> +  <head>
>> +    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" 
>>/>
>> +    <title>Apache Tika - Get Tika parsing up and running in 5 
>>minutes</title>
>> +    <style type="text/css" media="all">
>> +      @import url("../css/site.css");
>> +    </style>
>> +    <link rel="icon" type="image/png" href="../tikaNoText16.png" />
>> +    <script type="text/javascript">
>> +      function selectProvider(form) {
>> +        provider = form.elements['searchProvider'].value;
>> +        if (provider == "any") {
>> +          if (Math.random() > 0.5) {
>> +            provider = "lucid";
>> +          } else {
>> +            provider = "sl";
>> +          }
>> +        }
>> +        if (provider == "lucid") {
>> +          form.action = "http://find.searchhub.org/p:tika";
>> +        } else if (provider == "sl") {
>> +          form.action = "http://search-lucene.com/tika";
>> +        }
>> +        days = 90;
>> +        date = new Date();
>> +        date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
>> +        expires = "; expires=" + date.toGMTString();
>> +        document.cookie = "searchProvider=" + provider + expires + "; 
>>path=/";
>> +      }
>> +      function initProvider() {
>> +        if (document.cookie.length>0) {
>> +          cStart=document.cookie.indexOf("searchProvider=");
>> +          if (cStart!=-1) {
>> +            cStart=cStart + "searchProvider=".length;
>> +            cEnd=document.cookie.indexOf(";", cStart);
>> +            if (cEnd==-1) {
>> +              cEnd=document.cookie.length;
>> +            }
>> +            provider = 
>>unescape(document.cookie.substring(cStart,cEnd));
>> +            
>>document.forms['searchform'].elements['searchProvider'].value = provider;
>> +          }
>> +        }
>> +        document.forms['searchform'].elements['q'].focus();
>> +      }
>> +    </script>
>> +  </head>
>> +  <body onLoad="initProvider();">
>> +    <div id="body">
>> +      <div id="banner">
>> +        <a href="http://tika.apache.org" id="bannerLeft" title="Apache 
>>Tika"
>> +          ><img src="http://tika.apache.org/tika.png" alt="Apache Tika"
>> +                width="292" height="100"/></a>
>> +        <a href="http://www.apache.org/" id="bannerRight"
>> +           title="The Apache Software Foundation"
>> +          ><img src="http://tika.apache.org/asf-logo.gif" alt="The 
>>Apache Software Foundation"
>> +                width="387" height="100"/></a>
>> +      </div>
>> +      <div id="content">
>> +        <!-- Licensed to the Apache Software Foundation (ASF) under 
>>one or more --><!-- contributor license agreements.  See the NOTICE file 
>>distributed with --><!-- this work for additional information regarding 
>>copyright ownership. --><!-- The ASF licenses this file to You under the 
>>Apache License, Version 2.0 --><!-- (the "License"); you may not use 
>>this file except in compliance with --><!-- the License.  You may obtain 
>>a copy of the License at --><!--  --><!-- 
>>http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless 
>>required by applicable law or agreed to in writing, software --><!-- 
>>distributed under the License is distributed on an "AS IS" BASIS, 
>>--><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>>implied. --><!-- See the License for the specific language governing 
>>permissions and --><!-- limitations under the License. --><div 
>>class="section">
>> +<h2>Get Tika parsing up and running in 5 minutes<a 
>>name="Get_Tika_parsing_up_and_running_in_5_minutes"></a></h2>
>> +<p>This page is a quick start guide showing how to add a new parser to 
>>Apache Tika. By following the simple steps listed below, you can have 
>>your new parser running in only 5 minutes.</p>
>> +<ul>
>> +<li><a href="#Get_Tika_parsing_up_and_running_in_5_minutes">Get Tika 
>>parsing up and running in 5 minutes</a>
>> +<ul>
>> +<li><a href="#Getting_Started">Getting Started</a></li>
>> +<li><a href="#Add_your_MIME-Type">Add your MIME-Type</a></li>
>> +<li><a href="#Create_your_Parser_class">Create your Parser 
>>class</a></li>
>> +<li><a href="#List_the_new_parser">List the new 
>>parser</a></li></ul></li></ul>
>> +<div class="section">
>> +<h3><a name="Getting_Started">Getting Started</a></h3>
>> +<p>The <a href="./gettingstarted.html">Getting Started</a> document 
>>describes how to build Apache Tika from sources and how to start using 
>>Tika in an application. Pay close attention and follow the instructions 
>>in the &quot;Getting and building the sources&quot; section.</p></div>
>> +<div class="section">
>> +<h3><a name="Add_your_MIME-Type">Add your MIME-Type</a></h3>
>> +<p>Tika loads the core, standard MIME-Types from the file 
>>&quot;org/apache/tika/mime/tika-mimetypes.xml&quot;, which comes from <a 
>>class="externalLink" 
>>href="http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resou
>>rces/org/apache/tika/mime/tika-mimetypes.xml">tika-core/src/main/resource
>>s/org/apache/tika/mime/tika-mimetypes.xml</a> . If your new MIME-Type is 
>>a standard one which is missing from Tika, submit a patch for this 
>>file!</p>
>> +<p>If your MIME-Type needs adding, create a new file 
>>&quot;org/apache/tika/mime/custom-mimetypes.xml&quot; in your codebase. 
>>You should add to it something like this:</p>
>> +<div>
>> +<pre> &lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
>> + &lt;mime-info&gt;
>> +   &lt;mime-type type=&quot;application/hello&quot;&gt;
>> +          &lt;glob pattern=&quot;*.hi&quot;/&gt;
>> +   &lt;/mime-type&gt;
>> + &lt;/mime-info&gt;</pre></div></div>
>> +<div class="section">
>> +<h3><a name="Create_your_Parser_class">Create your Parser 
>>class</a></h3>
>> +<p>Now, you need to create your new parser. This is a class that must 
>>implement the Parser interface offered by Tika. Instead of implementing 
>>the Parser interface directly, it is recommended that you extend the 
>>abstract class AbstractParser if possible. AbstractParser handles 
>>translating between API changes for you.</p>
>> +<p>A very simple Tika Parser looks like this:</p>
>> +<div>
>> +<pre>/*
>> + * Licensed to the Apache Software Foundation (ASF) under one or more
>> + * contributor license agreements.  See the NOTICE file distributed 
>>with
>> + * this work for additional information regarding copyright ownership.
>> + * The ASF licenses this file to You under the Apache License, Version 
>>2.0
>> + * (the &quot;License&quot;); you may not use this file except in 
>>compliance with
>> + * the License.  You may obtain a copy of the License at
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an &quot;AS 
>>IS&quot; BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
>>implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + * 
>> + * @Author: Arturo Beltran
>> + */
>> +package org.apache.tika.parser.hello;
>> +
>> +import java.io.IOException;
>> +import java.io.InputStream;
>> +import java.util.Collections;
>> +import java.util.Set;
>> +
>> +import org.apache.tika.exception.TikaException;
>> +import org.apache.tika.metadata.Metadata;
>> +import org.apache.tika.mime.MediaType;
>> +import org.apache.tika.parser.ParseContext;
>> +import org.apache.tika.parser.AbstractParser;
>> +import org.apache.tika.sax.XHTMLContentHandler;
>> +import org.xml.sax.ContentHandler;
>> +import org.xml.sax.SAXException;
>> +
>> +public class HelloParser extends AbstractParser {
>> +
>> +        private static final Set&lt;MediaType&gt; SUPPORTED_TYPES = 
>>Collections.singleton(MediaType.application(&quot;hello&quot;));
>> +        public static final String HELLO_MIME_TYPE = 
>>&quot;application/hello&quot;;
>> +        
>> +        public Set&lt;MediaType&gt; getSupportedTypes(ParseContext 
>>context) {
>> +                return SUPPORTED_TYPES;
>> +        }
>> +
>> +        public void parse(
>> +                        InputStream stream, ContentHandler handler,
>> +                        Metadata metadata, ParseContext context)
>> +                        throws IOException, SAXException, 
>>TikaException {
>> +
>> +                metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
>> +                metadata.set(&quot;Hello&quot;, &quot;World&quot;);
>> +
>> +                XHTMLContentHandler xhtml = new 
>>XHTMLContentHandler(handler, metadata);
>> +                xhtml.startDocument();
>> +                xhtml.endDocument();
>> +        }
>> +}</pre></div>
>> +<p>Pay special attention to the definition of the SUPPORTED_TYPES 
>>static class field in the parser class, which defines the MIME-Types it 
>>supports. If your MIME-Types aren't standard ones, ensure you have 
>>listed them in a &quot;custom-mimetypes.xml&quot; file so that Tika 
>>knows about them (see above).</p>
>> +<p>The &quot;parse&quot; method is where you will do all your work: 
>>extract the information from the resource and then set the 
>>metadata.</p></div>
>> +<div class="section">
>> +<h3><a name="List_the_new_parser">List the new parser</a></h3>
>> +<p>Finally, you should explicitly tell the AutoDetectParser to include 
>>your new parser. This step is only needed if you want to use the 
>>AutoDetectParser functionality. If you figure out the correct parser in 
>>a different way, it isn't needed. </p>
>> +<p>List your new parser in: <a class="externalLink" 
>>href="http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/re
>>sources/META-INF/services/org.apache.tika.parser.Parser">tika-parsers/src
>>/main/resources/META-INF/services/org.apache.tika.parser.Parser</a></p></
>>div></div>
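
On the registration step: the linked file sits under META-INF/services, so it presumably follows the standard Java service-provider layout (one fully qualified class name per line, with '#' starting a comment). Under that assumption, registering the example parser means adding the line org.apache.tika.parser.hello.HelloParser. The sketch below shows how text in that layout can be read; ServicesFileSketch and providerNames are hypothetical names for illustration, not Tika's actual loader.

```java
import java.util.ArrayList;
import java.util.List;

public class ServicesFileSketch {

    /** Returns the class names listed in the text, ignoring '#' comments and blank lines. */
    static List<String> providerNames(String fileContent) {
        List<String> names = new ArrayList<String>();
        for (String line : fileContent.split("\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the trailing comment
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String content = "# custom parsers\n"
                + "org.apache.tika.parser.hello.HelloParser\n";
        System.out.println(providerNames(content));
    }
}
```
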
>> +      </div>
>> +      <div id="sidebar">
>> +        <div id="navigation">
>> +                    <h5>Apache Tika</h5>
>> +            <ul>
>> +              
>> +    <li class="none">
>> +                    <a href="../index.html">Introduction</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../download.html">Download</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../contribute.html">Contribute</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="../mail-lists.html">Mailing Lists</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="http://wiki.apache.org/tika/" 
>>class="externalLink">Tika Wiki</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a 
>>href="https://issues.apache.org/jira/browse/TIKA"; 
>>class="externalLink">Issue Tracker</a>
>> +          </li>
>> +          </ul>
>> +              <h5>Documentation</h5>
>> +            <ul>
>> +              
>> +          
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="expanded">
>> +                    <a href="../1.5/index.html">Apache Tika 1.5</a>
>> +                  <ul>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/gettingstarted.html">Getting 
>>Started</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/formats.html">Supported Formats</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/parser.html">Parser API</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/parser_guide.html">Parser 5min 
>>Quick Start Guide</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/detection.html">Content and 
>>Language Detection</a>
>> +          </li>
>> +                  
>> +    <li class="none">
>> +                    <a href="../1.5/api/">API Documentation</a>
>> +          </li>
>> +              </ul>
>> +        </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.4/index.html">Apache Tika 1.4</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.3/index.html">Apache Tika 1.3</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.2/index.html">Apache Tika 1.2</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.1/index.html">Apache Tika 1.1</a>
>> +                </li>
>> +              
>> +                
>> +                    
>> +                  
>> +                  
>> +                  
>> +                  
>> +                  
>> +              
>> +        <li class="collapsed">
>> +                    <a href="../1.0/index.html">Apache Tika 1.0</a>
>> +                </li>
>> +          </ul>
>> +              <h5>The Apache Software Foundation</h5>
>> +            <ul>
>> +              
>> +    <li class="none">
>> +                    <a href="http://www.apache.org/foundation/" 
>>class="externalLink">About</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="http://www.apache.org/licenses/" 
>>class="externalLink">License</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a href="http://www.apache.org/security/" 
>>class="externalLink">Security</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a 
>>href="http://www.apache.org/foundation/sponsorship.html"; 
>>class="externalLink">Sponsorship</a>
>> +          </li>
>> +              
>> +    <li class="none">
>> +                    <a 
>>href="http://www.apache.org/foundation/thanks.html"; 
>>class="externalLink">Thanks</a>
>> +          </li>
>> +          </ul>
>> +      
>> +          <div id="search">
>> +            <h5>Search with Apache Solr</h5>
>> +            <form action="http://search.lucidimagination.com/p:tika"
>> +                  method="get" id="searchform">
>> +              <input type="text" id="query" name="q"/>
>> +              <select name="searchProvider" id="searchProvider">
>> +                <option value="any">provider</option>
>> +                <option value="lucid">Lucid Find</option>
>> +                <option value="sl">Search-Lucene</option>
>> +              </select>
>> +              <input type="submit" id="submit" value="Search" 
>>name="Search"
>> +                     onclick="selectProvider(this.form)"/>
>> +            </form>
>> +          </div>
>> +
>> +          <div id="bookpromo">
>> +            <h5>Books about Tika</h5>
>> +            <p>
>> +              <a href="http://manning.com/mattmann/" title="Tika in 
>>Action"
>> +                ><img src="../mattmann_cover150.jpg"
>> +                      width="150" height="186"/></a>
>> +            </p>
>> +          </div>
>> +        </div>
>> +      </div>
>> +      <div id="footer">
>> +        <p>
>> +          Copyright &#169; 2014
>> +          <a href="http://www.apache.org/">The Apache Software 
>>Foundation</a>.
>> +          Site powered by <a href="http://maven.apache.org/">Apache 
>>Maven</a>. 
>> +          Search powered by
>> +          <a href="http://www.lucidimagination.com">Lucid 
>>Imagination</a>
>> +          and <a href="http://sematext.com">Sematext</a>.
>> +          <br/>
>> +          Apache Tika, Tika, Apache, the Apache feather logo, and the 
>>Apache
>> +          Tika project logo are trademarks of The Apache Software 
>>Foundation.
>> +        </p>
>> +      </div>
>> +    </div>
>> +  </body>
>> +</html>
>> 
>> 
