Nick, I'm working on this. Can you hold off?
> On Sep 5, 2014, at 12:15 PM, "n...@apache.org" <n...@apache.org> wrote: > > Author: nick > Date: Fri Sep 5 19:14:58 2014 > New Revision: 1622762 > > URL: http://svn.apache.org/r1622762 > Log: > Republish the site > > Added: > tika/site/publish/1.6/detection.html > tika/site/publish/1.6/gettingstarted.html > tika/site/publish/1.6/parser.html > tika/site/publish/1.6/parser_guide.html > tika/site/publish/1.7/ > tika/site/publish/1.7/examples.html > tika/site/publish/1.7/formats.html > Modified: > tika/site/publish/1.4/gettingstarted.html > tika/site/publish/1.5/gettingstarted.html > tika/site/publish/1.6/formats.html > tika/site/publish/index.html > > Modified: tika/site/publish/1.4/gettingstarted.html > URL: > http://svn.apache.org/viewvc/tika/site/publish/1.4/gettingstarted.html?rev=1622762&r1=1622761&r2=1622762&view=diff > ============================================================================== > --- tika/site/publish/1.4/gettingstarted.html (original) > +++ tika/site/publish/1.4/gettingstarted.html Fri Sep 5 19:14:58 2014 > @@ -94,13 +94,13 @@ > <div> > <pre>mvn install</pre></div> > <p>See the Maven documentation for more information about the available build > options.</p> > -<p>Note that you need Java 5 or higher to build Tika.</p></div> > +<p>Note that you need Java 6 or higher to build Tika.</p></div> > <div class="section"> > <h2>Build artifacts<a name="Build_artifacts"></a></h2> > <p>The Tika build consists of a number of components and produces the > following main binaries:</p> > <dl> > <dt>tika-core/target/tika-core-*.jar</dt> > -<dd> Tika core library. Contains the core interfaces and classes of Tika, > but none of the parser implementations. Depends only on Java 5.</dd> > +<dd> Tika core library. Contains the core interfaces and classes of Tika, > but none of the parser implementations. Depends only on Java 6.</dd> > <dt>tika-parsers/target/tika-parsers-*.jar</dt> > <dd> Tika parsers. 
Collection of classes that implement the Tika Parser > interface based on various external parser libraries.</dd> > <dt>tika-app/target/tika-app-*.jar</dt> > > Modified: tika/site/publish/1.5/gettingstarted.html > URL: > http://svn.apache.org/viewvc/tika/site/publish/1.5/gettingstarted.html?rev=1622762&r1=1622761&r2=1622762&view=diff > ============================================================================== > --- tika/site/publish/1.5/gettingstarted.html (original) > +++ tika/site/publish/1.5/gettingstarted.html Fri Sep 5 19:14:58 2014 > @@ -94,13 +94,13 @@ > <div> > <pre>mvn install</pre></div> > <p>See the Maven documentation for more information about the available build > options.</p> > -<p>Note that you need Java 5 or higher to build Tika.</p></div> > +<p>Note that you need Java 6 or higher to build Tika.</p></div> > <div class="section"> > <h2>Build artifacts<a name="Build_artifacts"></a></h2> > <p>The Tika build consists of a number of components and produces the > following main binaries:</p> > <dl> > <dt>tika-core/target/tika-core-*.jar</dt> > -<dd> Tika core library. Contains the core interfaces and classes of Tika, > but none of the parser implementations. Depends only on Java 5.</dd> > +<dd> Tika core library. Contains the core interfaces and classes of Tika, > but none of the parser implementations. Depends only on Java 6.</dd> > <dt>tika-parsers/target/tika-parsers-*.jar</dt> > <dd> Tika parsers. 
Collection of classes that implement the Tika Parser > interface based on various external parser libraries.</dd> > <dt>tika-app/target/tika-app-*.jar</dt> > > Added: tika/site/publish/1.6/detection.html > URL: > http://svn.apache.org/viewvc/tika/site/publish/1.6/detection.html?rev=1622762&view=auto > ============================================================================== > --- tika/site/publish/1.6/detection.html (added) > +++ tika/site/publish/1.6/detection.html Fri Sep 5 19:14:58 2014 > @@ -0,0 +1,357 @@ > +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" > + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > + > +<!-- > + Licensed to the Apache Software Foundation (ASF) under one > + or more contributor license agreements. See the NOTICE file > + distributed with this work for additional information > + regarding copyright ownership. The ASF licenses this file > + to you under the Apache License, Version 2.0 (the > + "License"); you may not use this file except in compliance > + with the License. You may obtain a copy of the License at > + > + http://www.apache.org/licenses/LICENSE-2.0 > + > + Unless required by applicable law or agreed to in writing, > + software distributed under the License is distributed on an > + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY > + KIND, either express or implied. See the License for the > + specific language governing permissions and limitations > + under the License. 
> +--> > + > + > + > + > + > + > + > +<html xmlns="http://www.w3.org/1999/xhtml"> > + <head> > + <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> > + <title>Apache Tika - Content Detection</title> > + <style type="text/css" media="all"> > + @import url("../css/site.css"); > + </style> > + <link rel="icon" type="image/png" href="../tikaNoText16.png" /> > + <script type="text/javascript"> > + function selectProvider(form) { > + provider = form.elements['searchProvider'].value; > + if (provider == "any") { > + if (Math.random() > 0.5) { > + provider = "lucid"; > + } else { > + provider = "sl"; > + } > + } > + if (provider == "lucid") { > + form.action = "http://find.searchhub.org/p:tika"; > + } else if (provider == "sl") { > + form.action = "http://search-lucene.com/tika"; > + } > + days = 90; > + date = new Date(); > + date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000)); > + expires = "; expires=" + date.toGMTString(); > + document.cookie = "searchProvider=" + provider + expires + "; > path=/"; > + } > + function initProvider() { > + if (document.cookie.length>0) { > + cStart=document.cookie.indexOf("searchProvider="); > + if (cStart!=-1) { > + cStart=cStart + "searchProvider=".length; > + cEnd=document.cookie.indexOf(";", cStart); > + if (cEnd==-1) { > + cEnd=document.cookie.length; > + } > + provider = unescape(document.cookie.substring(cStart,cEnd)); > + document.forms['searchform'].elements['searchProvider'].value = > provider; > + } > + } > + document.forms['searchform'].elements['q'].focus(); > + } > + </script> > + </head> > + <body onLoad="initProvider();"> > + <div id="body"> > + <div id="banner"> > + <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika" > + ><img src="http://tika.apache.org/tika.png" alt="Apache Tika" > + width="292" height="100"/></a> > + <a href="http://www.apache.org/" id="bannerRight" > + title="The Apache Software Foundation" > + ><img src="http://tika.apache.org/asf-logo.gif" alt="The Apache 
> Software Foundation" > + width="387" height="100"/></a> > + </div> > + <div id="content"> > + <!-- Licensed to the Apache Software Foundation (ASF) under one or > more --><!-- contributor license agreements. See the NOTICE file distributed > with --><!-- this work for additional information regarding copyright > ownership. --><!-- The ASF licenses this file to You under the Apache > License, Version 2.0 --><!-- (the "License"); you may not use this file > except in compliance with --><!-- the License. You may obtain a copy of the > License at --><!-- --><!-- http://www.apache.org/licenses/LICENSE-2.0 > --><!-- --><!-- Unless required by applicable law or agreed to in writing, > software --><!-- distributed under the License is distributed on an "AS IS" > BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express > or implied. --><!-- See the License for the specific language governing > permissions and --><!-- limitations under the License. --><div > class="section"> > +<h2>Content Detection<a name="Content_Detection"></a></h2> > +<p>This page gives you information on how content and language detection > works with Apache Tika, and how to tune the behaviour of Tika.</p> > +<ul> > +<li><a href="#Content_Detection">Content Detection</a> > +<ul> > +<li><a href="#The_Detector_Interface">The Detector Interface</a></li> > +<li><a href="#Mime_Magic_Detction">Mime Magic Detction</a></li> > +<li><a href="#Resource_Name_Based_Detection">Resource Name Based > Detection</a></li> > +<li><a href="#Known_Content_Type_Detection">Known Content Type > "Detection</a></li> > +<li><a href="#The_default_Mime_Types_Detector">The default Mime Types > Detector</a></li> > +<li><a href="#Container_Aware_Detection">Container Aware Detection</a></li> > +<li><a href="#The_default_Tika_Detector">The default Tika Detector</a></li> > +<li><a href="#Ways_of_triggering_Detection">Ways of triggering > Detection</a></li> > +<li><a href="#Language_Detection">Language 
Detection</a></li></ul></li></ul> > +<div class="section"> > +<h3><a name="The_Detector_Interface">The Detector Interface</a></h3> > +<p>The <a > href="./api/org/apache/tika/detect/Detector.html">org.apache.tika.detect.Detector</a> > interface is the basis for most of the content type detection in Apache > Tika. All the different ways of detecting content all implement the same > common method:</p> > +<div> > +<pre>MediaType detect(java.io.InputStream input, > + Metadata metadata) throws java.io.IOException</pre></div> > +<p>The <tt>detect</tt> method takes the stream to inspect, and a > <tt>Metadata</tt> object that holds any additional information on the > content. The detector will return a <a > href="./api/org/apache/tika/mime/MediaType.html">MediaType</a> object > describing its best guess as to the type of the file.</p> > +<p>In general, only two keys on the Metadata object are used by Detectors. > These are <tt>Metadata.RESOURCE_NAME_KEY</tt> which should hold the name of > the file (where known), and <tt>Metadata.CONTENT_TYPE</tt> which should hold > the advertised content type of the file (eg from a webserver or a content > repository).</p></div> > +<div class="section"> > +<h3><a name="Mime_Magic_Detction">Mime Magic Detction</a></h3> > +<p>By looking for special ("magic") patterns of bytes near the > start of the file, it is often possible to detect the type of the file. For > some file types, this is a simple process. For others, typically container > based formats, the magic detection may not be enough. (More detail on > detecting container formats below)</p> > +<p>Tika is able to make use of a a mime magic info file, in the <a > class="externalLink" > href="http://www.freedesktop.org/standards/shared-mime-info">Freedesktop > MIME-info</a> format to peform mime magic detection. 
(Note that Tika supports > a few more match types than Freedesktop does)</p> > +<p>This is provided within Tika by <a > href="./api/org/apache/tika/detect/MagicDetector.html">org.apache.tika.detect.MagicDetector</a>. > It is most commonly access via <a > href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.MimeTypes</a>, > normally sourced from the <tt>tika-mimetypes.xml</tt> and > <tt>custom-mimetypes.xml</tt> files. For more information on defining your > own custom mimetypes, see <a > href="./parser_guide.html#Add_your_MIME-Type">the new parser > guide</a>.</p></div> > +<div class="section"> > +<h3><a name="Resource_Name_Based_Detection">Resource Name Based > Detection</a></h3> > +<p>Where the name of the file is known, it is sometimes possible to guess > the file type from the name or extension. Within the > <tt>tika-mimetypes.xml</tt> file is a list of patterns which are used to > identify the type from the filename.</p> > +<p>However, because files may be renamed, this method of detection is quick > but not always as accurate.</p> > +<p>This is provided within Tika by <a > href="./api/org/apache/tika/detect/NameDetector.html">org.apache.tika.detect.NameDetector</a>.</p></div> > +<div class="section"> > +<h3><a name="Known_Content_Type_Detection">Known Content Type > "Detection</a></h3> > +<p>Sometimes, the mime type for a file is already known, such as when > downloading from a webserver, or when retrieving from a content store. This > information can be used by detectors, such as <a > href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.MimeTypes</a>,</p></div> > +<div class="section"> > +<h3><a name="The_default_Mime_Types_Detector">The default Mime Types > Detector</a></h3> > +<p>By default, the mime type detection in Tika is provided by <a > href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.MimeTypes</a>. 
> This detector makes use of <tt>tika-mimetypes.xml</tt> to power magic based > and filename based detection.</p> > +<p>Firstly, magic based detection is used on the start of the file. If the > file is an XML file, then the start of the XML is processed to look for root > elements. Next, if available, the filename (from > <tt>Metadata.RESOURCE_NAME_KEY</tt>) is then used to improve the detail of > the detection, such as when magic detects a text file, and the filename hints > it's really a CSV. Finally, if available, the supplied content type (from > <tt>Metadata.CONTENT_TYPE</tt>) is used to further refine the type.</p></div> > +<div class="section"> > +<h3><a name="Container_Aware_Detection">Container Aware Detection</a></h3> > +<p>Several common file formats are actually held within a common container > format. One example is the PowerPoint .ppt and Word .doc formats, which are > both held within an OLE2 container. Another is Apple iWork formats, which are > actually a series of XML files within a Zip file.</p> > +<p>Using magic detection, it is easy to spot that a given file is an OLE2 > document, or a Zip file. Using magic detection alone, it is very difficult > (and often impossible) to tell what kind of file lives inside the > container.</p> > +<p>For some use cases, speed is important, so having a quick way to know the > container type is sufficient. For other cases however, you don't mind > spending a bit of time (and memory!) processing the container to get a more > accurate answer on its contents. For these cases, the additional container > aware detectors contained in the <tt>Tika Parsers</tt> jar should be used.</p> > +<p>Tika provides a wrapping detector in the form of <a > href="./api/org/apache/tika/detect/DefaultDetector.html">org.apache.tika.detect.DefaultDetector</a>. > This uses the service loader to discover all available detectors, including > any available container aware ones, and tries them in turn. 
For container > aware detection, include the <tt>Tika Parsers</tt> jar and its dependencies > in your project, then use DefaultDetector along with a > <tt>TikaInputStream</tt>.</p> > +<p>Because these container detectors needs to read the whole file to open > and inspect the container, they must be used with a <a > href="./api/org/apache/tika/io/TikaInputStream.html">org.apache.tika.io.TikaInputStream</a>. > If called with a regular <tt>InputStream</tt>, then all work will be done by > the default Mime Magic detection only.</p> > +<p>For more information on container formats and Tika, see <a > class="externalLink" > href="http://wiki.apache.org/tika/MetadataDiscussion"></a></p></div> > +<div class="section"> > +<h3><a name="The_default_Tika_Detector">The default Tika Detector</a></h3> > +<p>Just as with Parsers, Tika provides a special detector <a > href="./api/org/apache/tika/detect/DefaultDetector.html">org.apache.tika.detect.DefaultDetector</a> > which auto-detects (based on service files) the available detectors at > runtime, and tries these in turn to identify the file type.</p> > +<p>If only <tt>Tika Core</tt> is available, the Default Detector will work > only with Mime Magic and Resource Name detection. However, if <tt>Tika > Parsers</tt> (and its dependencies!) are available, additional detectors > which known about containers (such as zip and ole2) will be used as > appropriate, provided that detection is being performed with a <a > href="./api/org/apache/tika/io/TikaInputStream.html">org.apache.tika.io.TikaInputStream</a>. 
> Custom detectors can also be used as desired, they simply need to be listed > in a service file much as is done for <a > href="./parser_guide.html#List_the_new_parser">custom parsers</a>.</p></div> > +<div class="section"> > +<h3><a name="Ways_of_triggering_Detection">Ways of triggering > Detection</a></h3> > +<p>The simplest way to detect is through the <a > href="./api/org/apache/tika/Tika.html">Tika Facade class</a>, which provides > methods to detect based on <a > href="./api/org/apache/tika/Tika.html#detect(java.io.File)">File</a>, <a > href="./api/org/apache/tika/Tika.html#detect(java.io.InputStream)">InputStream</a>, > <a href="./api/org/apache/tika/Tika.html#detect(java.io.InputStream, > java.lang.String)">InputStream and Filename</a>, <a > href="./api/org/apache/tika/Tika.html#detect(java.lang.String)">Filename</a> > or a few others. It works best with a File or <a > href="./api/org/apache/tika/io/TikaInputStream.html">TikaInputStream</a>.</p> > +<p>Alternately, detection can be performed on a specific Detector, or using > <tt>DefaultDetector</tt> to have all available Detectors used. 
A typical > pattern would be something like:</p> > +<div> > +<pre>TikaConfig tika = new TikaConfig(); > + > +for (File f : myListOfFiles) { > + Metadata metadata = new Metadata(); > + metadata.set(Metadata.RESOURCE_NAME_KEY, f.toString()); > + String mimetype = tika.getDetector().detect( > + TikaInputStream.get(f), metadata); > + System.out.println("File " + f + " is " + mimetype); > +} > +for (InputStream is : myListOfStreams) { > + String mimetype = tika.getDetector().detect( > + TikaInputStream.get(is), new Metadata()); > + System.out.println("Stream " + is + " is " + > mimetype); > +}</pre></div></div> > +<div class="section"> > +<h3><a name="Language_Detection">Language Detection</a></h3> > +<p>Tika is able to help identify the language of a piece of text, which is > useful when extracting text from document formats which do not include > language information in their metadata.</p> > +<p>The language detection is provided by <a > href="./api/org/apache/tika/language/LanguageIdentifier.html">org.apache.tika.language.LanguageIdentifier</a></p></div></div> > + </div> > + <div id="sidebar"> > + <div id="navigation"> > + <h5>Apache Tika</h5> > + <ul> > + > + <li class="none"> > + <a href="../index.html">Introduction</a> > + </li> > + > + <li class="none"> > + <a href="../download.html">Download</a> > + </li> > + > + <li class="none"> > + <a href="../contribute.html">Contribute</a> > + </li> > + > + <li class="none"> > + <a href="../mail-lists.html">Mailing Lists</a> > + </li> > + > + <li class="none"> > + <a href="http://wiki.apache.org/tika/" > class="externalLink">Tika Wiki</a> > + </li> > + > + <li class="none"> > + <a href="https://issues.apache.org/jira/browse/TIKA" > class="externalLink">Issue Tracker</a> > + </li> > + </ul> > + <h5>Documentation</h5> > + <ul> > + > + > + > + > + > + > + > + > + > + <li class="expanded"> > + <a href="../1.5/index.html">Apache Tika 1.5</a> > + <ul> > + > + <li class="none"> > + <a href="../1.5/gettingstarted.html">Getting 
Started</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/formats.html">Supported Formats</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/parser.html">Parser API</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/parser_guide.html">Parser 5min Quick > Start Guide</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/detection.html">Content and Language > Detection</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/api/">API Documentation</a> > + </li> > + </ul> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.4/index.html">Apache Tika 1.4</a> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.3/index.html">Apache Tika 1.3</a> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.2/index.html">Apache Tika 1.2</a> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.1/index.html">Apache Tika 1.1</a> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.0/index.html">Apache Tika 1.0</a> > + </li> > + </ul> > + <h5>The Apache Software Foundation</h5> > + <ul> > + > + <li class="none"> > + <a href="http://www.apache.org/foundation/" > class="externalLink">About</a> > + </li> > + > + <li class="none"> > + <a href="http://www.apache.org/licenses/" > class="externalLink">License</a> > + </li> > + > + <li class="none"> > + <a href="http://www.apache.org/security/" > class="externalLink">Security</a> > + </li> > + > + <li class="none"> > + <a > href="http://www.apache.org/foundation/sponsorship.html" > class="externalLink">Sponsorship</a> > + </li> > + > + <li class="none"> > + <a href="http://www.apache.org/foundation/thanks.html" > class="externalLink">Thanks</a> > + </li> > + </ul> > + > + <div id="search"> > + <h5>Search with Apache Solr</h5> > + <form action="http://search.lucidimagination.com/p:tika" > + method="get" id="searchform"> 
> + <input type="text" id="query" name="q"/> > + <select name="searchProvider" id="searchProvider"> > + <option value="any">provider</option> > + <option value="lucid">Lucid Find</option> > + <option value="sl">Search-Lucene</option> > + </select> > + <input type="submit" id="submit" value="Search" name="Search" > + onclick="selectProvider(this.form)"/> > + </form> > + </div> > + > + <div id="bookpromo"> > + <h5>Books about Tika</h5> > + <p> > + <a href="http://manning.com/mattmann/" title="Tika in Action" > + ><img src="../mattmann_cover150.jpg" > + width="150" height="186"/></a> > + </p> > + </div> > + </div> > + </div> > + <div id="footer"> > + <p> > + Copyright © 2014 > + <a href="http://www.apache.org/">The Apache Software > Foundation</a>. > + Site powered by <a href="http://maven.apache.org/">Apache > Maven</a>. > + Search powered by > + <a href="http://www.lucidimagination.com">Lucid Imagination</a> > + and <a href="http://sematext.com">Sematext</a>. > + <br/> > + Apache Tika, Tika, Apache, the Apache feather logo, and the Apache > + Tika project logo are trademarks of The Apache Software Foundation. 
> + </p> > + </div> > + </div> > + </body> > +</html> > > Modified: tika/site/publish/1.6/formats.html > URL: > http://svn.apache.org/viewvc/tika/site/publish/1.6/formats.html?rev=1622762&r1=1622761&r2=1622762&view=diff > ============================================================================== > --- tika/site/publish/1.6/formats.html (original) > +++ tika/site/publish/1.6/formats.html Fri Sep 5 19:14:58 2014 > @@ -110,7 +110,9 @@ > <li><a href="#Mail_formats">Mail formats</a></li> > <li><a href="#CAD_formats">CAD formats</a></li> > <li><a href="#Font_formats">Font formats</a></li> > -<li><a href="#Executable_programs_and_libraries">Executable programs and > libraries</a></li></ul></li></ul> > +<li><a href="#Scientific_formats">Scientific formats</a></li> > +<li><a href="#Executable_programs_and_libraries">Executable programs and > libraries</a></li> > +<li><a href="#Crypto_formats">Crypto formats</a></li></ul></li></ul> > <div class="section"> > <h3><a name="HyperText_Markup_Language">HyperText Markup Language</a></h3> > <p>The HyperText Markup Language (HTML) is the lingua franca of the web. Tika > uses the <a class="externalLink" > href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> library to > support virtually any kind of HTML found on the web. 
The output from the <a > href="./api/org/apache/tika/parser/html/HtmlParser.html">HtmlParser</a> class > is guaranteed to be well-formed and valid XHTML, and various heuristics are > used to prevent things like inline scripts from cluttering the extracted text > content.</p></div> > @@ -131,7 +133,8 @@ > <p>The <a > href="./api/org/apache/tika/parser/pdf/PDFParser.html">PDFParser</a> class > parsers Portable Document Format (PDF) documents using the <a > class="externalLink" href="http://pdfbox.apache.org/">Apache PDFBox</a> > library.</p></div> > <div class="section"> > <h3><a name="Electronic_Publication_Format">Electronic Publication > Format</a></h3> > -<p>The <a > href="./api/org/apache/tika/parser/epub/EpubParser.html">EpubParser</a> class > supports the Electronic Publication Format (EPUB) used for many digital > books.</p></div> > +<p>The <a > href="./api/org/apache/tika/parser/epub/EpubParser.html">EpubParser</a> class > supports the Electronic Publication Format (EPUB) used for many digital > books.</p> > +<p>The <a > href="./api/org/apache/tika/parser/xml/FictionBookParser.html">FictionBookParser</a> > class supports the xml-based Fiction Book publishing format.</p></div> > <div class="section"> > <h3><a name="Rich_Text_Format">Rich Text Format</a></h3> > <p>The <a > href="./api/org/apache/tika/parser/rtf/RTFParser.html">RTFParser</a> class > uses the standard javax.swing.text.rtf feature to extract text content from > Rich Text Format (RTF) documents.</p></div> > @@ -143,7 +146,8 @@ > <p>Extracting text content from plain text files seems like a simple task > until you start thinking of all the possible character encodings. 
The <a > href="./api/org/apache/tika/parser/txt/TXTParser.html">TXTParser</a> class > uses encoding detection code from the <a class="externalLink" > href="http://site.icu-project.org/">ICU</a> project to automatically detect > the character encoding of a text document.</p></div> > <div class="section"> > <h3><a name="Feed_and_Syndication_formats">Feed and Syndication > formats</a></h3> > -<p>The <a > href="./api/org/apache/tika/parser/feed/FeedParser.html">FeedParser</a> class > supports the RSS and Atom feed syndication formats.</p></div> > +<p>The <a > href="./api/org/apache/tika/parser/feed/FeedParser.html">FeedParser</a> class > supports the RSS and Atom feed syndication formats.</p> > +<p>The <a > href="./api/org/apache/tika/parser/iptc/IptcAnpaParser.html">IptcAnpaParser</a> > class supports the IPTC ANPA News Wire feed format.</p></div> > <div class="section"> > <h3><a name="Help_formats">Help formats</a></h3> > <p>The <a > href="./api/org/apache/tika/parser/chm/ChmParser.html">ChmParser</a> class > supports the CHM Help format.</p></div> > @@ -167,6 +171,7 @@ > <div class="section"> > <h3><a name="Mail_formats">Mail formats</a></h3> > <p>The <a > href="./api/org/apache/tika/parser/mbox/MboxParser.html">MboxParser</a> can > extract email messages from the mbox format used by many email archives and > Unix-style mailboxes.</p> > +<p>The <a > href="./api/org/apache/tika/parser/mail/RFC822Parser.html">RFC822Parser</a> > can process single email messages in the RFC 822 format used by many email > clients in their archives / exports.</p> > <p>The <a > href="./api/org/apache/tika/parser/mbox/PSTParser.html">PSDParser</a> can > extract email messages from the Microsoft Outlook PST email format.</p></div> > <div class="section"> > <h3><a name="CAD_formats">CAD formats</a></h3> > @@ -175,8 +180,16 @@ > <h3><a name="Font_formats">Font formats</a></h3> > <p>The <a > href="./api/org/apache/tika/parser/font/TrueTypeParser.html">TrueTypeParser</a> > class can extract 
simple metadata from the TrueType font format. The <a > href="./api/org/apache/tika/parser/font/AdobeFontMetricParser.html">AdobeFontMetricParser</a> > class does something similar for Adobe Font Metrics files.</p></div> > <div class="section"> > +<h3><a name="Scientific_formats">Scientific formats</a></h3> > +<p>The <a > href="./api/org/apache/tika/parser/hdf/HDFParser.html">HDFParser</a> is able > to extract attribute metadata from the HDF scientific file format.</p> > +<p>The <a > href="./api/org/apache/tika/parser/netcdf/NetCDFParser.html">NetCDFParser</a> > is able to extract attribute metadata from the NetCDF scientific file > format.</p> > +<p>The <a > href="./api/org/apache/tika/parser/mat/MatParser.html">MatParser</a> is able > to extract attribute metadata from the Matlab scientific file > format.</p></div> > +<div class="section"> > <h3><a name="Executable_programs_and_libraries">Executable programs and > libraries</a></h3> > -<p>The <a > href="./api/org/apache/tika/parser/executable/ExecutableParser.html">ExecutableParser</a> > can extract metadata information on platforms, architectures and types from > a range of executable formats and libraries, such as Windows Executables and > Linux / BSD programs and libraries.</p></div></div> > +<p>The <a > href="./api/org/apache/tika/parser/executable/ExecutableParser.html">ExecutableParser</a> > can extract metadata information on platforms, architectures and types from > a range of executable formats and libraries, such as Windows Executables and > Linux / BSD programs and libraries.</p></div> > +<div class="section"> > +<h3><a name="Crypto_formats">Crypto formats</a></h3> > +<p>The <a > href="./api/org/apache/tika/parser/crypto/Pkcs7Parser.html">Pkcs7Parser</a> > is able to parse the contents of PKCS7 signed messages, but doesn't include > any information from the outer PKCS7 wrapper.</p></div></div> > <div class="section"> > <h2>Full list of supported formats:<a > 
name="Full_list_of_supported_formats:"></a></h2> > <ul> > @@ -270,6 +283,9 @@ > <li>org.apache.tika.parser.mail.<a > href="./api/org/apache/tika/parser/mail/RFC822Parser">RFC822Parser</a> > <ul> > <li>message/rfc822</li></ul></li> > +<li>org.apache.tika.parser.mat.<a > href="./api/org/apache/tika/parser/mat/MatParser">MatParser</a> > +<ul> > +<li>application/x-matlab-data</li></ul></li> > <li>org.apache.tika.parser.mbox.<a > href="./api/org/apache/tika/parser/mbox/MboxParser">MboxParser</a> > <ul> > <li>application/mbox</li></ul></li> > > Added: tika/site/publish/1.6/gettingstarted.html > URL: > http://svn.apache.org/viewvc/tika/site/publish/1.6/gettingstarted.html?rev=1622762&view=auto > ============================================================================== > --- tika/site/publish/1.6/gettingstarted.html (added) > +++ tika/site/publish/1.6/gettingstarted.html Fri Sep 5 19:14:58 2014 > @@ -0,0 +1,413 @@ > +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" > + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > + > +<!-- > + Licensed to the Apache Software Foundation (ASF) under one > + or more contributor license agreements. See the NOTICE file > + distributed with this work for additional information > + regarding copyright ownership. The ASF licenses this file > + to you under the Apache License, Version 2.0 (the > + "License"); you may not use this file except in compliance > + with the License. You may obtain a copy of the License at > + > + http://www.apache.org/licenses/LICENSE-2.0 > + > + Unless required by applicable law or agreed to in writing, > + software distributed under the License is distributed on an > + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY > + KIND, either express or implied. See the License for the > + specific language governing permissions and limitations > + under the License. 
> +--> > + > + > + > + > + > + > + > +<html xmlns="http://www.w3.org/1999/xhtml"> > + <head> > + <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> > + <title>Apache Tika - Getting Started with Apache Tika</title> > + <style type="text/css" media="all"> > + @import url("../css/site.css"); > + </style> > + <link rel="icon" type="image/png" href="../tikaNoText16.png" /> > + <script type="text/javascript"> > + function selectProvider(form) { > + provider = form.elements['searchProvider'].value; > + if (provider == "any") { > + if (Math.random() > 0.5) { > + provider = "lucid"; > + } else { > + provider = "sl"; > + } > + } > + if (provider == "lucid") { > + form.action = "http://find.searchhub.org/p:tika"; > + } else if (provider == "sl") { > + form.action = "http://search-lucene.com/tika"; > + } > + days = 90; > + date = new Date(); > + date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000)); > + expires = "; expires=" + date.toGMTString(); > + document.cookie = "searchProvider=" + provider + expires + "; > path=/"; > + } > + function initProvider() { > + if (document.cookie.length>0) { > + cStart=document.cookie.indexOf("searchProvider="); > + if (cStart!=-1) { > + cStart=cStart + "searchProvider=".length; > + cEnd=document.cookie.indexOf(";", cStart); > + if (cEnd==-1) { > + cEnd=document.cookie.length; > + } > + provider = unescape(document.cookie.substring(cStart,cEnd)); > + document.forms['searchform'].elements['searchProvider'].value = > provider; > + } > + } > + document.forms['searchform'].elements['q'].focus(); > + } > + </script> > + </head> > + <body onLoad="initProvider();"> > + <div id="body"> > + <div id="banner"> > + <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika" > + ><img src="http://tika.apache.org/tika.png" alt="Apache Tika" > + width="292" height="100"/></a> > + <a href="http://www.apache.org/" id="bannerRight" > + title="The Apache Software Foundation" > + ><img src="http://tika.apache.org/asf-logo.gif" 
alt="The Apache > Software Foundation" > + width="387" height="100"/></a> > + </div> > + <div id="content"> > + <!-- Licensed to the Apache Software Foundation (ASF) under one or > more --><!-- contributor license agreements. See the NOTICE file distributed > with --><!-- this work for additional information regarding copyright > ownership. --><!-- The ASF licenses this file to You under the Apache > License, Version 2.0 --><!-- (the "License"); you may not use this file > except in compliance with --><!-- the License. You may obtain a copy of the > License at --><!-- --><!-- http://www.apache.org/licenses/LICENSE-2.0 > --><!-- --><!-- Unless required by applicable law or agreed to in writing, > software --><!-- distributed under the License is distributed on an "AS IS" > BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express > or implied. --><!-- See the License for the specific language governing > permissions and --><!-- limitations under the License. --><div > class="section"> > +<h2>Getting Started with Apache Tika<a > name="Getting_Started_with_Apache_Tika"></a></h2> > +<p>This document describes how to build Apache Tika from sources and how to > start using Tika in an application.</p></div> > +<div class="section"> > +<h2>Getting and building the sources<a > name="Getting_and_building_the_sources"></a></h2> > +<p>To build Tika from sources you first need to either <a > href="../download.html">download</a> a source release or <a > href="../source-repository.html">checkout</a> the latest sources from version > control.</p> > +<p>Once you have the sources, you can build them using the <a > class="externalLink" href="http://maven.apache.org/">Maven 2</a> build > system. 
Executing the following command in the base directory will build the > sources and install the resulting artifacts in your local Maven > repository.</p> > +<div> > +<pre>mvn install</pre></div> > +<p>See the Maven documentation for more information about the available > build options.</p> > +<p>Note that you need Java 6 or higher to build Tika.</p></div> > +<div class="section"> > +<h2>Build artifacts<a name="Build_artifacts"></a></h2> > +<p>The Tika build consists of a number of components and produces the > following main binaries:</p> > +<dl> > +<dt>tika-core/target/tika-core-*.jar</dt> > +<dd> Tika core library. Contains the core interfaces and classes of Tika, > but none of the parser implementations. Depends only on Java 6.</dd> > +<dt>tika-parsers/target/tika-parsers-*.jar</dt> > +<dd> Tika parsers. Collection of classes that implement the Tika Parser > interface based on various external parser libraries.</dd> > +<dt>tika-app/target/tika-app-*.jar</dt> > +<dd> Tika application. Combines the above components and all the external > parser libraries into a single runnable jar with a GUI and a command line > interface.</dd> > +<dt>tika-bundle/target/tika-bundle-*.jar</dt> > +<dd> Tika bundle. An OSGi bundle that combines tika-parsers with > non-OSGified parser libraries to make them easy to deploy in an OSGi > environment.</dd></dl></div> > +<div class="section"> > +<h2>Using Tika as a Maven dependency<a > name="Using_Tika_as_a_Maven_dependency"></a></h2> > +<p>The core library, tika-core, contains the key interfaces and classes of > Tika and can be used by itself if you don't need the full set of parsers from > the tika-parsers component. 
The tika-core dependency looks like this:</p> > +<div> > +<pre> <dependency> > + <groupId>org.apache.tika</groupId> > + <artifactId>tika-core</artifactId> > + <version>...</version> > + </dependency></pre></div> > +<p>If you want to use Tika to parse documents (instead of simply detecting > document types, etc.), you'll want to depend on tika-parsers instead: </p> > +<div> > +<pre> <dependency> > + <groupId>org.apache.tika</groupId> > + <artifactId>tika-parsers</artifactId> > + <version>...</version> > + </dependency></pre></div> > +<p>Note that adding this dependency will introduce a number of transitive > dependencies to your project, including one on tika-core. You need to make > sure that these dependencies won't conflict with your existing project > dependencies. You can use the following command in the tika-parsers directory > to get a full listing of all the dependencies.</p> > +<div> > +<pre>$ mvn dependency:tree | grep :compile</pre></div></div> > +<div class="section"> > +<h2>Using Tika in an Ant project<a > name="Using_Tika_in_an_Ant_project"></a></h2> > +<p>Unless you use a dependency manager tool like <a class="externalLink" > href="http://ant.apache.org/ivy/">Apache Ivy</a>, the easiest way to use Tika > is to include either the tika-core or the tika-app jar in your classpath, > depending on whether you want just the core functionality or also all the > parser implementations.</p> > +<div> > +<pre><classpath> > + ... <!-- your other classpath entries --> > + > + <!-- either: --> > + <pathelement > location="path/to/tika-core-${tika.version}.jar"/> > + <!-- or: --> > + <pathelement > location="path/to/tika-app-${tika.version}.jar"/> > + > +</classpath></pre></div></div> > +<div class="section"> > +<h2>Using Tika as a command line utility<a > name="Using_Tika_as_a_command_line_utility"></a></h2> > +<p>The Tika application jar (tika-app-*.jar) can be used as a command line > utility for extracting text content and metadata from all sorts of files. 
> This runnable jar contains all the dependencies it needs, so you don't need > to worry about classpath settings to run it.</p> > +<p>The usage instructions are shown below.</p> > +<div> > +<pre>usage: java -jar tika-app.jar [option...] [file|port...] > + > +Options: > + -? or --help Print this usage message > + -v or --verbose Print debug level messages > + -V or --version Print the Apache Tika version number > + > + -g or --gui Start the Apache Tika GUI > + -s or --server Start the Apache Tika server > + -f or --fork Use Fork Mode for out-of-process extraction > + > + -x or --xml Output XHTML content (default) > + -h or --html Output HTML content > + -t or --text Output plain text content > + -T or --text-main Output plain text content (main content only) > + -m or --metadata Output only metadata > + -j or --json Output metadata in JSON > + -y or --xmp Output metadata in XMP > + -l or --language Output only language > + -d or --detect Detect document type > + -eX or --encoding=X Use output encoding X > + -pX or --password=X Use document password X > + -z or --extract Extract all attachements into current directory > + --extract-dir=<dir> Specify target directory for -z > + -r or --pretty-print For XML and XHTML outputs, adds newlines and > + whitespace, for better readability > + > + --create-profile=X > + Create NGram profile, where X is a profile name > + --list-parsers > + List the available document parsers > + --list-parser-details > + List the available document parsers, and their supported mime types > + --list-detectors > + List the available document detectors > + --list-met-models > + List the available metadata models, and their supported keys > + --list-supported-types > + List all known media types and related information > + > +Description: > + Apache Tika will parse the file(s) specified on the > + command line and output the extracted text content > + or metadata to standard output. 
> + > + Instead of a file name you can also specify the URL > + of a document to be parsed. > + > + If no file name or URL is specified (or the special > + name "-" is used), then the standard input stream > + is parsed. If no arguments were given and no input > + data is available, the GUI is started instead. > + > +- GUI mode > + > + Use the "--gui" (or "-g") option to start the > + Apache Tika GUI. You can drag and drop files from > + a normal file explorer to the GUI window to extract > + text content and metadata from the files. > + > +- Server mode > + > + Use the "--server" (or "-s") option to start the > + Apache Tika server. The server will listen to the > + ports you specify as one or more arguments.</pre></div> > +<p>You can also use the jar as a component in a Unix pipeline or as an > external tool in many scripting languages.</p> > +<div> > +<pre># Check if an Internet resource contains a specific keyword > +curl http://.../document.doc \ > + | java -jar tika-app.jar --text \ > + | grep -q keyword</pre></div></div> > + </div> > + <div id="sidebar"> > + <div id="navigation"> > + <h5>Apache Tika</h5> > + <ul> > + > + <li class="none"> > + <a href="../index.html">Introduction</a> > + </li> > + > + <li class="none"> > + <a href="../download.html">Download</a> > + </li> > + > + <li class="none"> > + <a href="../contribute.html">Contribute</a> > + </li> > + > + <li class="none"> > + <a href="../mail-lists.html">Mailing Lists</a> > + </li> > + > + <li class="none"> > + <a href="http://wiki.apache.org/tika/" > class="externalLink">Tika Wiki</a> > + </li> > + > + <li class="none"> > + <a href="https://issues.apache.org/jira/browse/TIKA" > class="externalLink">Issue Tracker</a> > + </li> > + </ul> > + <h5>Documentation</h5> > + <ul> > + > + > + > + > + > + > + > + > + > + <li class="expanded"> > + <a href="../1.5/index.html">Apache Tika 1.5</a> > + <ul> > + > + <li class="none"> > + <a href="../1.5/gettingstarted.html">Getting Started</a> > + </li> > + > + <li 
class="none"> > + <a href="../1.5/formats.html">Supported Formats</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/parser.html">Parser API</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/parser_guide.html">Parser 5min Quick > Start Guide</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/detection.html">Content and Language > Detection</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/api/">API Documentation</a> > + </li> > + </ul> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.4/index.html">Apache Tika 1.4</a> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.3/index.html">Apache Tika 1.3</a> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.2/index.html">Apache Tika 1.2</a> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.1/index.html">Apache Tika 1.1</a> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.0/index.html">Apache Tika 1.0</a> > + </li> > + </ul> > + <h5>The Apache Software Foundation</h5> > + <ul> > + > + <li class="none"> > + <a href="http://www.apache.org/foundation/" > class="externalLink">About</a> > + </li> > + > + <li class="none"> > + <a href="http://www.apache.org/licenses/" > class="externalLink">License</a> > + </li> > + > + <li class="none"> > + <a href="http://www.apache.org/security/" > class="externalLink">Security</a> > + </li> > + > + <li class="none"> > + <a > href="http://www.apache.org/foundation/sponsorship.html" > class="externalLink">Sponsorship</a> > + </li> > + > + <li class="none"> > + <a href="http://www.apache.org/foundation/thanks.html" > class="externalLink">Thanks</a> > + </li> > + </ul> > + > + <div id="search"> > + <h5>Search with Apache Solr</h5> > + <form action="http://search.lucidimagination.com/p:tika" > + method="get" id="searchform"> > + <input type="text" id="query" 
name="q"/> > + <select name="searchProvider" id="searchProvider"> > + <option value="any">provider</option> > + <option value="lucid">Lucid Find</option> > + <option value="sl">Search-Lucene</option> > + </select> > + <input type="submit" id="submit" value="Search" name="Search" > + onclick="selectProvider(this.form)"/> > + </form> > + </div> > + > + <div id="bookpromo"> > + <h5>Books about Tika</h5> > + <p> > + <a href="http://manning.com/mattmann/" title="Tika in Action" > + ><img src="../mattmann_cover150.jpg" > + width="150" height="186"/></a> > + </p> > + </div> > + </div> > + </div> > + <div id="footer"> > + <p> > + Copyright © 2014 > + <a href="http://www.apache.org/">The Apache Software > Foundation</a>. > + Site powered by <a href="http://maven.apache.org/">Apache > Maven</a>. > + Search powered by > + <a href="http://www.lucidimagination.com">Lucid Imagination</a> > + and <a href="http://sematext.com">Sematext</a>. > + <br/> > + Apache Tika, Tika, Apache, the Apache feather logo, and the Apache > + Tika project logo are trademarks of The Apache Software Foundation. > + </p> > + </div> > + </div> > + </body> > +</html> > > Added: tika/site/publish/1.6/parser.html > URL: > http://svn.apache.org/viewvc/tika/site/publish/1.6/parser.html?rev=1622762&view=auto > ============================================================================== > --- tika/site/publish/1.6/parser.html (added) > +++ tika/site/publish/1.6/parser.html Fri Sep 5 19:14:58 2014 > @@ -0,0 +1,372 @@ > +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" > + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > + > +<!-- > + Licensed to the Apache Software Foundation (ASF) under one > + or more contributor license agreements. See the NOTICE file > + distributed with this work for additional information > + regarding copyright ownership. 
The ASF licenses this file > + to you under the Apache License, Version 2.0 (the > + "License"); you may not use this file except in compliance > + with the License. You may obtain a copy of the License at > + > + http://www.apache.org/licenses/LICENSE-2.0 > + > + Unless required by applicable law or agreed to in writing, > + software distributed under the License is distributed on an > + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY > + KIND, either express or implied. See the License for the > + specific language governing permissions and limitations > + under the License. > +--> > + > + > + > + > + > + > + > +<html xmlns="http://www.w3.org/1999/xhtml"> > + <head> > + <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> > + <title>Apache Tika - The Parser interface</title> > + <style type="text/css" media="all"> > + @import url("../css/site.css"); > + </style> > + <link rel="icon" type="image/png" href="../tikaNoText16.png" /> > + <script type="text/javascript"> > + function selectProvider(form) { > + provider = form.elements['searchProvider'].value; > + if (provider == "any") { > + if (Math.random() > 0.5) { > + provider = "lucid"; > + } else { > + provider = "sl"; > + } > + } > + if (provider == "lucid") { > + form.action = "http://find.searchhub.org/p:tika"; > + } else if (provider == "sl") { > + form.action = "http://search-lucene.com/tika"; > + } > + days = 90; > + date = new Date(); > + date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000)); > + expires = "; expires=" + date.toGMTString(); > + document.cookie = "searchProvider=" + provider + expires + "; > path=/"; > + } > + function initProvider() { > + if (document.cookie.length>0) { > + cStart=document.cookie.indexOf("searchProvider="); > + if (cStart!=-1) { > + cStart=cStart + "searchProvider=".length; > + cEnd=document.cookie.indexOf(";", cStart); > + if (cEnd==-1) { > + cEnd=document.cookie.length; > + } > + provider = unescape(document.cookie.substring(cStart,cEnd)); > 
+ document.forms['searchform'].elements['searchProvider'].value = > provider; > + } > + } > + document.forms['searchform'].elements['q'].focus(); > + } > + </script> > + </head> > + <body onLoad="initProvider();"> > + <div id="body"> > + <div id="banner"> > + <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika" > + ><img src="http://tika.apache.org/tika.png" alt="Apache Tika" > + width="292" height="100"/></a> > + <a href="http://www.apache.org/" id="bannerRight" > + title="The Apache Software Foundation" > + ><img src="http://tika.apache.org/asf-logo.gif" alt="The Apache > Software Foundation" > + width="387" height="100"/></a> > + </div> > + <div id="content"> > + <!-- Licensed to the Apache Software Foundation (ASF) under one or > more --><!-- contributor license agreements. See the NOTICE file distributed > with --><!-- this work for additional information regarding copyright > ownership. --><!-- The ASF licenses this file to You under the Apache > License, Version 2.0 --><!-- (the "License"); you may not use this file > except in compliance with --><!-- the License. You may obtain a copy of the > License at --><!-- --><!-- http://www.apache.org/licenses/LICENSE-2.0 > --><!-- --><!-- Unless required by applicable law or agreed to in writing, > software --><!-- distributed under the License is distributed on an "AS IS" > BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express > or implied. --><!-- See the License for the specific language governing > permissions and --><!-- limitations under the License. --><div > class="section"> > +<h2>The Parser interface<a name="The_Parser_interface"></a></h2> > +<p>The <a > href="./api/org/apache/tika/parser/Parser.html">org.apache.tika.parser.Parser</a> > interface is the key concept of Apache Tika. 
It hides the complexity of > different file formats and parsing libraries while providing a simple and > powerful mechanism for client applications to extract structured text content > and metadata from all sorts of documents. All this is achieved with a single > method:</p> > +<div> > +<pre>void parse( > + InputStream stream, ContentHandler handler, Metadata metadata, > + ParseContext context) throws IOException, SAXException, > TikaException;</pre></div> > +<p>The <tt>parse</tt> method takes the document to be parsed and related > metadata as input and outputs the results as XHTML SAX events and extra > metadata. The parse context argument is used to specify context information > (like the current locale) that is not related to any individual document. The > main criteria that led to this design were:</p> > +<dl> > +<dt>Streamed parsing</dt> > +<dd>The interface should require neither the client application nor the > parser implementation to keep the full document content in memory or spooled > to disk. This allows even huge documents to be parsed without excessive > resource requirements.</dd> > +<dt>Structured content</dt> > +<dd>A parser implementation should be able to include structural information > (headings, links, etc.) in the extracted content. A client application can > use this information for example to better judge the relevance of different > parts of the parsed document.</dd> > +<dt>Input metadata</dt> > +<dd>A client application should be able to include metadata like the file > name or declared content type with the document to be parsed. The parser > implementation can use this information to better guide the parsing > process.</dd> > +<dt>Output metadata</dt> > +<dd>A parser implementation should be able to return document metadata in > addition to document content. 
Many document formats contain metadata like the > name of the author that may be useful to client applications.</dd> > +<dt>Context sensitivity</dt> > +<dd>While the default settings and behaviour of Tika parsers should work > well for most use cases, there are still situations where more fine-grained > control over the parsing process is desirable. It should be easy to inject > such context-specific information to the parsing process without breaking the > layers of abstraction.</dd></dl> > +<p>These criteria are reflected in the arguments of the <tt>parse</tt> > method.</p> > +<div class="section"> > +<h3>Document input stream<a name="Document_input_stream"></a></h3> > +<p>The first argument is an <a class="externalLink" > href="http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html">InputStream</a> > for reading the document to be parsed.</p> > +<p>If this document stream can not be read, then parsing stops and the > thrown <a class="externalLink" > href="http://docs.oracle.com/javase/6/docs/api/java/io/IOException.html">IOException</a> > is passed up to the client application. If the stream can be read but not > parsed (for example if the document is corrupted), then the parser throws a > <a > href="./api/org/apache/tika/exception/TikaException.html">TikaException</a>.</p> > +<p>The parser implementation will consume this stream but <i>will not close > it</i>. Closing the stream is the responsibility of the client application > that opened it in the first place. The recommended pattern for using streams > with the <tt>parse</tt> method is:</p> > +<div> > +<pre>InputStream stream = ...; // open the stream > +try { > + parser.parse(stream, ...); // parse the stream > +} finally { > + stream.close(); // close the stream > +}</pre></div> > +<p>Some document formats like the OLE2 Compound Document Format used by > Microsoft Office are best parsed as random access files. 
In such cases the > content of the input stream is automatically spooled to a temporary file that > gets removed once parsed. A future version of Tika may make it possible to > avoid this extra file if the input document is already a file in the local > file system. See <a class="externalLink" > href="https://issues.apache.org/jira/browse/TIKA-153">TIKA-153</a> for the > status of this feature request.</p></div> > +<div class="section"> > +<h3>XHTML SAX events<a name="XHTML_SAX_events"></a></h3> > +<p>The parsed content of the document stream is returned to the client > application as a sequence of XHTML SAX events. XHTML is used to express > structured content of the document and SAX events enable streamed processing. > Note that the XHTML format is used here only to convey structural > information, not to render the documents for browsing!</p> > +<p>The XHTML SAX events produced by the parser implementation are sent to a > <a class="externalLink" > href="http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html">ContentHandler</a> > instance given to the <tt>parse</tt> method. If the content handler > fails to process an event, then parsing stops and the thrown <a > class="externalLink" > href="http://docs.oracle.com/javase/6/docs/api/org/xml/sax/SAXException.html">SAXException</a> > is passed up to the client application.</p> > +<p>The overall structure of the generated event stream is (with indenting > added for clarity):</p> > +<div> > +<pre><html xmlns="http://www.w3.org/1999/xhtml"> > + <head> > + <title>...</title> > + </head> > + <body> > + ... 
> + </body> > +</html></pre></div> > +<p>Parser implementations typically use the <a > href="./apidocs/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> > utility class to generate the XHTML output.</p> > +<p>Dealing with the raw SAX events can be a bit complex, so Apache Tika > comes with a number of utility classes that can be used to process and > convert the event stream to other representations.</p> > +<p>For example, the <a > href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a> > class can be used to extract just the body part of the XHTML output and feed > it either as SAX events to another content handler or as characters to an > output stream, a writer, or simply a string. The following code snippet > parses a document from the standard input stream and outputs the extracted > text content to standard output:</p> > +<div> > +<pre>ContentHandler handler = new BodyContentHandler(System.out); > +parser.parse(System.in, handler, ...);</pre></div> > +<p>Another useful class is <a > href="./api/org/apache/tika/parser/ParsingReader.html">ParsingReader</a> that > uses a background thread to parse the document and returns the extracted text > content as a character stream:</p> > +<div> > +<pre>InputStream stream = ...; // the document to be parsed > +Reader reader = new ParsingReader(parser, stream, ...); > +try { > + ...; // read the document text using the reader > +} finally { > + reader.close(); // the document stream is closed automatically > +}</pre></div></div> > +<div class="section"> > +<h3>Document metadata<a name="Document_metadata"></a></h3> > +<p>The third argument to the <tt>parse</tt> method is used to pass document > metadata both in and out of the parser. 
Document metadata is expressed as an > <a href="./api/org/apache/tika/metadata/Metadata.html">Metadata</a> > object.</p> > +<p>The following are some of the more interesting metadata properties:</p> > +<dl> > +<dt>Metadata.RESOURCE_NAME_KEY</dt> > +<dd>The name of the file or resource that contains the document. > +<p>A client application can set this property to allow the parser to use > file name heuristics to determine the format of the document.</p> > +<p>The parser implementation may set this property if the file format > contains the canonical name of the file (for example the Gzip format has a > slot for the file name).</p></dd> > +<dt>Metadata.CONTENT_TYPE</dt> > +<dd>The declared content type of the document. > +<p>A client application can set this property based on for example a HTTP > Content-Type header. The declared content type may help the parser to > correctly interpret the document.</p> > +<p>The parser implementation sets this property to the content type > according to which the document was parsed.</p></dd> > +<dt>Metadata.TITLE</dt> > +<dd>The title of the document. > +<p>The parser implementation sets this property if the document format > contains an explicit title field.</p></dd> > +<dt>Metadata.AUTHOR</dt> > +<dd>The name of the author of the document. > +<p>The parser implementation sets this property if the document format > contains an explicit author field.</p></dd></dl> > +<p>Note that metadata handling is still being discussed by the Tika > development team, and it is likely that there will be some (backwards > incompatible) changes in metadata handling before Tika 1.0.</p></div> > +<div class="section"> > +<h3>Parse context<a name="Parse_context"></a></h3> > +<p>The final argument to the <tt>parse</tt> method is used to inject > context-specific information to the parsing process. This is useful for > example when dealing with locale-specific date and number formats in > Microsoft Excel spreadsheets. 
Another important use of the parse context is > passing in the delegate parser instance to be used by two-phase parsers like > the <a > href="./api/org/apache/parser/pkg/PackageParser.html">PackageParser</a> > subclasses. Some parser classes allow customization of the parsing process > through strategy objects in the parse context.</p></div> > +<div class="section"> > +<h3>Parser implementations<a name="Parser_implementations"></a></h3> > +<p>Apache Tika comes with a number of parser classes for parsing <a > href="./formats.html">various document formats</a>. You can also extend Tika > with your own parsers, and of course any contributions to Tika are warmly > welcome.</p> > +<p>The goal of Tika is to reuse existing parser libraries like <a > class="externalLink" href="http://www.pdfbox.org/">PDFBox</a> or <a > class="externalLink" href="http://poi.apache.org/">Apache POI</a> as much as > possible, and so most of the parser classes in Tika are adapters to such > external libraries.</p> > +<p>Tika also contains some general purpose parser implementations that are > not targeted at any specific document formats. The most notable of these is > the <a > href="./apidocs/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> > class that encapsulates all Tika functionality into a single parser that can > handle any types of documents. 
This parser will automatically determine the > type of the incoming document based on various heuristics and will then parse > the document accordingly.</p></div></div> > + </div> > + <div id="sidebar"> > + <div id="navigation"> > + <h5>Apache Tika</h5> > + <ul> > + > + <li class="none"> > + <a href="../index.html">Introduction</a> > + </li> > + > + <li class="none"> > + <a href="../download.html">Download</a> > + </li> > + > + <li class="none"> > + <a href="../contribute.html">Contribute</a> > + </li> > + > + <li class="none"> > + <a href="../mail-lists.html">Mailing Lists</a> > + </li> > + > + <li class="none"> > + <a href="http://wiki.apache.org/tika/" > class="externalLink">Tika Wiki</a> > + </li> > + > + <li class="none"> > + <a href="https://issues.apache.org/jira/browse/TIKA" > class="externalLink">Issue Tracker</a> > + </li> > + </ul> > + <h5>Documentation</h5> > + <ul> > + > + > + > + > + > + > + > + > + > + <li class="expanded"> > + <a href="../1.5/index.html">Apache Tika 1.5</a> > + <ul> > + > + <li class="none"> > + <a href="../1.5/gettingstarted.html">Getting Started</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/formats.html">Supported Formats</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/parser.html">Parser API</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/parser_guide.html">Parser 5min Quick > Start Guide</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/detection.html">Content and Language > Detection</a> > + </li> > + > + <li class="none"> > + <a href="../1.5/api/">API Documentation</a> > + </li> > + </ul> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.4/index.html">Apache Tika 1.4</a> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.3/index.html">Apache Tika 1.3</a> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.2/index.html">Apache Tika 1.2</a> > + </li> > + > + > + > 
+ > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.1/index.html">Apache Tika 1.1</a> > + </li> > + > + > + > + > + > + > + > + > + > + <li class="collapsed"> > + <a href="../1.0/index.html">Apache Tika 1.0</a> > + </li> > + </ul> > + <h5>The Apache Software Foundation</h5> > + <ul> > + > + <li class="none"> > + <a href="http://www.apache.org/foundation/" > class="externalLink">About</a> > + </li> > + > + <li class="none"> > + <a href="http://www.apache.org/licenses/" > class="externalLink">License</a> > + </li> > + > + <li class="none"> > + <a href="http://www.apache.org/security/" > class="externalLink">Security</a> > + </li> > + > + <li class="none"> > + <a > href="http://www.apache.org/foundation/sponsorship.html" > class="externalLink">Sponsorship</a> > + </li> > + > + <li class="none"> > + <a href="http://www.apache.org/foundation/thanks.html" > class="externalLink">Thanks</a> > + </li> > + </ul> > + > + <div id="search"> > + <h5>Search with Apache Solr</h5> > + <form action="http://search.lucidimagination.com/p:tika" > + method="get" id="searchform"> > + <input type="text" id="query" name="q"/> > + <select name="searchProvider" id="searchProvider"> > + <option value="any">provider</option> > + <option value="lucid">Lucid Find</option> > + <option value="sl">Search-Lucene</option> > + </select> > + <input type="submit" id="submit" value="Search" name="Search" > + onclick="selectProvider(this.form)"/> > + </form> > + </div> > + > + <div id="bookpromo"> > + <h5>Books about Tika</h5> > + <p> > + <a href="http://manning.com/mattmann/" title="Tika in Action" > + ><img src="../mattmann_cover150.jpg" > + width="150" height="186"/></a> > + </p> > + </div> > + </div> > + </div> > + <div id="footer"> > + <p> > + Copyright © 2014 > + <a href="http://www.apache.org/">The Apache Software > Foundation</a>. > + Site powered by <a href="http://maven.apache.org/">Apache > Maven</a>. 
> + Search powered by > + <a href="http://www.lucidimagination.com">Lucid Imagination</a> > + and <a href="http://sematext.com">Sematext</a>. > + <br/> > + Apache Tika, Tika, Apache, the Apache feather logo, and the Apache > + Tika project logo are trademarks of The Apache Software Foundation. > + </p> > + </div> > + </div> > + </body> > +</html> > > Added: tika/site/publish/1.6/parser_guide.html > URL: > http://svn.apache.org/viewvc/tika/site/publish/1.6/parser_guide.html?rev=1622762&view=auto > ============================================================================== > --- tika/site/publish/1.6/parser_guide.html (added) > +++ tika/site/publish/1.6/parser_guide.html Fri Sep 5 19:14:58 2014 > @@ -0,0 +1,373 @@ > +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" > + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> > + > +<!-- > + Licensed to the Apache Software Foundation (ASF) under one > + or more contributor license agreements. See the NOTICE file > + distributed with this work for additional information > + regarding copyright ownership. The ASF licenses this file > + to you under the Apache License, Version 2.0 (the > + "License"); you may not use this file except in compliance > + with the License. You may obtain a copy of the License at > + > + http://www.apache.org/licenses/LICENSE-2.0 > + > + Unless required by applicable law or agreed to in writing, > + software distributed under the License is distributed on an > + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY > + KIND, either express or implied. See the License for the > + specific language governing permissions and limitations > + under the License. 
> +--> > + > + > + > + > + > + > + > +<html xmlns="http://www.w3.org/1999/xhtml"> > + <head> > + <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> > + <title>Apache Tika - Get Tika parsing up and running in 5 minutes</title> > + <style type="text/css" media="all"> > + @import url("../css/site.css"); > + </style> > + <link rel="icon" type="image/png" href="../tikaNoText16.png" /> > + <script type="text/javascript"> > + function selectProvider(form) { > + provider = form.elements['searchProvider'].value; > + if (provider == "any") { > + if (Math.random() > 0.5) { > + provider = "lucid"; > + } else { > + provider = "sl"; > + } > + } > + if (provider == "lucid") { > + form.action = "http://find.searchhub.org/p:tika"; > + } else if (provider == "sl") { > + form.action = "http://search-lucene.com/tika"; > + } > + days = 90; > + date = new Date(); > + date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000)); > + expires = "; expires=" + date.toGMTString(); > + document.cookie = "searchProvider=" + provider + expires + "; > path=/"; > + } > + function initProvider() { > + if (document.cookie.length>0) { > + cStart=document.cookie.indexOf("searchProvider="); > + if (cStart!=-1) { > + cStart=cStart + "searchProvider=".length; > + cEnd=document.cookie.indexOf(";", cStart); > + if (cEnd==-1) { > + cEnd=document.cookie.length; > + } > + provider = unescape(document.cookie.substring(cStart,cEnd)); > + document.forms['searchform'].elements['searchProvider'].value = > provider; > + } > + } > + document.forms['searchform'].elements['q'].focus(); > + } > + </script> > + </head> > + <body onLoad="initProvider();"> > + <div id="body"> > + <div id="banner"> > + <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika" > + ><img src="http://tika.apache.org/tika.png" alt="Apache Tika" > + width="292" height="100"/></a> > + <a href="http://www.apache.org/" id="bannerRight" > + title="The Apache Software Foundation" > + ><img 
src="http://tika.apache.org/asf-logo.gif" alt="The Apache > Software Foundation" > + width="387" height="100"/></a> > + </div> > + <div id="content"> > + <div > class="section"> > +<h2>Get Tika parsing up and running in 5 minutes<a > name="Get_Tika_parsing_up_and_running_in_5_minutes"></a></h2> > +<p>This page is a quick start guide showing how to add a new parser to > Apache Tika. 
Following the simple steps listed below, your new parser can be > running in only 5 minutes.</p> > +<ul> > +<li><a href="#Get_Tika_parsing_up_and_running_in_5_minutes">Get Tika parsing > up and running in 5 minutes</a> > +<ul> > +<li><a href="#Getting_Started">Getting Started</a></li> > +<li><a href="#Add_your_MIME-Type">Add your MIME-Type</a></li> > +<li><a href="#Create_your_Parser_class">Create your Parser class</a></li> > +<li><a href="#List_the_new_parser">List the new > parser</a></li></ul></li></ul> > +<div class="section"> > +<h3><a name="Getting_Started">Getting Started</a></h3> > +<p>The <a href="./gettingstarted.html">Getting Started</a> document > describes how to build Apache Tika from sources and how to start using Tika > in an application. Pay close attention to, and follow, the instructions in the > "Getting and building the sources" section.</p></div> > +<div class="section"> > +<h3><a name="Add_your_MIME-Type">Add your MIME-Type</a></h3> > +<p>Tika loads the core, standard MIME-Types from the file > "org/apache/tika/mime/tika-mimetypes.xml", which comes from <a > class="externalLink" > href="http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml">tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml</a>. > If your new MIME-Type is a standard one that is missing from Tika, submit > a patch for this file!</p> > +<p>If your MIME-Type is not a standard one and needs adding, create a new file > "org/apache/tika/mime/custom-mimetypes.xml" in your codebase. > Add something like this to it:</p> > +<div> > +<pre> <?xml version="1.0" encoding="UTF-8"?> > + <mime-info> > + <mime-type type="application/hello"> > + <glob pattern="*.hi"/> > + </mime-type> > + </mime-info></pre></div></div> > +<div class="section"> > +<h3><a name="Create_your_Parser_class">Create your Parser class</a></h3> > +<p>Now, you need to create your new parser. This is a class that must > implement the Parser interface offered by Tika. 
Instead of implementing the > Parser interface directly, it is recommended that you extend the abstract > class AbstractParser if possible. AbstractParser handles translating between > API changes for you.</p> > +<p>A very simple Tika Parser looks like this:</p> > +<div> > +<pre>/* > + * Licensed to the Apache Software Foundation (ASF) under one or more > + * contributor license agreements. See the NOTICE file distributed with > + * this work for additional information regarding copyright ownership. > + * The ASF licenses this file to You under the Apache License, Version 2.0 > + * (the "License"); you may not use this file except in compliance > with > + * the License. You may obtain a copy of the License at > + * > + * http://www.apache.org/licenses/LICENSE-2.0 > + * > + * Unless required by applicable law or agreed to in writing, software > + * distributed under the License is distributed on an "AS IS" > BASIS, > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > + * See the License for the specific language governing permissions and > + * limitations under the License. 
> + * > + * @Author: Arturo Beltran > + */ > +package org.apache.tika.parser.hello; > + > +import java.io.IOException; > +import java.io.InputStream; > +import java.util.Collections; > +import java.util.Set; > + > +import org.apache.tika.exception.TikaException; > +import org.apache.tika.metadata.Metadata; > +import org.apache.tika.mime.MediaType; > +import org.apache.tika.parser.ParseContext; > +import org.apache.tika.parser.AbstractParser; > +import org.apache.tika.sax.XHTMLContentHandler; > +import org.xml.sax.ContentHandler; > +import org.xml.sax.SAXException; > + > +public class HelloParser extends AbstractParser { > + > + private static final Set<MediaType> SUPPORTED_TYPES = > Collections.singleton(MediaType.application("hello")); > + public static final String HELLO_MIME_TYPE = > "application/hello"; > + > + public Set<MediaType> getSupportedTypes(ParseContext context) { > + return SUPPORTED_TYPES; > + } > + > + public void parse( > + InputStream stream, ContentHandler handler, > + Metadata metadata, ParseContext context) > + throws IOException, SAXException, TikaException { > + > + metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE); > + metadata.set("Hello", "World"); > + > + XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, > metadata); > + xhtml.startDocument(); > + xhtml.endDocument(); > + } > +}</pre></div> > +<p>Pay special attention to the definition of the SUPPORTED_TYPES static > class field in the parser class, which defines what MIME-Types it supports. If > your MIME-Types aren't standard ones, ensure you have listed them in a > "custom-mimetypes.xml" file so that Tika knows about them (see > above).</p> > +<p>The "parse" method is where you will do all your work: > extract the information from the resource and then set the > metadata.</p></div> > +<div class="section"> > +<h3><a name="List_the_new_parser">List the new parser</a></h3> > +<p>Finally, you should explicitly tell the AutoDetectParser to include your > new parser. 
This step is only needed if you want to use the AutoDetectParser > functionality. If you figure out the correct parser in a different way, it > isn't needed. </p> > +<p>List your new parser in: <a class="externalLink" > href="http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser">tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser</a></p></div></div> > + </div> > + </div> > + </body> > +</html> > >