OK, was able to merge these in - didn't stomp over what I was doing. Thanks Nick, no worries!
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Mattmann>, Chris Mattmann <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, September 5, 2014 12:19 PM
To: "<devtika.apache.org>" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: svn commit: r1622762 [1/2] - in /tika/site/publish: 1.4/gettingstarted.html 1.5/gettingstarted.html 1.6/detection.html 1.6/formats.html 1.6/gettingstarted.html 1.6/parser.html 1.6/parser_guide.html 1.7/ 1.7/examples.html 1.7/formats.html index.html

>Nick I'm working on this can u hold off?
> >Sent from my iPhone > >> On Sep 5, 2014, at 12:15 PM, "[email protected]" <[email protected]> wrote: >> >> Author: nick >> Date: Fri Sep 5 19:14:58 2014 >> New Revision: 1622762 >> >> URL: http://svn.apache.org/r1622762 >> Log: >> Republish the site >> >> Added: >> tika/site/publish/1.6/detection.html >> tika/site/publish/1.6/gettingstarted.html >> tika/site/publish/1.6/parser.html >> tika/site/publish/1.6/parser_guide.html >> tika/site/publish/1.7/ >> tika/site/publish/1.7/examples.html >> tika/site/publish/1.7/formats.html >> Modified: >> tika/site/publish/1.4/gettingstarted.html >> tika/site/publish/1.5/gettingstarted.html >> tika/site/publish/1.6/formats.html >> tika/site/publish/index.html >> >> Modified: tika/site/publish/1.4/gettingstarted.html >> URL: >>http://svn.apache.org/viewvc/tika/site/publish/1.4/gettingstarted.html?re >>v=1622762&r1=1622761&r2=1622762&view=diff >> >>========================================================================= >>===== >> --- tika/site/publish/1.4/gettingstarted.html (original) >> +++ tika/site/publish/1.4/gettingstarted.html Fri Sep 5 19:14:58 2014 >> @@ -94,13 +94,13 @@ >> <div> >> <pre>mvn install</pre></div> >> <p>See the Maven documentation for more information about the available >>build options.</p> >> -<p>Note that you need Java 5 or higher to build Tika.</p></div> >> +<p>Note that you need Java 6 or higher to build Tika.</p></div> >> <div class="section"> >> <h2>Build artifacts<a name="Build_artifacts"></a></h2> >> <p>The Tika build consists of a number of components and produces the >>following main binaries:</p> >> <dl> >> <dt>tika-core/target/tika-core-*.jar</dt> >> -<dd> Tika core library. Contains the core interfaces and classes of >>Tika, but none of the parser implementations. Depends only on Java >>5.</dd> >> +<dd> Tika core library. Contains the core interfaces and classes of >>Tika, but none of the parser implementations. 
Depends only on Java >>6.</dd> >> <dt>tika-parsers/target/tika-parsers-*.jar</dt> >> <dd> Tika parsers. Collection of classes that implement the Tika Parser >>interface based on various external parser libraries.</dd> >> <dt>tika-app/target/tika-app-*.jar</dt> >> >> Modified: tika/site/publish/1.5/gettingstarted.html >> URL: >>http://svn.apache.org/viewvc/tika/site/publish/1.5/gettingstarted.html?re >>v=1622762&r1=1622761&r2=1622762&view=diff >> >>========================================================================= >>===== >> --- tika/site/publish/1.5/gettingstarted.html (original) >> +++ tika/site/publish/1.5/gettingstarted.html Fri Sep 5 19:14:58 2014 >> @@ -94,13 +94,13 @@ >> <div> >> <pre>mvn install</pre></div> >> <p>See the Maven documentation for more information about the available >>build options.</p> >> -<p>Note that you need Java 5 or higher to build Tika.</p></div> >> +<p>Note that you need Java 6 or higher to build Tika.</p></div> >> <div class="section"> >> <h2>Build artifacts<a name="Build_artifacts"></a></h2> >> <p>The Tika build consists of a number of components and produces the >>following main binaries:</p> >> <dl> >> <dt>tika-core/target/tika-core-*.jar</dt> >> -<dd> Tika core library. Contains the core interfaces and classes of >>Tika, but none of the parser implementations. Depends only on Java >>5.</dd> >> +<dd> Tika core library. Contains the core interfaces and classes of >>Tika, but none of the parser implementations. Depends only on Java >>6.</dd> >> <dt>tika-parsers/target/tika-parsers-*.jar</dt> >> <dd> Tika parsers. 
Collection of classes that implement the Tika Parser >>interface based on various external parser libraries.</dd> >> <dt>tika-app/target/tika-app-*.jar</dt> >> >> Added: tika/site/publish/1.6/detection.html >> URL: >>http://svn.apache.org/viewvc/tika/site/publish/1.6/detection.html?rev=162 >>2762&view=auto >> >>========================================================================= >>===== >> --- tika/site/publish/1.6/detection.html (added) >> +++ tika/site/publish/1.6/detection.html Fri Sep 5 19:14:58 2014 >> @@ -0,0 +1,357 @@ >> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" >> + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> >> + >> +<!-- >> + Licensed to the Apache Software Foundation (ASF) under one >> + or more contributor license agreements. See the NOTICE file >> + distributed with this work for additional information >> + regarding copyright ownership. The ASF licenses this file >> + to you under the Apache License, Version 2.0 (the >> + "License"); you may not use this file except in compliance >> + with the License. You may obtain a copy of the License at >> + >> + http://www.apache.org/licenses/LICENSE-2.0 >> + >> + Unless required by applicable law or agreed to in writing, >> + software distributed under the License is distributed on an >> + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY >> + KIND, either express or implied. See the License for the >> + specific language governing permissions and limitations >> + under the License. 
>> +--> >> + >> + >> + >> + >> + >> + >> +<html xmlns="http://www.w3.org/1999/xhtml"> >> + <head> >> + <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >>/> >> + <title>Apache Tika - Content Detection</title> >> + <style type="text/css" media="all"> >> + @import url("../css/site.css"); >> + </style> >> + <link rel="icon" type="image/png" href="../tikaNoText16.png" /> >> + <script type="text/javascript"> >> + function selectProvider(form) { >> + provider = form.elements['searchProvider'].value; >> + if (provider == "any") { >> + if (Math.random() > 0.5) { >> + provider = "lucid"; >> + } else { >> + provider = "sl"; >> + } >> + } >> + if (provider == "lucid") { >> + form.action = "http://find.searchhub.org/p:tika"; >> + } else if (provider == "sl") { >> + form.action = "http://search-lucene.com/tika"; >> + } >> + days = 90; >> + date = new Date(); >> + date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000)); >> + expires = "; expires=" + date.toGMTString(); >> + document.cookie = "searchProvider=" + provider + expires + "; >>path=/"; >> + } >> + function initProvider() { >> + if (document.cookie.length>0) { >> + cStart=document.cookie.indexOf("searchProvider="); >> + if (cStart!=-1) { >> + cStart=cStart + "searchProvider=".length; >> + cEnd=document.cookie.indexOf(";", cStart); >> + if (cEnd==-1) { >> + cEnd=document.cookie.length; >> + } >> + provider = >>unescape(document.cookie.substring(cStart,cEnd)); >> + >>document.forms['searchform'].elements['searchProvider'].value = provider; >> + } >> + } >> + document.forms['searchform'].elements['q'].focus(); >> + } >> + </script> >> + </head> >> + <body onLoad="initProvider();"> >> + <div id="body"> >> + <div id="banner"> >> + <a href="http://tika.apache.org" id="bannerLeft" title="Apache >>Tika" >> + ><img src="http://tika.apache.org/tika.png" alt="Apache Tika" >> + width="292" height="100"/></a> >> + <a href="http://www.apache.org/" id="bannerRight" >> + title="The Apache Software Foundation" >> + ><img src="http://tika.apache.org/asf-logo.gif" alt="The >>Apache Software 
Foundation" >> + width="387" height="100"/></a> >> + </div> >> + <div id="content"> >> + <!-- Licensed to the Apache Software Foundation (ASF) under >>one or more --><!-- contributor license agreements. See the NOTICE file >>distributed with --><!-- this work for additional information regarding >>copyright ownership. --><!-- The ASF licenses this file to You under the >>Apache License, Version 2.0 --><!-- (the "License"); you may not use >>this file except in compliance with --><!-- the License. You may obtain >>a copy of the License at --><!-- --><!-- >>http://www.apache.org/licenses/LICENSE-2.0 --><!-- --><!-- Unless >>required by applicable law or agreed to in writing, software --><!-- >>distributed under the License is distributed on an "AS IS" BASIS, >>--><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >>implied. --><!-- See the License for the specific language governing >>permissions and --><!-- limitations under the License. --><div >>class="section"> >> +<h2>Content Detection<a name="Content_Detection"></a></h2> >> +<p>This page gives you information on how content and language >>detection works with Apache Tika, and how to tune the behaviour of >>Tika.</p> >> +<ul> >> +<li><a href="#Content_Detection">Content Detection</a> >> +<ul> >> +<li><a href="#The_Detector_Interface">The Detector Interface</a></li> >> +<li><a href="#Mime_Magic_Detction">Mime Magic Detection</a></li> >> +<li><a href="#Resource_Name_Based_Detection">Resource Name Based >>Detection</a></li> >> +<li><a href="#Known_Content_Type_Detection">Known Content Type >>Detection</a></li> >> +<li><a href="#The_default_Mime_Types_Detector">The default Mime Types >>Detector</a></li> >> +<li><a href="#Container_Aware_Detection">Container Aware >>Detection</a></li> >> +<li><a href="#The_default_Tika_Detector">The default Tika >>Detector</a></li> >> +<li><a href="#Ways_of_triggering_Detection">Ways of triggering >>Detection</a></li> >> +<li><a href="#Language_Detection">Language 
>>Detection</a></li></ul></li></ul> >> +<div class="section"> >> +<h3><a name="The_Detector_Interface">The Detector Interface</a></h3> >> +<p>The <a >>href="./api/org/apache/tika/detect/Detector.html">org.apache.tika.detect. >>Detector</a> interface is the basis for most of the content type >>detection in Apache Tika. The different ways of detecting content >>all implement the same common method:</p> >> +<div> >> +<pre>MediaType detect(java.io.InputStream input, >> + Metadata metadata) throws >>java.io.IOException</pre></div> >> +<p>The <tt>detect</tt> method takes the stream to inspect, and a >><tt>Metadata</tt> object that holds any additional information on the >>content. The detector will return a <a >>href="./api/org/apache/tika/mime/MediaType.html">MediaType</a> object >>describing its best guess as to the type of the file.</p> >> +<p>In general, only two keys on the Metadata object are used by >>Detectors. These are <tt>Metadata.RESOURCE_NAME_KEY</tt> which should >>hold the name of the file (where known), and >><tt>Metadata.CONTENT_TYPE</tt> which should hold the advertised content >>type of the file (eg from a webserver or a content repository).</p></div> >> +<div class="section"> >> +<h3><a name="Mime_Magic_Detction">Mime Magic Detection</a></h3> >> +<p>By looking for special ("magic") patterns of bytes near >>the start of the file, it is often possible to detect the type of the >>file. For some file types, this is a simple process. For others, >>typically container based formats, the magic detection may not be >>enough. (More detail on detecting container formats below)</p> >> +<p>Tika is able to make use of a mime magic info file, in the <a >>class="externalLink" >>href="http://www.freedesktop.org/standards/shared-mime-info">Freedesktop >>MIME-info</a> format to perform mime magic detection. 
(Note that Tika >>supports a few more match types than Freedesktop does)</p> >> +<p>This is provided within Tika by <a >>href="./api/org/apache/tika/detect/MagicDetector.html">org.apache.tika.de >>tect.MagicDetector</a>. It is most commonly accessed via <a >>href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.Mim >>eTypes</a>, normally sourced from the <tt>tika-mimetypes.xml</tt> and >><tt>custom-mimetypes.xml</tt> files. For more information on defining >>your own custom mimetypes, see <a >>href="./parser_guide.html#Add_your_MIME-Type">the new parser >>guide</a>.</p></div> >> +<div class="section"> >> +<h3><a name="Resource_Name_Based_Detection">Resource Name Based >>Detection</a></h3> >> +<p>Where the name of the file is known, it is sometimes possible to >>guess the file type from the name or extension. Within the >><tt>tika-mimetypes.xml</tt> file is a list of patterns which are used to >>identify the type from the filename.</p> >> +<p>However, because files may be renamed, this method of detection is >>quick but not always accurate.</p> >> +<p>This is provided within Tika by <a >>href="./api/org/apache/tika/detect/NameDetector.html">org.apache.tika.det >>ect.NameDetector</a>.</p></div> >> +<div class="section"> >> +<h3><a name="Known_Content_Type_Detection">Known Content Type >>Detection</a></h3> >> +<p>Sometimes, the mime type for a file is already known, such as when >>downloading from a webserver, or when retrieving from a content store. >>This information can be used by detectors, such as <a >>href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.Mim >>eTypes</a>.</p></div> >> +<div class="section"> >> +<h3><a name="The_default_Mime_Types_Detector">The default Mime Types >>Detector</a></h3> >> +<p>By default, the mime type detection in Tika is provided by <a >>href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.Mim >>eTypes</a>. 
This detector makes use of <tt>tika-mimetypes.xml</tt> to >>power magic based and filename based detection.</p> >> +<p>Firstly, magic based detection is used on the start of the file. If >>the file is an XML file, then the start of the XML is processed to look >>for root elements. Next, if available, the filename (from >><tt>Metadata.RESOURCE_NAME_KEY</tt>) is then used to improve the detail >>of the detection, such as when magic detects a text file, and the >>filename hints it's really a CSV. Finally, if available, the supplied >>content type (from <tt>Metadata.CONTENT_TYPE</tt>) is used to further >>refine the type.</p></div> >> +<div class="section"> >> +<h3><a name="Container_Aware_Detection">Container Aware >>Detection</a></h3> >> +<p>Several common file formats are actually held within a common >>container format. One example is the PowerPoint .ppt and Word .doc >>formats, which are both held within an OLE2 container. Another is Apple >>iWork formats, which are actually a series of XML files within a Zip >>file.</p> >> +<p>Using magic detection, it is easy to spot that a given file is an >>OLE2 document, or a Zip file. Using magic detection alone, it is >>very difficult (and often impossible) to tell what kind of file lives inside >>the container.</p> >> +<p>For some use cases, speed is important, so having a quick way to >>know the container type is sufficient. For other cases however, you >>don't mind spending a bit of time (and memory!) processing the container >>to get a more accurate answer on its contents. For these cases, the >>additional container aware detectors contained in the <tt>Tika >>Parsers</tt> jar should be used.</p> >> +<p>Tika provides a wrapping detector in the form of <a >>href="./api/org/apache/tika/detect/DefaultDetector.html">org.apache.tika. >>detect.DefaultDetector</a>. This uses the service loader to discover all >>available detectors, including any available container aware ones, and >>tries them in turn. 
For container aware detection, include the <tt>Tika >>Parsers</tt> jar and its dependencies in your project, then use >>DefaultDetector along with a <tt>TikaInputStream</tt>.</p> >> +<p>Because these container detectors need to read the whole file to >>open and inspect the container, they must be used with a <a >>href="./api/org/apache/tika/io/TikaInputStream.html">org.apache.tika.io.T >>ikaInputStream</a>. If called with a regular <tt>InputStream</tt>, then >>all work will be done by the default Mime Magic detection only.</p> >> +<p>For more information on container formats and Tika, see <a >>class="externalLink" >>href="http://wiki.apache.org/tika/MetadataDiscussion"></a></p></div> >> +<div class="section"> >> +<h3><a name="The_default_Tika_Detector">The default Tika >>Detector</a></h3> >> +<p>Just as with Parsers, Tika provides a special detector <a >>href="./api/org/apache/tika/detect/DefaultDetector.html">org.apache.tika. >>detect.DefaultDetector</a> which auto-detects (based on service files) >>the available detectors at runtime, and tries these in turn to identify >>the file type.</p> >> +<p>If only <tt>Tika Core</tt> is available, the Default Detector will >>work only with Mime Magic and Resource Name detection. However, if >><tt>Tika Parsers</tt> (and its dependencies!) are available, additional >>detectors which know about containers (such as zip and ole2) will be >>used as appropriate, provided that detection is being performed with a >><a >>href="./api/org/apache/tika/io/TikaInputStream.html">org.apache.tika.io.T >>ikaInputStream</a>. 
Custom detectors can also be used as desired; they >>simply need to be listed in a service file much as is done for <a >>href="./parser_guide.html#List_the_new_parser">custom >>parsers</a>.</p></div> >> +<div class="section"> >> +<h3><a name="Ways_of_triggering_Detection">Ways of triggering >>Detection</a></h3> >> +<p>The simplest way to detect is through the <a >>href="./api/org/apache/tika/Tika.html">Tika Facade class</a>, which >>provides methods to detect based on <a >>href="./api/org/apache/tika/Tika.html#detect(java.io.File)">File</a>, <a >>href="./api/org/apache/tika/Tika.html#detect(java.io.InputStream)">InputS >>tream</a>, <a >>href="./api/org/apache/tika/Tika.html#detect(java.io.InputStream, >>java.lang.String)">InputStream and Filename</a>, <a >>href="./api/org/apache/tika/Tika.html#detect(java.lang.String)">Filename< >>/a> or a few others. It works best with a File or <a >>href="./api/org/apache/tika/io/TikaInputStream.html">TikaInputStream</a>. >></p> >> +<p>Alternately, detection can be performed on a specific Detector, or >>using <tt>DefaultDetector</tt> to have all available Detectors used. A 
>>typical pattern would be something like:</p> >> +<div> >> +<pre>TikaConfig tika = new TikaConfig(); >> + >> +for(File f : myListOfFiles) { >> + Metadata metadata = new Metadata(); >> + metadata.set(Metadata.RESOURCE_NAME_KEY, f.toString()); >> + MediaType mimetype = tika.getDetector().detect( >> + TikaInputStream.get(f), metadata); >> + System.out.println("File " + f + " is " + >>mimetype); >> +} >> +for (InputStream is : myListOfStreams) { >> + MediaType mimetype = tika.getDetector().detect( >> + TikaInputStream.get(is), new Metadata()); >> + System.out.println("Stream " + is + " is " + >>mimetype); >> +}</pre></div></div> >> +<div class="section"> >> +<h3><a name="Language_Detection">Language Detection</a></h3> >> +<p>Tika is able to help identify the language of a piece of text, >>which is useful when extracting text from document formats which do not >>include language information in their metadata.</p> >> +<p>The language detection is provided by <a >>href="./api/org/apache/tika/language/LanguageIdentifier.html">org.apache. 
>>tika.language.LanguageIdentifier</a></p></div></div> >> + </div> >> + <div id="sidebar"> >> + <div id="navigation"> >> + <h5>Apache Tika</h5> >> + <ul> >> + >> + <li class="none"> >> + <a href="../index.html">Introduction</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../download.html">Download</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../contribute.html">Contribute</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../mail-lists.html">Mailing Lists</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://wiki.apache.org/tika/" >>class="externalLink">Tika Wiki</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="https://issues.apache.org/jira/browse/TIKA" >>class="externalLink">Issue Tracker</a> >> + </li> >> + </ul> >> + <h5>Documentation</h5> >> + <ul> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="expanded"> >> + <a href="../1.5/index.html">Apache Tika 1.5</a> >> + <ul> >> + >> + <li class="none"> >> + <a href="../1.5/gettingstarted.html">Getting >>Started</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/formats.html">Supported Formats</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/parser.html">Parser API</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/parser_guide.html">Parser 5min >>Quick Start Guide</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/detection.html">Content and >>Language Detection</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/api/">API Documentation</a> >> + </li> >> + </ul> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.4/index.html">Apache Tika 1.4</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.3/index.html">Apache Tika 1.3</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.2/index.html">Apache Tika 1.2</a> >> + </li> >> + >> + >> + >> + >> + 
>> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.1/index.html">Apache Tika 1.1</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.0/index.html">Apache Tika 1.0</a> >> + </li> >> + </ul> >> + <h5>The Apache Software Foundation</h5> >> + <ul> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/foundation/" >>class="externalLink">About</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/licenses/" >>class="externalLink">License</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/security/" >>class="externalLink">Security</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="http://www.apache.org/foundation/sponsorship.html" >>class="externalLink">Sponsorship</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="http://www.apache.org/foundation/thanks.html" >>class="externalLink">Thanks</a> >> + </li> >> + </ul> >> + >> + <div id="search"> >> + <h5>Search with Apache Solr</h5> >> + <form action="http://search.lucidimagination.com/p:tika" >> + method="get" id="searchform"> >> + <input type="text" id="query" name="q"/> >> + <select name="searchProvider" id="searchProvider"> >> + <option value="any">provider</option> >> + <option value="lucid">Lucid Find</option> >> + <option value="sl">Search-Lucene</option> >> + </select> >> + <input type="submit" id="submit" value="Search" >>name="Search" >> + onclick="selectProvider(this.form)"/> >> + </form> >> + </div> >> + >> + <div id="bookpromo"> >> + <h5>Books about Tika</h5> >> + <p> >> + <a href="http://manning.com/mattmann/" title="Tika in >>Action" >> + ><img src="../mattmann_cover150.jpg" >> + width="150" height="186"/></a> >> + </p> >> + </div> >> + </div> >> + </div> >> + <div id="footer"> >> + <p> >> + Copyright © 2014 >> + <a href="http://www.apache.org/">The Apache Software >>Foundation</a>. 
>> + Site powered by <a href="http://maven.apache.org/">Apache >>Maven</a>. >> + Search powered by >> + <a href="http://www.lucidimagination.com">Lucid >>Imagination</a> >> + and <a href="http://sematext.com">Sematext</a>. >> + <br/> >> + Apache Tika, Tika, Apache, the Apache feather logo and the >>Apache Tika project logo are trademarks of The Apache Software >>Foundation. >> + </p> >> + </div> >> + </div> >> + </body> >> +</html> >> >> Modified: tika/site/publish/1.6/formats.html >> URL: >>http://svn.apache.org/viewvc/tika/site/publish/1.6/formats.html?rev=16227 >>62&r1=1622761&r2=1622762&view=diff >> >>========================================================================= >>===== >> --- tika/site/publish/1.6/formats.html (original) >> +++ tika/site/publish/1.6/formats.html Fri Sep 5 19:14:58 2014 >> @@ -110,7 +110,9 @@ >> <li><a href="#Mail_formats">Mail formats</a></li> >> <li><a href="#CAD_formats">CAD formats</a></li> >> <li><a href="#Font_formats">Font formats</a></li> >> -<li><a href="#Executable_programs_and_libraries">Executable programs >>and libraries</a></li></ul></li></ul> >> +<li><a href="#Scientific_formats">Scientific formats</a></li> >> +<li><a href="#Executable_programs_and_libraries">Executable programs >>and libraries</a></li> >> +<li><a href="#Crypto_formats">Crypto formats</a></li></ul></li></ul> >> <div class="section"> >> <h3><a name="HyperText_Markup_Language">HyperText Markup >>Language</a></h3> >> <p>The HyperText Markup Language (HTML) is the lingua franca of the >>web. Tika uses the <a class="externalLink" >>href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> library to >>support virtually any kind of HTML found on the web. 
The output from the >><a >>href="./api/org/apache/tika/parser/html/HtmlParser.html">HtmlParser</a> >>class is guaranteed to be well-formed and valid XHTML, and various >>heuristics are used to prevent things like inline scripts from >>cluttering the extracted text content.</p></div> >> @@ -131,7 +133,8 @@ >> <p>The <a >>href="./api/org/apache/tika/parser/pdf/PDFParser.html">PDFParser</a> >>class parsers Portable Document Format (PDF) documents using the <a >>class"externalLink" href="http://pdfbox.apache.org/">Apache PDFBox</a> >>library.</p></div> >> <div class="section"> >> <h3><a name="Electronic_Publication_Format">Electronic Publication >>Format</a></h3> >> -<p>The <a >>href="./api/org/apache/tika/parser/epub/EpubParser.html">EpubParser</a> >>class spports the Electronic Publication Format (EPUB) used for many >>digital books.</p></div> >> +<p>The <a >>href="./api/org/apache/tika/parser/epub/EpubParser.html">EpubParser</a> >>class supports the Electronic Publication Format (EPUB) used for many >>digital books.</p> >> +<p>The <a >>href="./api/org/apache/tika/parser/xml/FictionBookParser.html>FictionBoo >>kParser</a> class supports the xml-based Fiction Book publishing >>format.</></div> >> <div class="section"> >> <h3><a name="Rich_Text_Format">Rich Text Format</a></h3> >> <p>The <a >href="./api/org/apache/tika/parser/rtf/RTFParser.html">RTFParser</a> >>class uses the standard javax.swing.text.rtf feature to extract text >>content from Rich Text Format (RF) documents.</p></div> >> @@ -143,7 +146,8 @@ >> <p>Extracting text content frm plain text files seems like a simple >>task until you start thinking of all the possible character encodings. 
>>The <a >>href="./api/org/apache/tika/parser/txt/TXTParser.html">TXTParser</a> >>class uses encoding detection code from the <a class="externalLink" >>href="http://site.icu-project.org/">ICU</a> project to automatically >>detect the character encoding of a text document.</p></div> >> <div class="section"> >> <h3><a name="Feed_and_Syndication_formats">Feed and Syndication >>formats</a></h3> >> -<p>The <a >>href="./api/org/apache/tika/parser/feed/FeedParser.html">FeedParser</a> >>class supports the RSS and Atom feed syndication formats.</p></div> >> +<p>The <a >>href="./api/org/apache/tika/parser/feed/FeedParser.html">FeedParser</a> >>class supports the RSS and Atom feed syndication formats.</p> >> +<p>The <a >>href="./api/org/apache/tika/parser/iptc/IptcAnpaParser.html">IptcAnpaPars >>er</a> class supports the IPTC ANPA News Wire feed format.</p></div> >> <div class="section"> >> <h3><a name="Help_formats">Help formats</a></h3> >> <p>The <a >>href="./api/org/apache/tika/parser/chm/ChmParser.html">ChmParser</a> >>class supports the CHM Help format.</p></div> >> @@ -167,6 +171,7 @@ >> <div class="section"> >> <h3><a name="Mail_formats">Mail formats</a></h3> >> <p>The <a >>href="./api/org/apache/tika/parser/mbox/MboxParser.html">MboxParser</a> >>can extract email messages from the mbox format used by many email >>archives and Unix-style mailboxes.</p> >> +<p>The <a >>href="./api/org/apache/tika/parser/mail/RFC822Parser.html">RFC822Parser</ >>a> can process single email messages in the RFC 822 format used by many >>email clients in their archives / exports.</p> >> <p>The <a >>href="./api/org/apache/tika/parser/mbox/PSTParser.html">PSDParser</a> >>can extract email messages from the Microsoft Outlook PST email >>format.</p></div> >> <div class="section"> >> <h3><a name="CAD_formats">CAD formats</a></h3> >> @@ -175,8 +180,16 @@ >> <h3><a name="Font_formats">Font formats</a></h3> >> <p>The <a >>href="./api/org/apache/tika/parser/font/TrueTypeParser.html">TrueTypePars 
>>er</a> class can extract simple metadata from the TrueType font format. >>The <a >>href="./api/org/apache/tika/parser/font/AdobeFontMetricParser.html">Adobe >>FontMetricParser</a> class does something similar for Adobe Font Metrics >>files.</p></div> >> <div class="section"> >> +<h3><a name="Scientific_formats">Scientific formats</a></h3> >> +<p>The <a >>href="./api/org/apache/tika/parser/hdf/HDFParser.html">HDFParser</a> is >>able to extract attribute metadata from the HDF scientific file >>format.</p> >> +<p>The <a >>href="./api/org/apache/tika/parser/netcdf/NetCDFParser.html">NetCDFParser >></a> is able to extract attribute metadata from the NetCDF scientific >>file format.</p> >> +<p>The <a >>href="./api/org/apache/tika/parser/mat/MatParser.html">MatParser</a> is >>able to extract attribute metadata from the Matlab scientific file >>format.</p></div> >> +<div class="section"> >> <h3><a name="Executable_programs_and_libraries">Executable programs and >>libraries</a></h3> >> -<p>The <a >>href="./api/org/apache/tika/parser/executable/ExecutableParser.html">Exec >>utableParser</a> can extract metadata information on platforms, >>architectures and types from a range of executable formats and >>libraries, such as Windows Executables and Linux / BSD programs and >>libraries.</p></div></div> >> +<p>The <a >>href="./api/org/apache/tika/parser/executable/ExecutableParser.html">Exec >>utableParser</a> can extract metadata information on platforms, >>architectures and types from a range of executable formats and >>libraries, such as Windows Executables and Linux / BSD programs and >>libraries.</p></div> >> +<div class="section"> >> +<h3><a name="Crypto_formats">Crypto formats</a></h3> >> +<p>The <a >>href="./api/org/apache/tika/parser/crypto/Pkcs7Parser.html">Pkcs7Parser</ >>a> is able to parse the contents of PKCS7 signed messages, but doesn't >>include any information from the outer PKCS7 wrapper.</p></div></div> >> <div class="section"> >> <h2>Full list of supported 
formats:<a >>name="Full_list_of_supported_formats:"></a></h2> >> <ul> >> @@ -270,6 +283,9 @@ >> <li>org.apache.tika.parser.mail.<a >>href="./api/org/apache/tika/parser/mail/RFC822Parser">RFC822Parser</a> >> <ul> >> <li>message/rfc822</li></ul></li> >> +<li>org.apache.tika.parser.mat.<a >>href="./api/org/apache/tika/parser/mat/MatParser">MatParser</a> >> +<ul> >> +<li>application/x-matlab-data</li></ul></li> >> <li>org.apache.tika.parser.mbox.<a >>href="./api/org/apache/tika/parser/mbox/MboxParser">MboxParser</a> >> <ul> >> <li>application/mbox</li></ul></li> >> >> Added: tika/site/publish/1.6/gettingstarted.html >> URL: >>http://svn.apache.org/viewvc/tika/site/publish/1.6/gettingstarted.html?re >>v=1622762&view=auto >> >>========================================================================= >>===== >> --- tika/site/publish/1.6/gettingstarted.html (added) >> +++ tika/site/publish/1.6/gettingstarted.html Fri Sep 5 19:14:58 2014 >> @@ -0,0 +1,413 @@ >> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" >> + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> >> + >> +<!-- >> + Licensed to the Apache Software Foundation (ASF) under one >> + or more contributor license agreements. See the NOTICE file >> + distributed with this work for additional information >> + regarding copyright ownership. The ASF licenses this file >> + to you under the Apache License, Version 2.0 (the >> + "License"); you may not use this file except in compliance >> + with the License. You may obtain a copy of the License at >> + >> + http://www.apache.org/licenses/LICENSE-2.0 >> + >> + Unless required by applicable law or agreed to in writing, >> + software distributed under the License is distributed on an >> + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY >> + KIND, either express or implied. See the License for the >> + specific language governing permissions and limitations >> + under the License. 
>> +--> >> + >> + >> + >> + >> + >> + >> + >> +<html xmlns="http://www.w3.org/1999/xhtml"> >> + <head> >> + <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >>/> >> + <title>Apache Tika - Getting Started with Apache Tika</title> >> + <style type="text/css" media="all"> >> + @import url("../css/site.css"); >> + </style> >> + <link rel="icon" type="image/png" href="../tikaNoText16.png" /> >> + <script type="text/javascript"> >> + function selectProvider(form) { >> + provider = form.elements['searchProvider'].value; >> + if (provider == "any") { >> + if (Math.random() > 0.5) { >> + provider = "lucid"; >> + } else { >> + provider = "sl"; >> + } >> + } >> + if (provider == "lucid") { >> + form.action = "http://find.searchhub.org/p:tika"; >> + } else if (provider == "sl") { >> + form.action = "http://search-lucene.com/tika"; >> + } >> + days = 90; >> + date = new Date(); >> + date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000)); >> + expires = "; expires=" + date.toGMTString(); >> + document.cookie = "searchProvider=" + provider + expires + "; >>path=/"; >> + } >> + function initProvider() { >> + if (document.cookie.length>0) { >> + cStart=document.cookie.indexOf("searchProvider="); >> + if (cStart!=-1) { >> + cStart=cStart + "searchProvider=".length; >> + cEnd=document.cookie.indexOf(";", cStart); >> + if (cEnd==-1) { >> + cEnd=document.cookie.length; >> + } >> + provider = >>unescape(document.cookie.substring(cStart,cEnd)); >> + >>document.forms['searchform'].elements['searchProvider'].value = provider; >> + } >> + } >> + document.forms['searchform'].elements['q'].focus(); >> + } >> + </script> >> + </head> >> + <body onLoad="initProvider();"> >> + <div id="body"> >> + <div id="banner"> >> + <a href="http://tika.apache.org" id="bannerLeft" title="Apache >>Tika" >> + ><img src="http://tika.apache.org/tika.png" alt="Apache Tika" >> + width="292" height="100"/></a> >> + <a href="http://www.apache.org/" id="bannerRight" >> + title="The Apache 
Software Foundation" >> + ><img src="http://tika.apache.org/asf-logo.gif" alt="The >>Apache Software Foundation" >> + width="387" height="100"/></a> >> + </div> >> + <div id="content"> >> + <!-- Licensed to the Apache Software Foundation (ASF) under >>one or more --><!-- contributor license agreements. See the NOTICE file >>distributed with --><!-- this work for additional information regarding >>copyright ownership. --><!-- The ASF licenses this file to You under the >>Apache License, Version 2.0 --><!-- (the "License"); you may not use >>this file except in compliance with --><!-- the License. You may obtain >>a copy of the License at --><!-- --><!-- >>http://www.apache.org/licenses/LICENSE-2.0 --><!-- --><!-- Unless >>required by applicable law or agreed to in writing, software --><!-- >>distributed under the License is distributed on an "AS IS" BASIS, >>--><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >>implied. --><!-- See the License for the specific language governing >>permissions and --><!-- limitations under the License. --><div >>class="section"> >> +<h2>Getting Started with Apache Tika<a >>name="Getting_Started_with_Apache_Tika"></a></h2> >> +<p>This document describes how to build Apache Tika from sources and >>how to start using Tika in an application.</p></div> >> +<div class="section"> >> +<h2>Getting and building the sources<a >>name="Getting_and_building_the_sources"></a></h2> >> +<p>To build Tika from sources you first need to either <a >>href="../download.html">download</a> a source release or <a >>href="../source-repository.html">checkout</a> the latest sources from >>version control.</p> >> +<p>Once you have the sources, you can build them using the <a >>class="externalLink" href="http://maven.apache.org/">Maven 2</a> build >>system. 
Executing the following command in the base directory will build >>the sources and install the resulting artifacts in your local Maven >>repository.</p> >> +<div> >> +<pre>mvn install</pre></div> >> +<p>See the Maven documentation for more information about the >>available build options.</p> >> +<p>Note that you need Java 6 or higher to build Tika.</p></div> >> +<div class="section"> >> +<h2>Build artifacts<a name="Build_artifacts"></a></h2> >> +<p>The Tika build consists of a number of components and produces the >>following main binaries:</p> >> +<dl> >> +<dt>tika-core/target/tika-core-*.jar</dt> >> +<dd> Tika core library. Contains the core interfaces and classes of >>Tika, but none of the parser implementations. Depends only on Java >>6.</dd> >> +<dt>tika-parsers/target/tika-parsers-*.jar</dt> >> +<dd> Tika parsers. Collection of classes that implement the Tika >>Parser interface based on various external parser libraries.</dd> >> +<dt>tika-app/target/tika-app-*.jar</dt> >> +<dd> Tika application. Combines the above components and all the >>external parser libraries into a single runnable jar with a GUI and a >>command line interface.</dd> >> +<dt>tika-bundle/target/tika-bundle-*.jar</dt> >> +<dd> Tika bundle. An OSGi bundle that combines tika-parsers with >>non-OSGified parser libraries to make them easy to deploy in an OSGi >>environment.</dd></dl></div> >> +<div class="section"> >> +<h2>Using Tika as a Maven dependency<a >>name="Using_Tika_as_a_Maven_dependency"></a></h2> >> +<p>The core library, tika-core, contains the key interfaces and >>classes of Tika and can be used by itself if you don't need the full set >>of parsers from the tika-parsers component. 
The tika-core dependency >>looks like this:</p> >> +<div> >> +<pre> <dependency> >> + <groupId>org.apache.tika</groupId> >> + <artifactId>tika-core</artifactId> >> + <version>...</version> >> + </dependency></pre></div> >> +<p>If you want to use Tika to parse documents (instead of simply >>detecting document types, etc.), you'll want to depend on tika-parsers >>instead: </p> >> +<div> >> +<pre> <dependency> >> + <groupId>org.apache.tika</groupId> >> + <artifactId>tika-parsers</artifactId> >> + <version>...</version> >> + </dependency></pre></div> >> +<p>Note that adding this dependency will introduce a number of >>transitive dependencies to your project, including one on tika-core. You >>need to make sure that these dependencies won't conflict with your >>existing project dependencies. You can use the following command in the >>tika-parsers directory to get a full listing of all the dependencies.</p> >> +<div> >> +<pre>$ mvn dependency:tree | grep :compile</pre></div></div> >> +<div class="section"> >> +<h2>Using Tika in an Ant project<a >>name="Using_Tika_in_an_Ant_project"></a></h2> >> +<p>Unless you use a dependency manager tool like <a >>class="externalLink" href="http://ant.apache.org/ivy/">Apache Ivy</a>, >>the easiest way to use Tika is to include either the tika-core or the >>tika-app jar in your classpath, depending on whether you want just the >>core functionality or also all the parser implementations.</p> >> +<div> >> +<pre><classpath> >> + ... 
<!-- your other classpath entries --> >> + >> + <!-- either: --> >> + <pathelement >>location="path/to/tika-core-${tika.version}.jar"/> >> + <!-- or: --> >> + <pathelement >>location="path/to/tika-app-${tika.version}.jar"/> >> + >> +</classpath></pre></div></div> >> +<div class="section"> >> +<h2>Using Tika as a command line utility<a >>name="Using_Tika_as_a_command_line_utility"></a></h2> >> +<p>The Tika application jar (tika-app-*.jar) can be used as a command >>line utility for extracting text content and metadata from all sorts of >>files. This runnable jar contains all the dependencies it needs, so you >>don't need to worry about classpath settings to run it.</p> >> +<p>The usage instructions are shown below.</p> >> +<div> >> +<pre>usage: java -jar tika-app.jar [option...] [file|port...] >> + >> +Options: >> + -? or --help Print this usage message >> + -v or --verbose Print debug level messages >> + -V or --version Print the Apache Tika version number >> + >> + -g or --gui Start the Apache Tika GUI >> + -s or --server Start the Apache Tika server >> + -f or --fork Use Fork Mode for out-of-process extraction >> + >> + -x or --xml Output XHTML content (default) >> + -h or --html Output HTML content >> + -t or --text Output plain text content >> + -T or --text-main Output plain text content (main content >>only) >> + -m or --metadata Output only metadata >> + -j or --json Output metadata in JSON >> + -y or --xmp Output metadata in XMP >> + -l or --language Output only language >> + -d or --detect Detect document type >> + -eX or --encoding=X Use output encoding X >> + -pX or --password=X Use document password X >> + -z or --extract Extract all attachements into current >>directory >> + --extract-dir=<dir> Specify target directory for -z >> + -r or --pretty-print For XML and XHTML outputs, adds newlines and >> + whitespace, for better readability >> + >> + --create-profile=X >> + Create NGram profile, where X is a profile name >> + --list-parsers >> + List the 
available document parsers >> + --list-parser-details >> + List the available document parsers, and their supported mime >>types >> + --list-detectors >> + List the available document detectors >> + --list-met-models >> + List the available metadata models, and their supported keys >> + --list-supported-types >> + List all known media types and related information >> + >> +Description: >> + Apache Tika will parse the file(s) specified on the >> + command line and output the extracted text content >> + or metadata to standard output. >> + >> + Instead of a file name you can also specify the URL >> + of a document to be parsed. >> + >> + If no file name or URL is specified (or the special >> + name "-" is used), then the standard input stream >> + is parsed. If no arguments were given and no input >> + data is available, the GUI is started instead. >> + >> +- GUI mode >> + >> + Use the "--gui" (or "-g") option to start the >> + Apache Tika GUI. You can drag and drop files from >> + a normal file explorer to the GUI window to extract >> + text content and metadata from the files. >> + >> +- Server mode >> + >> + Use the "--server" (or "-s") option to start >>the >> + Apache Tika server. 
The server will listen to the >> + ports you specify as one or more arguments.</pre></div> >> +<p>You can also use the jar as a component in a Unix pipeline or as an >>external tool in many scripting languages.</p> >> +<div> >> +<pre># Check if an Internet resource contains a specific keyword >> +curl http://.../document.doc \ >> + | java -jar tika-app.jar --text \ >> + | grep -q keyword</pre></div></div> >> + </div> >> + <div id="sidebar"> >> + <div id="navigation"> >> + <h5>Apache Tika</h5> >> + <ul> >> + >> + <li class="none"> >> + <a href="../index.html">Introduction</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../download.html">Download</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../contribute.html">Contribute</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../mail-lists.html">Mailing Lists</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://wiki.apache.org/tika/" >>class="externalLink">Tika Wiki</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="https://issues.apache.org/jira/browse/TIKA" >>class="externalLink">Issue Tracker</a> >> + </li> >> + </ul> >> + <h5>Documentation</h5> >> + <ul> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="expanded"> >> + <a href="../1.5/index.html">Apache Tika 1.5</a> >> + <ul> >> + >> + <li class="none"> >> + <a href="../1.5/gettingstarted.html">Getting >>Started</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/formats.html">Supported Formats</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/parser.html">Parser API</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/parser_guide.html">Parser 5min >>Quick Start Guide</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/detection.html">Content and >>Language Detection</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/api/">API Documentation</a> >> + </li> >> + </ul> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> 
>> + <a href="../1.4/index.html">Apache Tika 1.4</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.3/index.html">Apache Tika 1.3</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.2/index.html">Apache Tika 1.2</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.1/index.html">Apache Tika 1.1</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.0/index.html">Apache Tika 1.0</a> >> + </li> >> + </ul> >> + <h5>The Apache Software Foundation</h5> >> + <ul> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/foundation/" >>class="externalLink">About</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/licenses/" >>class="externalLink">License</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/security/" >>class="externalLink">Security</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="http://www.apache.org/foundation/sponsorship.html" >>class="externalLink">Sponsorship</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="http://www.apache.org/foundation/thanks.html" >>class="externalLink">Thanks</a> >> + </li> >> + </ul> >> + >> + <div id="search"> >> + <h5>Search with Apache Solr</h5> >> + <form action="http://search.lucidimagination.com/p:tika" >> + method="get" id="searchform"> >> + <input type="text" id="query" name="q"/> >> + <select name="searchProvider" id="searchProvider"> >> + <option value="any">provider</option> >> + <option value="lucid">Lucid Find</option> >> + <option value="sl">Search-Lucene</option> >> + </select> >> + <input type="submit" id="submit" value="Search" >>name="Search" >> + onclick="selectProvider(this.form)"/> >> + </form> >> + </div> >> + >> + <div id="bookpromo"> >> + <h5>Books about Tika</h5> >> + <p> >> + <a 
href="http://manning.com/mattmann/" title="Tika in >>Action" >> + ><img src="../mattmann_cover150.jpg" >> + width="150" height="186"/></a> >> + </p> >> + </div> >> + </div> >> + </div> >> + <div id="footer"> >> + <p> >> + Copyright © 2014 >> + <a href="http://www.apache.org/">The Apache Software >>Foundation</a>. >> + Site powered by <a href="http://maven.apache.org/">Apache >>Maven</a>. >> + Search powered by >> + <a href="http://www.lucidimagination.com">Lucid >>Imagination</a> >> + and <a href="http://sematext.com">Sematext</a>. >> + <br/> >> + Apache Tika, Tika, Apache, the Apache feather logo, and the >>Apache >> + Tika project logo are trademarks of The Apache Software >>Foundation. >> + </p> >> + </div> >> + </div> >> + </body> >> +</html> >> >> Added: tika/site/publish/1.6/parser.html >> URL: >>http://svn.apache.org/viewvc/tika/site/publish/1.6/parser.html?rev=162276 >>2&view=auto >> >>========================================================================= >>===== >> --- tika/site/publish/1.6/parser.html (added) >> +++ tika/site/publish/1.6/parser.html Fri Sep 5 19:14:58 2014 >> @@ -0,0 +1,372 @@ >> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" >> + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> >> + >> +<!-- >> + Licensed to the Apache Software Foundation (ASF) under one >> + or more contributor license agreements. See the NOTICE file >> + distributed with this work for additional information >> + regarding copyright ownership. The ASF licenses this file >> + to you under the Apache License, Version 2.0 (the >> + "License"); you may not use this file except in compliance >> + with the License. You may obtain a copy of the License at >> + >> + http://www.apache.org/licenses/LICENSE-2.0 >> + >> + Unless required by applicable law or agreed to in writing, >> + software distributed under the License is distributed on an >> + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY >> + KIND, either express or implied. 
See the License for the >> + specific language governing permissions and limitations >> + under the License. >> +--> >> + >> + >> + >> + >> + >> + >> + >> +<html xmlns="http://www.w3.org/1999/xhtml"> >> + <head> >> + <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >>/> >> + <title>Apache Tika - The Parser interface</title> >> + <style type="text/css" media="all"> >> + @import url("../css/site.css"); >> + </style> >> + <link rel="icon" type="image/png" href="../tikaNoText16.png" /> >> + <script type="text/javascript"> >> + function selectProvider(form) { >> + provider = form.elements['searchProvider'].value; >> + if (provider == "any") { >> + if (Math.random() > 0.5) { >> + provider = "lucid"; >> + } else { >> + provider = "sl"; >> + } >> + } >> + if (provider == "lucid") { >> + form.action = "http://find.searchhub.org/p:tika"; >> + } else if (provider == "sl") { >> + form.action = "http://search-lucene.com/tika"; >> + } >> + days = 90; >> + date = new Date(); >> + date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000)); >> + expires = "; expires=" + date.toGMTString(); >> + document.cookie = "searchProvider=" + provider + expires + "; >>path=/"; >> + } >> + function initProvider() { >> + if (document.cookie.length>0) { >> + cStart=document.cookie.indexOf("searchProvider="); >> + if (cStart!=-1) { >> + cStart=cStart + "searchProvider=".length; >> + cEnd=document.cookie.indexOf(";", cStart); >> + if (cEnd==-1) { >> + cEnd=document.cookie.length; >> + } >> + provider = >>unescape(document.cookie.substring(cStart,cEnd)); >> + >>document.forms['searchform'].elements['searchProvider'].value = provider; >> + } >> + } >> + document.forms['searchform'].elements['q'].focus(); >> + } >> + </script> >> + </head> >> + <body onLoad="initProvider();"> >> + <div id="body"> >> + <div id="banner"> >> + <a href="http://tika.apache.org" id="bannerLeft" title="Apache >>Tika" >> + ><img src="http://tika.apache.org/tika.png" alt="Apache Tika" >> + width="292" 
height="100"/></a> >> + <a href="http://www.apache.org/" id="bannerRight" >> + title="The Apache Software Foundation" >> + ><img src="http://tika.apache.org/asf-logo.gif" alt="The >>Apache Software Foundation" >> + width="387" height="100"/></a> >> + </div> >> + <div id="content"> >> + <!-- Licensed to the Apache Software Foundation (ASF) under >>one or more --><!-- contributor license agreements. See the NOTICE file >>distributed with --><!-- this work for additional information regarding >>copyright ownership. --><!-- The ASF licenses this file to You under the >>Apache License, Version 2.0 --><!-- (the "License"); you may not use >>this file except in compliance with --><!-- the License. You may obtain >>a copy of the License at --><!-- --><!-- >>http://www.apache.org/licenses/LICENSE-2.0 --><!-- --><!-- Unless >>required by applicable law or agreed to in writing, software --><!-- >>distributed under the License is distributed on an "AS IS" BASIS, >>--><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >>implied. --><!-- See the License for the specific language governing >>permissions and --><!-- limitations under the License. --><div >>class="section"> >> +<h2>The Parser interface<a name="The_Parser_interface"></a></h2> >> +<p>The <a >>href="./api/org/apache/tika/parser/Parser.html">org.apache.tika.parser.Pa >>rser</a> interface is the key concept of Apache Tika. It hides the >>complexity of different file formats and parsing libraries while >>providing a simple and powerful mechanism for client applications to >>extract structured text content and metadata from all sorts of >>documents. 
All this is achieved with a single method:</p> >> +<div> >> +<pre>void parse( >> + InputStream stream, ContentHandler handler, Metadata metadata, >> + ParseContext context) throws IOException, SAXException, >>TikaException;</pre></div> >> +<p>The <tt>parse</tt> method takes the document to be parsed and >>related metadata as input and outputs the results as XHTML SAX events >>and extra metadata. The parse context argument is used to specify >>context information (like the current locale) that is not related to any >>individual document. The main criteria that led to this design were:</p> >> +<dl> >> +<dt>Streamed parsing</dt> >> +<dd>The interface should require neither the client application nor >>the parser implementation to keep the full document content in memory or >>spooled to disk. This allows even huge documents to be parsed without >>excessive resource requirements.</dd> >> +<dt>Structured content</dt> >> +<dd>A parser implementation should be able to include structural >>information (headings, links, etc.) in the extracted content. A client >>application can use this information for example to better judge the >>relevance of different parts of the parsed document.</dd> >> +<dt>Input metadata</dt> >> +<dd>A client application should be able to include metadata like the >>file name or declared content type with the document to be parsed. The >>parser implementation can use this information to better guide the >>parsing process.</dd> >> +<dt>Output metadata</dt> >> +<dd>A parser implementation should be able to return document metadata >>in addition to document content. Many document formats contain metadata >>like the name of the author that may be useful to client >>applications.</dd> >> +<dt>Context sensitivity</dt> >> +<dd>While the default settings and behaviour of Tika parsers should >>work well for most use cases, there are still situations where more >>fine-grained control over the parsing process is desirable.
It should be >>easy to inject such context-specific information to the parsing process >>without breaking the layers of abstraction.</dd></dl> >> +<p>These criteria are reflected in the arguments of the <tt>parse</tt> >>method.</p> >> +<div class="section"> >> +<h3>Document input stream<a name="Document_input_stream"></a></h3> >> +<p>The first argument is an <a class="externalLink" >>href="http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html"> >>InputStream</a> for reading the document to be parsed.</p> >> +<p>If this document stream can not be read, then parsing stops and the >>thrown <a class="externalLink" >>href="http://docs.oracle.com/javase/6/docs/api/java/io/IOException.html"> >>IOException</a> is passed up to the client application. If the stream >>can be read but not parsed (for example if the document is corrupted), >>then the parser throws a <a >>href="./api/org/apache/tika/exception/TikaException.html">TikaException</ >>a>.</p> >> +<p>The parser implementation will consume this stream but <i>will not >>close it</i>. Closing the stream is the responsibility of the client >>application that opened it in the first place. The recommended pattern >>for using streams with the <tt>parse</tt> method is:</p> >> +<div> >> +<pre>InputStream stream = ...; // open the stream >> +try { >> + parser.parse(stream, ...); // parse the stream >> +} finally { >> + stream.close(); // close the stream >> +}</pre></div> >> +<p>Some document formats like the OLE2 Compound Document Format used >>by Microsoft Office are best parsed as random access files. In such >>cases the content of the input stream is automatically spooled to a >>temporary file that gets removed once parsed. A future version of Tika >>may make it possible to avoid this extra file if the input document is >>already a file in the local file system. 
See <a class="externalLink" >>href="https://issues.apache.org/jira/browse/TIKA-153">TIKA-153</a> for >>the status of this feature request.</p></div> >> +<div class="section"> >> +<h3>XHTML SAX events<a name="XHTML_SAX_events"></a></h3> >> +<p>The parsed content of the document stream is returned to the client >>application as a sequence of XHTML SAX events. XHTML is used to express >>structured content of the document and SAX events enable streamed >>processing. Note that the XHTML format is used here only to convey >>structural information, not to render the documents for browsing!</p> >> +<p>The XHTML SAX events produced by the parser implementation are sent >>to a <a class="externalLink" >>href="http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler >>.html">ContentHandler</a> instance given to the <tt>parse</tt> method. >>If this content handler fails to process an event, then parsing >>stops and the thrown <a class="externalLink" >>href="http://docs.oracle.com/javase/6/docs/api/org/xml/sax/SAXException.h >>tml">SAXException</a> is passed up to the client application.</p> >> +<p>The overall structure of the generated event stream is (with >>indenting added for clarity):</p> >> +<div> >> +<pre><html xmlns="http://www.w3.org/1999/xhtml"> >> + <head> >> + <title>...</title> >> + </head> >> + <body> >> + ...
>> + </body> >> +</html></pre></div> >> +<p>Parser implementations typically use the <a >>href="./apidocs/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLConten >>tHandler</a> utility class to generate the XHTML output.</p> >> +<p>Dealing with the raw SAX events can be a bit complex, so Apache >>Tika comes with a number of utility classes that can be used to process >>and convert the event stream to other representations.</p> >> +<p>For example, the <a >>href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandl >>er</a> class can be used to extract just the body part of the XHTML >>output and feed it either as SAX events to another content handler or as >>characters to an output stream, a writer, or simply a string. The >>following code snippet parses a document from the standard input stream >>and outputs the extracted text content to standard output:</p> >> +<div> >> +<pre>ContentHandler handler = new BodyContentHandler(System.out); >> +parser.parse(System.in, handler, ...);</pre></div> >> +<p>Another useful class is <a >>href="./api/org/apache/tika/parser/ParsingReader.html">ParsingReader</a> >>that uses a background thread to parse the document and returns the >>extracted text content as a character stream:</p> >> +<div> >> +<pre>InputStream stream = ...; // the document to be parsed >> +Reader reader = new ParsingReader(parser, stream, ...); >> +try { >> + ...; // read the document text using the reader >> +} finally { >> + reader.close(); // the document stream is closed >>automatically >> +}</pre></div></div> >> +<div class="section"> >> +<h3>Document metadata<a name="Document_metadata"></a></h3> >> +<p>The third argument to the <tt>parse</tt> method is used to pass >>document metadata both in and out of the parser. 
Document metadata is >>expressed as an <a >>href="./api/org/apache/tika/metadata/Metadata.html">Metadata</a> >>object.</p> >> +<p>The following are some of the more interesting metadata >>properties:</p> >> +<dl> >> +<dt>Metadata.RESOURCE_NAME_KEY</dt> >> +<dd>The name of the file or resource that contains the document. >> +<p>A client application can set this property to allow the parser to >>use file name heuristics to determine the format of the document.</p> >> +<p>The parser implementation may set this property if the file format >>contains the canonical name of the file (for example the Gzip format has >>a slot for the file name).</p></dd> >> +<dt>Metadata.CONTENT_TYPE</dt> >> +<dd>The declared content type of the document. >> +<p>A client application can set this property based on for example a >>HTTP Content-Type header. The declared content type may help the parser >>to correctly interpret the document.</p> >> +<p>The parser implementation sets this property to the content type >>according to which the document was parsed.</p></dd> >> +<dt>Metadata.TITLE</dt> >> +<dd>The title of the document. >> +<p>The parser implementation sets this property if the document format >>contains an explicit title field.</p></dd> >> +<dt>Metadata.AUTHOR</dt> >> +<dd>The name of the author of the document. >> +<p>The parser implementation sets this property if the document format >>contains an explicit author field.</p></dd></dl> >> +<p>Note that metadata handling is still being discussed by the Tika >>development team, and it is likely that there will be some (backwards >>incompatible) changes in metadata handling before Tika 1.0.</p></div> >> +<div class="section"> >> +<h3>Parse context<a name="Parse_context"></a></h3> >> +<p>The final argument to the <tt>parse</tt> method is used to inject >>context-specific information to the parsing process. This is useful for >>example when dealing with locale-specific date and number formats in >>Microsoft Excel spreadsheets. 
Another important use of the parse context >>is passing in the delegate parser instance to be used by two-phase >>parsers like the <a >>href="./api/org/apache/parser/pkg/PackageParser.html">PackageParser</a> >>subclasses. Some parser classes allow customization of the parsing >>process through strategy objects in the parse context.</p></div> >> +<div class="section"> >> +<h3>Parser implementations<a name="Parser_implementations"></a></h3> >> +<p>Apache Tika comes with a number of parser classes for parsing <a >>href="./formats.html">various document formats</a>. You can also extend >>Tika with your own parsers, and of course any contributions to Tika are >>warmly welcome.</p> >> +<p>The goal of Tika is to reuse existing parser libraries like <a >>class="externalLink" href="http://www.pdfbox.org/">PDFBox</a> or <a >>class="externalLink" href="http://poi.apache.org/">Apache POI</a> as >>much as possible, and so most of the parser classes in Tika are adapters >>to such external libraries.</p> >> +<p>Tika also contains some general purpose parser implementations that >>are not targeted at any specific document formats. The most notable of >>these is the <a >>href="./apidocs/org/apache/tika/parser/AutoDetectParser.html">AutoDetectP >>arser</a> class that encapsulates all Tika functionality into a single >>parser that can handle any types of documents. 
This parser will >>automatically determine the type of the incoming document based on >>various heuristics and will then parse the document >>accordingly.</p></div></div> >> + </div> >> + <div id="sidebar"> >> + <div id="navigation"> >> + <h5>Apache Tika</h5> >> + <ul> >> + >> + <li class="none"> >> + <a href="../index.html">Introduction</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../download.html">Download</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../contribute.html">Contribute</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../mail-lists.html">Mailing Lists</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://wiki.apache.org/tika/" >>class="externalLink">Tika Wiki</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="https://issues.apache.org/jira/browse/TIKA" >>class="externalLink">Issue Tracker</a> >> + </li> >> + </ul> >> + <h5>Documentation</h5> >> + <ul> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="expanded"> >> + <a href="../1.5/index.html">Apache Tika 1.5</a> >> + <ul> >> + >> + <li class="none"> >> + <a href="../1.5/gettingstarted.html">Getting >>Started</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/formats.html">Supported Formats</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/parser.html">Parser API</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/parser_guide.html">Parser 5min >>Quick Start Guide</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/detection.html">Content and >>Language Detection</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/api/">API Documentation</a> >> + </li> >> + </ul> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.4/index.html">Apache Tika 1.4</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.3/index.html">Apache Tika 1.3</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> 
+ >> + >> + <li class="collapsed"> >> + <a href="../1.2/index.html">Apache Tika 1.2</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.1/index.html">Apache Tika 1.1</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.0/index.html">Apache Tika 1.0</a> >> + </li> >> + </ul> >> + <h5>The Apache Software Foundation</h5> >> + <ul> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/foundation/" >>class="externalLink">About</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/licenses/" >>class="externalLink">License</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/security/" >>class="externalLink">Security</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="http://www.apache.org/foundation/sponsorship.html" >>class="externalLink">Sponsorship</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="http://www.apache.org/foundation/thanks.html" >>class="externalLink">Thanks</a> >> + </li> >> + </ul> >> + >> + <div id="search"> >> + <h5>Search with Apache Solr</h5> >> + <form action="http://search.lucidimagination.com/p:tika" >> + method="get" id="searchform"> >> + <input type="text" id="query" name="q"/> >> + <select name="searchProvider" id="searchProvider"> >> + <option value="any">provider</option> >> + <option value="lucid">Lucid Find</option> >> + <option value="sl">Search-Lucene</option> >> + </select> >> + <input type="submit" id="submit" value="Search" >>name="Search" >> + onclick="selectProvider(this.form)"/> >> + </form> >> + </div> >> + >> + <div id="bookpromo"> >> + <h5>Books about Tika</h5> >> + <p> >> + <a href="http://manning.com/mattmann/" title="Tika in >>Action" >> + ><img src="../mattmann_cover150.jpg" >> + width="150" height="186"/></a> >> + </p> >> + </div> >> + </div> >> + </div> >> + <div id="footer"> >> + <p> >> + Copyright © 2014 >> + <a 
href="http://www.apache.org/">The Apache Software >>Foundation</a>. >> + Site powered by <a href="http://maven.apache.org/">Apache >>Maven</a>. >> + Search powered by >> + <a href="http://www.lucidimagination.com">Lucid >>Imagination</a> >> + and <a href="http://sematext.com">Sematext</a>. >> + <br/> >> + Apache Tika, Tika, Apache, the Apache feather logo, and the >>Apache >> + Tika project logo are trademarks of The Apache Software >>Foundation. >> + </p> >> + </div> >> + </div> >> + </body> >> +</html> >> >> Added: tika/site/publish/1.6/parser_guide.html >> URL: >>http://svn.apache.org/viewvc/tika/site/publish/1.6/parser_guide.html?rev= >>1622762&view=auto >> >>========================================================================= >>===== >> --- tika/site/publish/1.6/parser_guide.html (added) >> +++ tika/site/publish/1.6/parser_guide.html Fri Sep 5 19:14:58 2014 >> @@ -0,0 +1,373 @@ >> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" >> + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> >> + >> +<!-- >> + Licensed to the Apache Software Foundation (ASF) under one >> + or more contributor license agreements. See the NOTICE file >> + distributed with this work for additional information >> + regarding copyright ownership. The ASF licenses this file >> + to you under the Apache License, Version 2.0 (the >> + "License"); you may not use this file except in compliance >> + with the License. You may obtain a copy of the License at >> + >> + http://www.apache.org/licenses/LICENSE-2.0 >> + >> + Unless required by applicable law or agreed to in writing, >> + software distributed under the License is distributed on an >> + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY >> + KIND, either express or implied. See the License for the >> + specific language governing permissions and limitations >> + under the License. 
>> +--> >> + >> + >> + >> + >> + >> + >> + >> +<html xmlns="http://www.w3.org/1999/xhtml"> >> + <head> >> + <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" >>/> >> + <title>Apache Tika - Get Tika parsing up and running in 5 >>minutes</title> >> + <style type="text/css" media="all"> >> + @import url("../css/site.css"); >> + </style> >> + <link rel="icon" type="image/png" href="../tikaNoText16.png" /> >> + <script type="text/javascript"> >> + function selectProvider(form) { >> + provider = form.elements['searchProvider'].value; >> + if (provider == "any") { >> + if (Math.random() > 0.5) { >> + provider = "lucid"; >> + } else { >> + provider = "sl"; >> + } >> + } >> + if (provider == "lucid") { >> + form.action = "http://find.searchhub.org/p:tika"; >> + } else if (provider == "sl") { >> + form.action = "http://search-lucene.com/tika"; >> + } >> + days = 90; >> + date = new Date(); >> + date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000)); >> + expires = "; expires=" + date.toGMTString(); >> + document.cookie = "searchProvider=" + provider + expires + "; >>path=/"; >> + } >> + function initProvider() { >> + if (document.cookie.length>0) { >> + cStart=document.cookie.indexOf("searchProvider="); >> + if (cStart!=-1) { >> + cStart=cStart + "searchProvider=".length; >> + cEnd=document.cookie.indexOf(";", cStart); >> + if (cEnd==-1) { >> + cEnd=document.cookie.length; >> + } >> + provider = >>unescape(document.cookie.substring(cStart,cEnd)); >> + >>document.forms['searchform'].elements['searchProvider'].value = provider; >> + } >> + } >> + document.forms['searchform'].elements['q'].focus(); >> + } >> + </script> >> + </head> >> + <body onLoad="initProvider();"> >> + <div id="body"> >> + <div id="banner"> >> + <a href="http://tika.apache.org" id="bannerLeft" title="Apache >>Tika" >> + ><img src="http://tika.apache.org/tika.png" alt="Apache Tika" >> + width="292" height="100"/></a> >> + <a href="http://www.apache.org/" id="bannerRight" >> + 
title="The Apache Software Foundation" >> + ><img src="http://tika.apache.org/asf-logo.gif" alt="The >>Apache Software Foundation" >> + width="387" height="100"/></a> >> + </div> >> + <div id="content"> >> + <!-- Licensed to the Apache Software Foundation (ASF) under >>one or more --><!-- contributor license agreements. See the NOTICE file >>distributed with --><!-- this work for additional information regarding >>copyright ownership. --><!-- The ASF licenses this file to You under the >>Apache License, Version 2.0 --><!-- (the "License"); you may not use >>this file except in compliance with --><!-- the License. You may obtain >>a copy of the License at --><!-- --><!-- >>http://www.apache.org/licenses/LICENSE-2.0 --><!-- --><!-- Unless >>required by applicable law or agreed to in writing, software --><!-- >>distributed under the License is distributed on an "AS IS" BASIS, >>--><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >>implied. --><!-- See the License for the specific language governing >>permissions and --><!-- limitations under the License. --><div >>class="section"> >> +<h2>Get Tika parsing up and running in 5 minutes<a >>name="Get_Tika_parsing_up_and_running_in_5_minutes"></a></h2> >> +<p>This page is a quick start guide showing how to add a new parser to >>Apache Tika. 
Following the simple steps listed below, your new parser can >>be running in only 5 minutes.</p> >> +<ul> >> +<li><a href="#Get_Tika_parsing_up_and_running_in_5_minutes">Get Tika >>parsing up and running in 5 minutes</a> >> +<ul> >> +<li><a href="#Getting_Started">Getting Started</a></li> >> +<li><a href="#Add_your_MIME-Type">Add your MIME-Type</a></li> >> +<li><a href="#Create_your_Parser_class">Create your Parser >>class</a></li> >> +<li><a href="#List_the_new_parser">List the new >>parser</a></li></ul></li></ul> >> +<div class="section"> >> +<h3><a name="Getting_Started">Getting Started</a></h3> >> +<p>The <a href="./gettingstarted.html">Getting Started</a> document >>describes how to build Apache Tika from sources and how to start using >>Tika in an application. Pay close attention and follow the instructions >>in the "Getting and building the sources" section.</p></div> >> +<div class="section"> >> +<h3><a name="Add_your_MIME-Type">Add your MIME-Type</a></h3> >> +<p>Tika loads the core, standard MIME-Types from the file >>"org/apache/tika/mime/tika-mimetypes.xml", which comes from <a >>class="externalLink" >>href="http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resou >>rces/org/apache/tika/mime/tika-mimetypes.xml">tika-core/src/main/resource >>s/org/apache/tika/mime/tika-mimetypes.xml</a>. If your new MIME-Type is >>a standard one that is missing from Tika, submit a patch for this >>file!</p> >> +<p>If your MIME-Type is not a standard one and needs adding, create a new file >>"org/apache/tika/mime/custom-mimetypes.xml" in your codebase >>and add something like this to it:</p> >> +<div> >> +<pre> <?xml version="1.0" encoding="UTF-8"?> >> + <mime-info> >> + <mime-type type="application/hello"> >> + <glob pattern="*.hi"/> >> + </mime-type> >> + </mime-info></pre></div></div> >> +<div class="section"> >> +<h3><a name="Create_your_Parser_class">Create your Parser >>class</a></h3> >> +<p>Now, you need to create your new parser. 
This is a class that must >>implement the Parser interface offered by Tika. Instead of implementing >>the Parser interface directly, it is recommended that you extend the >>abstract class AbstractParser if possible. AbstractParser handles >>translating between API changes for you.</p> >> +<p>A very simple Tika Parser looks like this:</p> >> +<div> >> +<pre>/* >> + * Licensed to the Apache Software Foundation (ASF) under one or more >> + * contributor license agreements. See the NOTICE file distributed >>with >> + * this work for additional information regarding copyright ownership. >> + * The ASF licenses this file to You under the Apache License, Version >>2.0 >> + * (the "License"); you may not use this file except in >>compliance with >> + * the License. You may obtain a copy of the License at >> + * >> + * http://www.apache.org/licenses/LICENSE-2.0 >> + * >> + * Unless required by applicable law or agreed to in writing, software >> + * distributed under the License is distributed on an "AS >>IS" BASIS, >> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >>implied. >> + * See the License for the specific language governing permissions and >> + * limitations under the License. 
>> + * >> + * @Author: Arturo Beltran >> + */ >> +package org.apache.tika.parser.hello; >> + >> +import java.io.IOException; >> +import java.io.InputStream; >> +import java.util.Collections; >> +import java.util.Set; >> + >> +import org.apache.tika.exception.TikaException; >> +import org.apache.tika.metadata.Metadata; >> +import org.apache.tika.mime.MediaType; >> +import org.apache.tika.parser.ParseContext; >> +import org.apache.tika.parser.AbstractParser; >> +import org.apache.tika.sax.XHTMLContentHandler; >> +import org.xml.sax.ContentHandler; >> +import org.xml.sax.SAXException; >> + >> +public class HelloParser extends AbstractParser { >> + >> + private static final Set<MediaType> SUPPORTED_TYPES = >>Collections.singleton(MediaType.application("hello")); >> + public static final String HELLO_MIME_TYPE = >>"application/hello"; >> + >> + public Set<MediaType> getSupportedTypes(ParseContext >>context) { >> + return SUPPORTED_TYPES; >> + } >> + >> + public void parse( >> + InputStream stream, ContentHandler handler, >> + Metadata metadata, ParseContext context) >> + throws IOException, SAXException, >>TikaException { >> + >> + metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE); >> + metadata.set("Hello", "World"); >> + >> + XHTMLContentHandler xhtml = new >>XHTMLContentHandler(handler, metadata); >> + xhtml.startDocument(); >> + xhtml.endDocument(); >> + } >> +}</pre></div> >> +<p>Pay special attention to the SUPPORTED_TYPES static field in the >>parser class, which defines the MIME-Types the parser >>supports. If your MIME-Types aren't standard ones, ensure you have listed >>them in a "custom-mimetypes.xml" file so that Tika knows about >>them (see above).</p> >> +<p>The "parse" method is where you will do all your work. 
>>That is, extract the information from the resource and then set the >>metadata.</p></div> >> +<div class="section"> >> +<h3><a name="List_the_new_parser">List the new parser</a></h3> >> +<p>Finally, you should explicitly tell the AutoDetectParser to include >>your new parser. This step is only needed if you want to use the >>AutoDetectParser functionality. If you select the correct parser in >>a different way, this step isn't needed.</p> >> +<p>List your new parser in: <a class="externalLink" >>href="http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/re >>sources/META-INF/services/org.apache.tika.parser.Parser">tika-parsers/src >>/main/resources/META-INF/services/org.apache.tika.parser.Parser</a></p></ >>div></div> >> + </div> >> + <div id="sidebar"> >> + <div id="navigation"> >> + <h5>Apache Tika</h5> >> + <ul> >> + >> + <li class="none"> >> + <a href="../index.html">Introduction</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../download.html">Download</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../contribute.html">Contribute</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../mail-lists.html">Mailing Lists</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://wiki.apache.org/tika/" >>class="externalLink">Tika Wiki</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="https://issues.apache.org/jira/browse/TIKA" >>class="externalLink">Issue Tracker</a> >> + </li> >> + </ul> >> + <h5>Documentation</h5> >> + <ul> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="expanded"> >> + <a href="../1.5/index.html">Apache Tika 1.5</a> >> + <ul> >> + >> + <li class="none"> >> + <a href="../1.5/gettingstarted.html">Getting >>Started</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/formats.html">Supported Formats</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/parser.html">Parser API</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/parser_guide.html">Parser 5min 
>>Quick Start Guide</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/detection.html">Content and >>Language Detection</a> >> + </li> >> + >> + <li class="none"> >> + <a href="../1.5/api/">API Documentation</a> >> + </li> >> + </ul> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.4/index.html">Apache Tika 1.4</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.3/index.html">Apache Tika 1.3</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.2/index.html">Apache Tika 1.2</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.1/index.html">Apache Tika 1.1</a> >> + </li> >> + >> + >> + >> + >> + >> + >> + >> + >> + >> + <li class="collapsed"> >> + <a href="../1.0/index.html">Apache Tika 1.0</a> >> + </li> >> + </ul> >> + <h5>The Apache Software Foundation</h5> >> + <ul> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/foundation/" >>class="externalLink">About</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/licenses/" >>class="externalLink">License</a> >> + </li> >> + >> + <li class="none"> >> + <a href="http://www.apache.org/security/" >>class="externalLink">Security</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="http://www.apache.org/foundation/sponsorship.html" >>class="externalLink">Sponsorship</a> >> + </li> >> + >> + <li class="none"> >> + <a >>href="http://www.apache.org/foundation/thanks.html" >>class="externalLink">Thanks</a> >> + </li> >> + </ul> >> + >> + <div id="search"> >> + <h5>Search with Apache Solr</h5> >> + <form action="http://search.lucidimagination.com/p:tika" >> + method="get" id="searchform"> >> + <input type="text" id="query" name="q"/> >> + <select name="searchProvider" id="searchProvider"> >> + <option value="any">provider</option> >> + <option 
value="lucid">Lucid Find</option> >> + <option value="sl">Search-Lucene</option> >> + </select> >> + <input type="submit" id="submit" value="Search" >>name="Search" >> + onclick="selectProvider(this.form)"/> >> + </form> >> + </div> >> + >> + <div id="bookpromo"> >> + <h5>Books about Tika</h5> >> + <p> >> + <a href="http://manning.com/mattmann/" title="Tika in >>Action" >> + ><img src="../mattmann_cover150.jpg" >> + width="150" height="186"/></a> >> + </p> >> + </div> >> + </div> >> + </div> >> + <div id="footer"> >> + <p> >> + Copyright © 2014 >> + <a href="http://www.apache.org/">The Apache Software >>Foundation</a>. >> + Site powered by <a href="http://maven.apache.org/">Apache >>Maven</a>. >> + Search powered by >> + <a href="http://www.lucidimagination.com">Lucid >>Imagination</a> >> + and <a href="http://sematext.com">Sematext</a>. >> + <br/> >> + Apache Tika, Tika, Apache, the Apache feather logo, and the >>Apache >> + Tika project logo are trademarks of The Apache Software >>Foundation. >> + </p> >> + </div> >> + </div> >> + </body> >> +</html> >> >>
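A note on the "List the new parser" step in the quoted parser guide: the META-INF/services/org.apache.tika.parser.Parser file is a plain-text provider list with one fully qualified class name per line. As a minimal sketch, assuming the HelloParser example keeps the package shown in the guide (org.apache.tika.parser.hello) and that the existing entries in the file stay untouched, the added line might look like this:

```
# Assumed entry for the guide's example parser; keep all existing entries
org.apache.tika.parser.hello.HelloParser
```

This only matters when the AutoDetectParser is used; a parser that is instantiated directly does not need the entry.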
