Added: tika/site/src/site/apt/1.11/detection.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.11/detection.apt?rev=1710493&view=auto ============================================================================== --- tika/site/src/site/apt/1.11/detection.apt (added) +++ tika/site/src/site/apt/1.11/detection.apt Sun Oct 25 22:30:51 2015 @@ -0,0 +1,211 @@ + ----------------- + Content Detection + ----------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Content Detection + + This page gives you information on how content and language detection + works with Apache Tika, and how to tune the behaviour of Tika. + +%{toc|section=1|fromDepth=1} + +* {The Detector Interface} + + The + {{{./api/org/apache/tika/detect/Detector.html}org.apache.tika.detect.Detector}} + interface is the basis for most of the content type detection in Apache + Tika. All the different ways of detecting content all implement the + same common method: + +--- +MediaType detect(java.io.InputStream input, + Metadata metadata) throws java.io.IOException +--- + + The <<<detect>>> method takes the stream to inspect, and a + <<<Metadata>>> object that holds any additional information on + the content. 
The detector will return a + {{{./api/org/apache/tika/mime/MediaType.html}MediaType}} object describing + its best guess as to the type of the file. + + In general, only two keys on the Metadata object are used by Detectors. + These are <<<Metadata.RESOURCE_NAME_KEY>>> which should hold the name + of the file (where known), and <<<Metadata.CONTENT_TYPE>>> which should + hold the advertised content type of the file (e.g. from a webserver or + a content repository). + + +* {Mime Magic Detection} + + By looking for special ("magic") patterns of bytes near the start of + the file, it is often possible to detect the type of the file. For + some file types, this is a simple process. For others, typically + container based formats, the magic detection may not be enough. (More + detail on detecting container formats below.) + + Tika is able to make use of a mime magic info file, in the + {{{http://www.freedesktop.org/standards/shared-mime-info}Freedesktop MIME-info}} + format to perform mime magic detection. (Note that Tika supports a few + more match types than Freedesktop does.) + + This is provided within Tika by + {{{./api/org/apache/tika/detect/MagicDetector.html}org.apache.tika.detect.MagicDetector}}. It is most commonly accessed via + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}, + normally sourced from the <<<tika-mimetypes.xml>>> and <<<custom-mimetypes.xml>>> + files. For more information on defining your own custom mimetypes, see + {{{./parser_guide.html#Add_your_MIME-Type}the new parser guide}}. + + +* {Resource Name Based Detection} + + Where the name of the file is known, it is sometimes possible to guess + the file type from the name or extension. Within the + <<<tika-mimetypes.xml>>> file is a list of patterns which are used to + identify the type from the filename. + + However, because files may be renamed, this method of detection is quick + but not always accurate.
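As a rough illustration of why name based detection is quick but easy to fool, here is a minimal sketch that uses only the JDK. It is not Tika's NameDetector: the class name, the method name and the three-entry extension table are hypothetical, whereas Tika takes its patterns from tika-mimetypes.xml.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; this is NOT how Tika's NameDetector is implemented.
class NameSketch {

    // Hypothetical extension table. Tika reads its filename patterns
    // from tika-mimetypes.xml instead of hard-coding them.
    private static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    static {
        BY_EXTENSION.put(".pdf", "application/pdf");
        BY_EXTENSION.put(".csv", "text/csv");
        BY_EXTENSION.put(".txt", "text/plain");
    }

    // Guess a media type from the resource name alone, without reading any content.
    static String detectByName(String resourceName) {
        if (resourceName != null) {
            String lower = resourceName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : BY_EXTENSION.entrySet()) {
                if (lower.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        // Unknown or missing name: fall back to the generic binary type.
        return "application/octet-stream";
    }
}
```

With this sketch, a PDF that someone saved as "notes.txt" is reported as text/plain. That is exactly the kind of inaccuracy that comes from trusting the name alone.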
+ + This is provided within Tika by + {{{./api/org/apache/tika/detect/NameDetector.html}org.apache.tika.detect.NameDetector}}. + + +* {Known Content Type "Detection"} + + Sometimes, the mime type for a file is already known, such as when + downloading from a webserver, or when retrieving from a content store. + This information can be used by detectors, such as + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}. + + +* {The default Mime Types Detector} + + By default, the mime type detection in Tika is provided by + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}. + This detector makes use of <<<tika-mimetypes.xml>>> to power + magic based and filename based detection. + + Firstly, magic based detection is used on the start of the file. + If the file is an XML file, then the start of the XML is processed + to look for root elements. Next, if available, the filename + (from <<<Metadata.RESOURCE_NAME_KEY>>>) is + then used to improve the detail of the detection, such as when magic + detects a text file, and the filename hints it's really a CSV. Finally, + if available, the supplied content type (from <<<Metadata.CONTENT_TYPE>>>) + is used to further refine the type. + + +* {Container Aware Detection} + + Several common file formats are actually held within a common container + format. One example is the PowerPoint .ppt and Word .doc formats, which + are both held within an OLE2 container. Another is Apple iWork formats, + which are actually a series of XML files within a Zip file. + + Using magic detection, it is easy to spot that a given file is an OLE2 + document, or a Zip file. Using magic detection alone, it is very difficult + (and often impossible) to tell what kind of file lives inside the container. + + For some use cases, speed is important, so having a quick way to know the + container type is sufficient. For other cases however, you don't mind + spending a bit of time (and memory!)
processing the container to get a + more accurate answer on its contents. For these cases, the additional + container aware detectors contained in the <<<Tika Parsers>>> jar should + be used. + + Tika provides a wrapping detector in the form of + {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}}. + This uses the service loader to discover all available detectors, including + any available container aware ones, and tries them in turn. For container + aware detection, include the <<<Tika Parsers>>> jar and its dependencies + in your project, then use DefaultDetector along with a <<<TikaInputStream>>>. + + Because these container detectors need to read the whole file to open and + inspect the container, they must be used with a + {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}. + If called with a regular <<<InputStream>>>, then all work will be done + by the default Mime Magic detection only. + + For more information on container formats and Tika, see + {{{http://wiki.apache.org/tika/MetadataDiscussion}}} + + +* {The default Tika Detector} + + Just as with Parsers, Tika provides a special detector + {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}} + which auto-detects (based on service files) the available detectors at + runtime, and tries these in turn to identify the file type. + + If only <<<Tika Core>>> is available, the Default Detector will work only + with Mime Magic and Resource Name detection. However, if <<<Tika Parsers>>> + (and its dependencies!) are available, additional detectors which know about + containers (such as zip and ole2) will be used as appropriate, provided that + detection is being performed with a + {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}.
+ Custom detectors can also be used as desired; they simply need to be listed + in a service file much as is done for + {{{./parser_guide.html#List_the_new_parser}custom parsers}}. + + +* {Ways of triggering Detection} + + The simplest way to detect is through the + {{{./api/org/apache/tika/Tika.html}Tika Facade class}}, which provides methods to + detect based on + {{{./api/org/apache/tika/Tika.html##detect(java.io.File)}File}}, + {{{./api/org/apache/tika/Tika.html##detect(java.io.InputStream)}InputStream}}, + {{{./api/org/apache/tika/Tika.html##detect(java.io.InputStream, java.lang.String)}InputStream and Filename}}, + {{{./api/org/apache/tika/Tika.html##detect(java.lang.String)}Filename}} or a few others. + It works best with a File or + {{{./api/org/apache/tika/io/TikaInputStream.html}TikaInputStream}}. + + Alternatively, detection can be performed on a specific Detector, or using + <<<DefaultDetector>>> to have all available Detectors used. A typical pattern + would be something like: + +--- +TikaConfig tika = new TikaConfig(); + +for (File f : myListOfFiles) { + Metadata metadata = new Metadata(); + metadata.set(Metadata.RESOURCE_NAME_KEY, f.toString()); + MediaType mimetype = tika.getDetector().detect( + TikaInputStream.get(f), metadata); + System.out.println("File " + f + " is " + mimetype); +} +for (InputStream is : myListOfStreams) { + MediaType mimetype = tika.getDetector().detect( + TikaInputStream.get(is), new Metadata()); + System.out.println("Stream " + is + " is " + mimetype); +} +--- + +* {Language Detection} + + Tika is able to help identify the language of a piece of text, which + is useful when extracting text from document formats which do not include + language information in their metadata.
+ + The language detection is provided by + {{{./api/org/apache/tika/language/LanguageIdentifier.html}org.apache.tika.language.LanguageIdentifier}} + +* {More Examples} + + For more examples of Detection using Apache Tika, please take a look at + the {{{./examples.html}Tika Examples page}}.
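The order used by the default Mime Types detector described earlier (magic bytes first, then the file name to refine the result) can be sketched with plain JDK code. This is a simplified illustration under stated assumptions, not Tika's MimeTypes implementation: the class and method names are hypothetical, only two magic patterns are checked, and the "looks like text" test is a crude ASCII heuristic.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative sketch only; this is NOT Tika's MimeTypes detector.
class MagicThenNameSketch {

    // Two well known magic prefixes; Tika loads many more from tika-mimetypes.xml.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    // prefix: the first bytes of the file; resourceName: may be null if unknown.
    static String detect(byte[] prefix, String resourceName) {
        // Step 1: magic detection on the start of the content.
        String type = "application/octet-stream";
        if (startsWith(prefix, PDF_MAGIC)) {
            type = "application/pdf";
        } else if (startsWith(prefix, ZIP_MAGIC)) {
            type = "application/zip";
        } else if (looksLikeText(prefix)) {
            type = "text/plain";
        }
        // Step 2: the name only refines a generic text result, it never overrides magic.
        if (type.equals("text/plain") && resourceName != null
                && resourceName.toLowerCase(Locale.ROOT).endsWith(".csv")) {
            type = "text/csv";
        }
        return type;
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Crude heuristic: non-empty and made only of printable ASCII or common whitespace.
    private static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        for (byte b : data) {
            boolean printable = b >= 0x20 && b < 0x7f;
            boolean whitespace = b == '\t' || b == '\n' || b == '\r';
            if (!printable && !whitespace) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch a Zip based file is only ever reported as application/zip, whatever it is called. That mirrors the limitation that the container aware detectors and <<<TikaInputStream>>> exist to address.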
Added: tika/site/src/site/apt/1.11/gettingstarted.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.11/gettingstarted.apt?rev=1710493&view=auto ============================================================================== --- tika/site/src/site/apt/1.11/gettingstarted.apt (added) +++ tika/site/src/site/apt/1.11/gettingstarted.apt Sun Oct 25 22:30:51 2015 @@ -0,0 +1,217 @@ + -------------------------------- + Getting Started with Apache Tika + -------------------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Getting Started with Apache Tika + + This document describes how to build Apache Tika from sources and + how to start using Tika in an application. + +Getting and building the sources + + To build Tika from sources you first need to either + {{{../download.html}download}} a source release or + {{{../source-repository.html}checkout}} the latest sources from + version control. + + Once you have the sources, you can build them using the + {{{http://maven.apache.org/}Maven 2}} build system. Executing the + following command in the base directory will build the sources + and install the resulting artifacts in your local Maven repository. 
+ +--- +mvn install +--- + + See the Maven documentation for more information about the available + build options. + + Note that you need Java 7 or higher to build Tika. + +Build artifacts + + The Tika build consists of a number of components and produces + the following main binaries: + + [tika-core/target/tika-core-*.jar] + Tika core library. Contains the core interfaces and classes of Tika, + but none of the parser implementations. Depends only on Java 6. + + [tika-parsers/target/tika-parsers-*.jar] + Tika parsers. Collection of classes that implement the Tika Parser + interface based on various external parser libraries. + + [tika-app/target/tika-app-*.jar] + Tika application. Combines the above components and all the external + parser libraries into a single runnable jar with a GUI and a command + line interface. + + [tika-server/target/tika-server-*.jar] + Tika JAX-RS REST application. This is a Jetty web server running Tika + REST services as described in {{{http://wiki.apache.org/tika/TikaJAXRS}this page}}. + + [tika-bundle/target/tika-bundle-*.jar] + Tika bundle. An OSGi bundle that combines tika-parsers with non-OSGified + parser libraries to make them easy to deploy in an OSGi environment. + +Using Tika as a Maven dependency + + The core library, tika-core, contains the key interfaces and classes of Tika + and can be used by itself if you don't need the full set of parsers from + the tika-parsers component. 
The tika-core dependency looks like this: + +--- + <dependency> + <groupId>org.apache.tika</groupId> + <artifactId>tika-core</artifactId> + <version>...</version> + </dependency> +--- + + If you want to use Tika to parse documents (instead of simply detecting + document types, etc.), you'll want to depend on tika-parsers instead: + +--- + <dependency> + <groupId>org.apache.tika</groupId> + <artifactId>tika-parsers</artifactId> + <version>...</version> + </dependency> +--- + + Note that adding this dependency will introduce a number of + transitive dependencies to your project, including one on tika-core. + You need to make sure that these dependencies won't conflict with your + existing project dependencies. You can use the following command in + the tika-parsers directory to get a full listing of all the dependencies. + +--- +$ mvn dependency:tree | grep :compile +--- + +Using Tika in an Ant project + + Unless you use a dependency manager tool like + {{{http://ant.apache.org/ivy/}Apache Ivy}}, the easiest way to use + Tika is to include either the tika-core or the tika-app jar in your + classpath, depending on whether you want just the core functionality + or also all the parser implementations. + +--- +<classpath> + ... <!-- your other classpath entries --> + + <!-- either: --> + <pathelement location="path/to/tika-core-${tika.version}.jar"/> + <!-- or: --> + <pathelement location="path/to/tika-app-${tika.version}.jar"/> + +</classpath> +--- + +Using Tika as a command line utility + + The Tika application jar (tika-app-*.jar) can be used as a command + line utility for extracting text content and metadata from all sorts of + files. This runnable jar contains all the dependencies it needs, so + you don't need to worry about classpath settings to run it. + + The usage instructions are shown below. + +--- +usage: java -jar tika-app.jar [option...] [file|port...] + +Options: + -? 
or --help Print this usage message + -v or --verbose Print debug level messages + -V or --version Print the Apache Tika version number + + -g or --gui Start the Apache Tika GUI + -s or --server Start the Apache Tika server + -f or --fork Use Fork Mode for out-of-process extraction + + -x or --xml Output XHTML content (default) + -h or --html Output HTML content + -t or --text Output plain text content + -T or --text-main Output plain text content (main content only) + -m or --metadata Output only metadata + -j or --json Output metadata in JSON + -y or --xmp Output metadata in XMP + -l or --language Output only language + -d or --detect Detect document type + -eX or --encoding=X Use output encoding X + -pX or --password=X Use document password X + -z or --extract Extract all attachements into current directory + --extract-dir=<dir> Specify target directory for -z + -r or --pretty-print For XML and XHTML outputs, adds newlines and + whitespace, for better readability + + --create-profile=X + Create NGram profile, where X is a profile name + --list-parsers + List the available document parsers + --list-parser-details + List the available document parsers, and their supported mime types + --list-detectors + List the available document detectors + --list-met-models + List the available metadata models, and their supported keys + --list-supported-types + List all known media types and related information + +Description: + Apache Tika will parse the file(s) specified on the + command line and output the extracted text content + or metadata to standard output. + + Instead of a file name you can also specify the URL + of a document to be parsed. + + If no file name or URL is specified (or the special + name "-" is used), then the standard input stream + is parsed. If no arguments were given and no input + data is available, the GUI is started instead. + +- GUI mode + + Use the "--gui" (or "-g") option to start the + Apache Tika GUI. 
You can drag and drop files from + a normal file explorer to the GUI window to extract + text content and metadata from the files. + +- Server mode + + Use the "--server" (or "-s") option to start the + Apache Tika server. The server will listen to the + ports you specify as one or more arguments. +--- + + You can also use the jar as a component in a Unix pipeline or + as an external tool in many scripting languages. + +--- +# Check if an Internet resource contains a specific keyword +curl http://.../document.doc \ + | java -jar tika-app.jar --text \ + | grep -q keyword +--- + +Wrappers + + Several wrappers are available to use Tika in another programming language, + such as {{{https://github.com/aviks/Taro.jl}Julia}} or {{{https://github.com/chrismattmann/tika-python}Python}}. Added: tika/site/src/site/apt/1.11/index.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.11/index.apt?rev=1710493&view=auto ============================================================================== --- tika/site/src/site/apt/1.11/index.apt (added) +++ tika/site/src/site/apt/1.11/index.apt Sun Oct 25 22:30:51 2015 @@ -0,0 +1,128 @@ + ---------------- + Apache Tika 1.11 + ---------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Apache Tika 1.11 + + The most notable changes in Tika 1.11 over the previous release are: + + * Fix regression with spacing in PPT via Andreas Beeker + ({{{http://issues.apache.org/jira/browse/TIKA-1777}TIKA-1777}}). + + * Java7 API support for allowing java.nio.file.Path as method arguments + was added to Tika and to ParsingReader, TikaFileTypeDetector, and to + Tika Config ({{{http://issues.apache.org/jira/browse/TIKA-1745}TIKA-1745}}, + {{{http://issues.apache.org/jira/browse/TIKA-1746}TIKA-1746}}, + {{{http://issues.apache.org/jira/browse/TIKA-1751}TIKA-1751}}). + + * MIME support was added for WebVTT: The Web Video Text Tracks Format + files ({{{http://issues.apache.org/jira/browse/TIKA-1772}TIKA-1772}}). + + * MIME magic improved to ensure emails are detected as message/rfc822 + ({{{http://issues.apache.org/jira/browse/TIKA-1771}TIKA-1771}}). + + * Upgrade to Jackcess Encrypt 2.1.1 to avoid binary incompatibility + with Bouncy Castle + ({{{http://issues.apache.org/jira/browse/TIKA-1736}TIKA-1736}}). + + * Make div and other markup more consistent between PPT and + PPTX ({{{http://issues.apache.org/jira/browse/TIKA-1755}TIKA-1755}}). + + * Parse multiple authors from MSOffice's semi-colon delimited + author field ({{{http://issues.apache.org/jira/browse/TIKA-1765}TIKA-1765}}). + + * Include CTAKESConfig.properties within tika-parsers resources + by default + ({{{http://issues.apache.org/jira/browse/TIKA-1741}TIKA-1741}}). + + * Prevent infinite recursion when processing inline images + in PDF files by limiting extraction of duplicate images + within the same page + ({{{http://issues.apache.org/jira/browse/TIKA-1742}TIKA-1742}}). + + * Upgrade to POI 3.13-final (via Andreas Beeker) + ({{{http://issues.apache.org/jira/browse/TIKA-1707}TIKA-1707}}). + + * Upgraded tika-batch to use Path throughout (TIKA-1747 and + TIKA-1754).
+ + * Upgraded to Path in TikaInputStream (via Yaniv Kunda) + ({{{http://issues.apache.org/jira/browse/TIKA-1744}TIKA-1744}}). + + * Changed default content handler type for "/rmeta" in tika-server + to "xml" to align with "-J" option in tika-app. + Clients can now specify handler types via PathParam. + ({{{http://issues.apache.org/jira/browse/TIKA-1716}TIKA-1716}}). + + * The fantastic GROBID (or Grobid) GeneRation Of BIbliographic Data + for machine learning from PDF files is now integrated as a + Tika parser + ({{{http://issues.apache.org/jira/browse/TIKA-1699}TIKA-1699}}, + {{{http://issues.apache.org/jira/browse/TIKA-1712}TIKA-1712}}). + + * The ability to specify the Tesseract Config Path was added + to the OCR Parser + ({{{http://issues.apache.org/jira/browse/TIKA-1703}TIKA-1703}}). + + * Upgraded to ASM 5.0.4 + ({{{http://issues.apache.org/jira/browse/TIKA-1705}TIKA-1705}}). + + * Corrected Tika Config XML detector definition explicit loading + of MimeTypes + ({{{http://issues.apache.org/jira/browse/TIKA-1708}TIKA-1708}}) + + * In Tika Parsers, Batch, Server, App and Examples, use Apache + Commons IO instead of inlined ex-Commons classes, and the Java 7 + Standard Charset definitions + ({{{http://issues.apache.org/jira/browse/TIKA-1710}TIKA-1710}}) + + * Upgraded to Commons Compress 1.10, which enables zlib compressed + archives support + ({{{http://issues.apache.org/jira/browse/TIKA-1718}TIKA-1718}}) + + + The following people have contributed to Tika 1.11 by submitting or + commenting on the issues resolved in this release: + + * Alexander Widera + + * Bob Paulin + + * Chris A. Mattmann + + * Christian Wolfe + + * Jeremy B. Merrill + + * Jukka Zitting + + * Justin Palmer + + * Konstantin Gribov + + * Lewis John McGibbney + + * Nick Burch + + * Sujen Shah + + * Tim Allison + + * Yaniv Kunda + + See {{http://s.apache.org/fSj}} for more details on these contributions. 
Added: tika/site/src/site/apt/1.11/parser.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.11/parser.apt?rev=1710493&view=auto ============================================================================== --- tika/site/src/site/apt/1.11/parser.apt (added) +++ tika/site/src/site/apt/1.11/parser.apt Sun Oct 25 22:30:51 2015 @@ -0,0 +1,251 @@ + -------------------- + The Parser interface + -------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +The Parser interface + + The + {{{./api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}} + interface is the key concept of Apache Tika. It hides the complexity of + different file formats and parsing libraries while providing a simple and + powerful mechanism for client applications to extract structured text + content and metadata from all sorts of documents. All this is achieved + with a single method: + +--- +void parse( + InputStream stream, ContentHandler handler, Metadata metadata, + ParseContext context) throws IOException, SAXException, TikaException; +--- + + The <<<parse>>> method takes the document to be parsed and related metadata + as input and outputs the results as XHTML SAX events and extra metadata. 
+ The parse context argument is used to specify context information (like + the current locale) that is not related to any individual document. + The main criteria that lead to this design were: + + [Streamed parsing] The interface should require neither the client + application nor the parser implementation to keep the full document + content in memory or spooled to disk. This allows even huge documents + to be parsed without excessive resource requirements. + + [Structured content] A parser implementation should be able to + include structural information (headings, links, etc.) in the extracted + content. A client application can use this information for example to + better judge the relevance of different parts of the parsed document. + + [Input metadata] A client application should be able to include metadata + like the file name or declared content type with the document to be + parsed. The parser implementation can use this information to better + guide the parsing process. + + [Output metadata] A parser implementation should be able to return + document metadata in addition to document content. Many document + formats contain metadata like the name of the author that may be useful + to client applications. + + [Context sensitivity] While the default settings and behaviour of Tika + parsers should work well for most use cases, there are still situations + where more fine-grained control over the parsing process is desirable. + It should be easy to inject such context-specific information to the + parsing process without breaking the layers of abstraction. + + [] + + These criteria are reflected in the arguments of the <<<parse>>> method. + +* Document input stream + + The first argument is an + {{{http://docs.oracle.com/javase/6/docs/api/java/io/InputStream.html}InputStream}} + for reading the document to be parsed.
+ + If this document stream can not be read, then parsing stops and the thrown + {{{http://docs.oracle.com/javase/6/docs/api/java/io/IOException.html}IOException}} + is passed up to the client application. If the stream can be read but + not parsed (for example if the document is corrupted), then the parser + throws a {{{./api/org/apache/tika/exception/TikaException.html}TikaException}}. + + The parser implementation will consume this stream but <will not close it>. + Closing the stream is the responsibility of the client application that + opened it in the first place. The recommended pattern for using streams + with the <<<parse>>> method is: + +--- +InputStream stream = ...; // open the stream +try { + parser.parse(stream, ...); // parse the stream +} finally { + stream.close(); // close the stream +} +--- + + Some document formats like the OLE2 Compound Document Format used by + Microsoft Office are best parsed as random access files. In such cases the + content of the input stream is automatically spooled to a temporary file + that gets removed once parsed. A future version of Tika may make it possible + to avoid this extra file if the input document is already a file in the + local file system. See + {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status + of this feature request. + +* XHTML SAX events + + The parsed content of the document stream is returned to the client + application as a sequence of XHTML SAX events. XHTML is used to express + structured content of the document and SAX events enable streamed + processing. Note that the XHTML format is used here only to convey + structural information, not to render the documents for browsing! + + The XHTML SAX events produced by the parser implementation are sent to a + {{{http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}} + instance given to the <<<parse>>> method. 
If the content handler + fails to process an event, then parsing stops and the thrown + {{{http://docs.oracle.com/javase/6/docs/api/org/xml/sax/SAXException.html}SAXException}} + is passed up to the client application. + + The overall structure of the generated event stream is (with indenting + added for clarity): + +--- +<html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <title>...</title> + </head> + <body> + ... + </body> +</html> +--- + + Parser implementations typically use the + {{{./api/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}} + utility class to generate the XHTML output. + + Dealing with the raw SAX events can be a bit complex, so Apache Tika + comes with a number of utility classes that can be used to process and + convert the event stream to other representations. + + For example, the + {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}} + class can be used to extract just the body part of the XHTML output and + feed it either as SAX events to another content handler or as characters + to an output stream, a writer, or simply a string. The following code + snippet parses a document from the standard input stream and outputs the + extracted text content to standard output: + +--- +ContentHandler handler = new BodyContentHandler(System.out); +parser.parse(System.in, handler, ...); +--- + + Another useful class is + {{{./api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that + uses a background thread to parse the document and returns the extracted + text content as a character stream: + +--- +InputStream stream = ...; // the document to be parsed +Reader reader = new ParsingReader(parser, stream, ...); +try { + ...; // read the document text using the reader +} finally { + reader.close(); // the document stream is closed automatically +} +--- + +* Document metadata + + The third argument to the <<<parse>>> method is used to pass document + metadata both in and out of the parser.
Document metadata is expressed + as an {{{./api/org/apache/tika/metadata/Metadata.html}Metadata}} object. + + The following are some of the more interesting metadata properties: + + [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains + the document. + + A client application can set this property to allow the parser to use + file name heuristics to determine the format of the document. + + The parser implementation may set this property if the file format + contains the canonical name of the file (for example the Gzip format + has a slot for the file name). + + [Metadata.CONTENT_TYPE] The declared content type of the document. + + A client application can set this property based on, for example, an HTTP + Content-Type header. The declared content type may help the parser to + correctly interpret the document. + + The parser implementation sets this property to the content type according + to which the document was parsed. + + [Metadata.TITLE] The title of the document. + + The parser implementation sets this property if the document format + contains an explicit title field. + + [Metadata.AUTHOR] The name of the author of the document. + + The parser implementation sets this property if the document format + contains an explicit author field. + + [] + + Note that metadata handling is still being discussed by the Tika development + team, and it is likely that there will be some (backwards incompatible) + changes in metadata handling before Tika 1.0. + +* Parse context + + + The final argument to the <<<parse>>> method is used to inject + context-specific information to the parsing process. This is useful + for example when dealing with locale-specific date and number formats + in Microsoft Excel spreadsheets. Another important use of the parse + context is passing in the delegate parser instance to be used by + two-phase parsers like the + {{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} subclasses.
+ Some parser classes allow customization of the parsing process through + strategy objects in the parse context. + +* Parser implementations + + Apache Tika comes with a number of parser classes for parsing + {{{./formats.html}various document formats}}. You can also extend Tika + with your own parsers, and of course any contributions to Tika are + warmly welcome. + + The goal of Tika is to reuse existing parser libraries like + {{{http://pdfbox.apache.org/}PDFBox}} or + {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most + of the parser classes in Tika are adapters to such external libraries. + + Tika also contains some general purpose parser implementations that are + not targeted at any specific document formats. The most notable of these + is the {{{./api/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}} + class that encapsulates all Tika functionality into a single parser that + can handle any type of document. This parser will automatically determine + the type of the incoming document based on various heuristics and will then + parse the document accordingly. + +* {More Examples} + + For more examples of parsing documents with Apache Tika, please take a look at + the {{{./examples.html}Tika Examples page}}. Added: tika/site/src/site/apt/1.11/parser_guide.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.11/parser_guide.apt?rev=1710493&view=auto ============================================================================== --- tika/site/src/site/apt/1.11/parser_guide.apt (added) +++ tika/site/src/site/apt/1.11/parser_guide.apt Sun Oct 25 22:30:51 2015 @@ -0,0 +1,143 @@ + -------------------------------------------- + Get Tika parsing up and running in 5 minutes + -------------------------------------------- + Arturo Beltran + -------------------------------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements.
See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Get Tika parsing up and running in 5 minutes + + This page is a quick start guide showing how to add a new parser to Apache Tika. + Following the simple steps listed below, your new parser can be running in only 5 minutes. + +%{toc|section=1|fromDepth=1} + +* {Getting Started} + + The {{{./gettingstarted.html}Getting Started}} document describes how to + build Apache Tika from sources and how to start using Tika in an application. Pay close attention + and follow the instructions in the "Getting and building the sources" section. + + +* {Add your MIME-Type} + + Tika loads the core, standard MIME-Types from the file + "org/apache/tika/mime/tika-mimetypes.xml", which comes from + {{{http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}} . + If your new MIME-Type is a standard one which is missing from Tika, + submit a patch for this file! + + If your MIME-Type needs adding, create a new file + "org/apache/tika/mime/custom-mimetypes.xml" in your codebase.
+ You should add to it something like this: + +--- + <?xml version="1.0" encoding="UTF-8"?> + <mime-info> + <mime-type type="application/hello"> + <glob pattern="*.hi"/> + </mime-type> + </mime-info> +--- + +* {Create your Parser class} + + Now, you need to create your new parser. This is a class that must + implement the Parser interface offered by Tika. Instead of implementing + the Parser interface directly, it is recommended that you extend the + abstract class AbstractParser if possible. AbstractParser handles + translating between API changes for you. + + A very simple Tika Parser looks like this: + +--- +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + * @Author: Arturo Beltran + */ +package org.apache.tika.parser.hello; + +import java.io.IOException; +import java.io.InputStream; +import java.util.Collections; +import java.util.Set; + +import org.apache.tika.exception.TikaException; +import org.apache.tika.metadata.Metadata; +import org.apache.tika.mime.MediaType; +import org.apache.tika.parser.ParseContext; +import org.apache.tika.parser.AbstractParser; +import org.apache.tika.sax.XHTMLContentHandler; +import org.xml.sax.ContentHandler; +import org.xml.sax.SAXException; + +public class HelloParser extends AbstractParser { + + private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("hello")); + public static final String HELLO_MIME_TYPE = "application/hello"; + + public Set<MediaType> getSupportedTypes(ParseContext context) { + return SUPPORTED_TYPES; + } + + public void parse( + InputStream stream, ContentHandler handler, + Metadata metadata, ParseContext context) + throws IOException, SAXException, TikaException { + + metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE); + metadata.set("Hello", "World"); + + XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); + xhtml.startDocument(); + xhtml.endDocument(); + } +} +--- + + Pay special attention to the definition of the SUPPORTED_TYPES static class + field in the parser class, which defines what MIME-Types it supports. If + your MIME-Types aren't standard ones, ensure you have listed them in a + "custom-mimetypes.xml" file so that Tika knows about them (see above). + + It is in the "parse" method that you will do all your work, that is, extract + the information from the resource and then set the metadata. + +* {List the new parser} + + Finally, you should explicitly tell the AutoDetectParser to include your new + parser. This step is only needed if you want to use the AutoDetectParser functionality. + If you figure out the correct parser in a different way, it isn't needed.
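 As a rough sketch of that registration: the services file named in this
 section is a standard Java provider-configuration file, with one fully
 qualified class name per line and <<<#>>> starting a comment. Assuming you
 kept the package and class name of the HelloParser example above (adjust
 the name if yours differs), the entry to add might look like this:

---
# Hypothetical entry for the example parser shown earlier
org.apache.tika.parser.hello.HelloParser
---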
+ + List your new parser in: + {{{http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}} + + Modified: tika/site/src/site/apt/download.apt.vm URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/download.apt.vm?rev=1710493&r1=1710492&r2=1710493&view=diff ============================================================================== --- tika/site/src/site/apt/download.apt.vm (original) +++ tika/site/src/site/apt/download.apt.vm Sun Oct 25 22:30:51 2015 @@ -25,18 +25,18 @@ Download Apache Tika * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-${project.parent.version}-src.zip}Mirrors for apache-tika-${project.parent.version}-src.zip}} (source archive, {{{http://www.apache.org/dist/tika/tika-${project.parent.version}-src.zip.asc}PGP signature}})\ - SHA1: <<<b1573adcb194e2c09b77eccc3b1edd16bd4ac67d>>>\ - MD5: <<<092d8bbc51756b180a8d65bbd4620801>>> + SHA1: <<<d0dde7b3a4f1a2fb6ccd741552ea180dddab630a>>>\ + MD5: <<<ccca11a7e5c300e438b2a52012cf4e39>>> * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-app-${project.parent.version}.jar}Mirrors for tika-app-${project.parent.version}.jar}} (runnable jar, {{{http://www.apache.org/dist/tika/tika-app-${project.parent.version}.jar.asc}PGP signature}})\ - SHA1: <<<8803a37c5c9467058a4e116beaa97668dad192e1>>>\ - MD5: <<<a899be6467e446031315926c10b8763c>>>\ + SHA1: <<<59cc7c4c48a6a41899ca282d925b2738d05a45a8>>>\ + MD5: <<<3e133bcb3cd709fddd1bda3eebc1a0e5>>>\ * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-server-${project.parent.version}.jar}Mirrors for tika-server-${project.parent.version}.jar}} (runnable jar, {{{http://www.apache.org/dist/tika/tika-server-${project.parent.version}.jar.asc}PGP signature}})\ - SHA1: <<<7bbecca884fa014d40d4468967e9bbd74a64a273>>>\ - MD5: <<<973965a14c73a93315e756e62a18e8a0>>> + SHA1: <<<c1ca6453573fb7fa1f6b3d81dc4c9847a9a86a62>>>\ + MD5: 
<<<7e28f3288c3bcd0c26ac6f557ddfb977>>> [] Modified: tika/site/src/site/apt/index.apt.vm URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/index.apt.vm?rev=1710493&r1=1710492&r2=1710493&view=diff ============================================================================== --- tika/site/src/site/apt/index.apt.vm (original) +++ tika/site/src/site/apt/index.apt.vm Sun Oct 25 22:30:51 2015 @@ -39,6 +39,15 @@ Apache Tika - a content analysis toolkit Latest News + [25 October 2015: Apache Tika Release] + Apache Tika 1.11 has been released! This release includes several improvements + that better utilize Java7 support, that help extract more content using the + cTAKES clinical extraction system and GROBID journal parser, and improvements + to Tesseract extraction. Please see the + {{{https://dist.apache.org/repos/dist/release/tika/CHANGES-1.11.txt}CHANGES.txt}} + file for a full list of changes in this release and have a look at the download + page for more information on how to obtain Apache Tika 1.11. + [01 August 2015: Apache Tika Release] Apache Tika 1.10 has been released! 
This release includes several improvements including the ability to parse MS Access Files, composite parser creation via Tika Modified: tika/site/src/site/site.xml URL: http://svn.apache.org/viewvc/tika/site/src/site/site.xml?rev=1710493&r1=1710492&r2=1710493&view=diff ============================================================================== --- tika/site/src/site/site.xml (original) +++ tika/site/src/site/site.xml Sun Oct 25 22:30:51 2015 @@ -40,7 +40,17 @@ <item name="Issue Tracker" href="https://issues.apache.org/jira/browse/TIKA"/> </menu> <menu name="Documentation"> - <item name="Apache Tika 1.10" href="1.10/index.html"> + <item name="Apache Tika 1.11" href="1.11/index.html"> + <item name="Getting Started" href="1.11/gettingstarted.html"/> + <item name="Supported Formats" href="1.11/formats.html"/> + <item name="Parser API" href="1.11/parser.html"/> + <item name="Parser 5min Quick Start Guide" href="1.11/parser_guide.html"/> + <item name="Content and Language Detection" href="1.11/detection.html"/> + <item name="Configuring Tika" href="1.11/configuring.html"/> + <item name="Usage Examples" href="1.11/examples.html"/> + <item name="API Documentation" href="1.11/api/"/> + </item> + <item name="Apache Tika 1.10" href="1.10/index.html" collapse="true"> <item name="Getting Started" href="1.10/gettingstarted.html"/> <item name="Supported Formats" href="1.10/formats.html"/> <item name="Parser API" href="1.10/parser.html"/> @@ -69,63 +79,6 @@ <item name="Usage Examples" href="1.8/examples.html"/> <item name="API Documentation" href="1.8/api/"/> </item> - <item name="Apache Tika 1.7" href="1.7/index.html" collapse="true"> - <item name="Getting Started" href="1.7/gettingstarted.html"/> - <item name="Supported Formats" href="1.7/formats.html"/> - <item name="Parser API" href="1.7/parser.html"/> - <item name="Parser 5min Quick Start Guide" href="1.7/parser_guide.html"/> - <item name="Content and Language Detection" href="1.7/detection.html"/> - <item name="Usage 
Examples" href="1.7/examples.html"/> - <item name="API Documentation" href="1.7/api/"/> - </item> - <item name="Apache Tika 1.6" href="1.6/index.html" collapse="true"> - <item name="Getting Started" href="1.6/gettingstarted.html"/> - <item name="Supported Formats" href="1.6/formats.html"/> - <item name="Parser API" href="1.6/parser.html"/> - <item name="Parser 5min Quick Start Guide" href="1.6/parser_guide.html"/> - <item name="Content and Language Detection" href="1.6/detection.html"/> - <item name="API Documentation" href="1.6/api/"/> - </item> - <item name="Apache Tika 1.5" href="1.5/index.html" collapse="true"> - <item name="Getting Started" href="1.5/gettingstarted.html"/> - <item name="Supported Formats" href="1.5/formats.html"/> - <item name="Parser API" href="1.5/parser.html"/> - <item name="Parser 5min Quick Start Guide" href="1.5/parser_guide.html"/> - <item name="Content and Language Detection" href="1.5/detection.html"/> - <item name="API Documentation" href="1.5/api/"/> - </item> - <item name="Apache Tika 1.4" href="1.4/index.html" collapse="true"> - <item name="Getting Started" href="1.4/gettingstarted.html"/> - <item name="Supported Formats" href="1.4/formats.html"/> - <item name="Parser API" href="1.4/parser.html"/> - <item name="Parser 5min Quick Start Guide" href="1.4/parser_guide.html"/> - <item name="Content and Language Detection" href="1.4/detection.html"/> - <item name="API Documentation" href="1.4/api/"/> - </item> - <item name="Apache Tika 1.3" href="1.3/index.html" collapse="true"> - <item name="Getting Started" href="1.3/gettingstarted.html"/> - <item name="Supported Formats" href="1.3/formats.html"/> - <item name="Parser API" href="1.3/parser.html"/> - <item name="Parser 5min Quick Start Guide" href="1.3/parser_guide.html"/> - <item name="Content and Language Detection" href="1.3/detection.html"/> - <item name="API Documentation" href="1.3/api/"/> - </item> - <item name="Apache Tika 1.2" href="1.2/index.html" collapse="true"> - <item 
name="Getting Started" href="1.2/gettingstarted.html"/> - <item name="Supported Formats" href="1.2/formats.html"/> - <item name="Parser API" href="1.2/parser.html"/> - <item name="Parser 5min Quick Start Guide" href="1.2/parser_guide.html"/> - <item name="Content and Language Detection" href="1.2/detection.html"/> - <item name="API Documentation" href="1.2/api/"/> - </item> - <item name="Apache Tika 1.1" href="1.1/index.html" collapse="true"> - <item name="Getting Started" href="1.1/gettingstarted.html"/> - <item name="Supported Formats" href="1.1/formats.html"/> - <item name="Parser API" href="1.1/parser.html"/> - <item name="Parser 5min Quick Start Guide" href="1.1/parser_guide.html"/> - <item name="Content and Language Detection" href="1.1/detection.html"/> - <item name="API Documentation" href="1.1/api/"/> - </item> </menu> <menu name="The Apache Software Foundation"> <item name="About" href="http://www.apache.org/foundation/"/>
