Added: tika/site/src/site/apt/1.4/parser.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.4/parser.apt?rev=1499614&view=auto ============================================================================== --- tika/site/src/site/apt/1.4/parser.apt (added) +++ tika/site/src/site/apt/1.4/parser.apt Thu Jul 4 01:40:44 2013 @@ -0,0 +1,245 @@ + -------------------- + The Parser interface + -------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +The Parser interface + + The + {{{api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}} + interface is the key concept of Apache Tika. It hides the complexity of + different file formats and parsing libraries while providing a simple and + powerful mechanism for client applications to extract structured text + content and metadata from all sorts of documents. All this is achieved + with a single method: + +--- +void parse( + InputStream stream, ContentHandler handler, Metadata metadata, + ParseContext context) throws IOException, SAXException, TikaException; +--- + + The <<<parse>>> method takes the document to be parsed and related metadata + as input and outputs the results as XHTML SAX events and extra metadata. 
+ The parse context argument is used to specify context information (like + the current locale) that is not related to any individual document. + The main criteria that led to this design were: + + [Streamed parsing] The interface should require neither the client + application nor the parser implementation to keep the full document + content in memory or spooled to disk. This allows even huge documents + to be parsed without excessive resource requirements. + + [Structured content] A parser implementation should be able to + include structural information (headings, links, etc.) in the extracted + content. A client application can use this information for example to + better judge the relevance of different parts of the parsed document. + + [Input metadata] A client application should be able to include metadata + like the file name or declared content type with the document to be + parsed. The parser implementation can use this information to better + guide the parsing process. + + [Output metadata] A parser implementation should be able to return + document metadata in addition to document content. Many document + formats contain metadata like the name of the author that may be useful + to client applications. + + [Context sensitivity] While the default settings and behaviour of Tika + parsers should work well for most use cases, there are still situations + where more fine-grained control over the parsing process is desirable. + It should be easy to inject such context-specific information into the + parsing process without breaking the layers of abstraction. + + [] + + These criteria are reflected in the arguments of the <<<parse>>> method. + +* Document input stream + + The first argument is an + {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}} + for reading the document to be parsed. 
+ + If this document stream can not be read, then parsing stops and the thrown + {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}} + is passed up to the client application. If the stream can be read but + not parsed (for example if the document is corrupted), then the parser + throws a {{{api/org/apache/tika/exception/TikaException.html}TikaException}}. + + The parser implementation will consume this stream but <will not close it>. + Closing the stream is the responsibility of the client application that + opened it in the first place. The recommended pattern for using streams + with the <<<parse>>> method is: + +--- +InputStream stream = ...; // open the stream +try { + parser.parse(stream, ...); // parse the stream +} finally { + stream.close(); // close the stream +} +--- + + Some document formats like the OLE2 Compound Document Format used by + Microsoft Office are best parsed as random access files. In such cases the + content of the input stream is automatically spooled to a temporary file + that gets removed once parsed. A future version of Tika may make it possible + to avoid this extra file if the input document is already a file in the + local file system. See + {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status + of this feature request. + +* XHTML SAX events + + The parsed content of the document stream is returned to the client + application as a sequence of XHTML SAX events. XHTML is used to express + structured content of the document and SAX events enable streamed + processing. Note that the XHTML format is used here only to convey + structural information, not to render the documents for browsing! + + The XHTML SAX events produced by the parser implementation are sent to a + {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}} + instance given to the <<<parse>>> method. 
If the content handler + fails to process an event, then parsing stops and the thrown + {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}} + is passed up to the client application. + + The overall structure of the generated event stream is (with indenting + added for clarity): + +--- +<html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <title>...</title> + </head> + <body> + ... + </body> +</html> +--- + + Parser implementations typically use the + {{{apidocs/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}} + utility class to generate the XHTML output. + + Dealing with the raw SAX events can be a bit complex, so Apache Tika + comes with a number of utility classes that can be used to process and + convert the event stream to other representations. + + For example, the + {{{api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}} + class can be used to extract just the body part of the XHTML output and + feed it either as SAX events to another content handler or as characters + to an output stream, a writer, or simply a string. The following code + snippet parses a document from the standard input stream and outputs the + extracted text content to standard output: + +--- +ContentHandler handler = new BodyContentHandler(System.out); +parser.parse(System.in, handler, ...); +--- + + Another useful class is + {{{api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that + uses a background thread to parse the document and returns the extracted + text content as a character stream: + +--- +InputStream stream = ...; // the document to be parsed +Reader reader = new ParsingReader(parser, stream, ...); +try { + ...; // read the document text using the reader +} finally { + reader.close(); // the document stream is closed automatically +} +--- + +* Document metadata + + The third argument to the <<<parse>>> method is used to pass document + metadata both in and out of the parser. 
Document metadata is expressed + as an {{{api/org/apache/tika/metadata/Metadata.html}Metadata}} object. + + The following are some of the more interesting metadata properties: + + [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains + the document. + + A client application can set this property to allow the parser to use + file name heuristics to determine the format of the document. + + The parser implementation may set this property if the file format + contains the canonical name of the file (for example the Gzip format + has a slot for the file name). + + [Metadata.CONTENT_TYPE] The declared content type of the document. + + A client application can set this property based on, for example, an HTTP + Content-Type header. The declared content type may help the parser to + correctly interpret the document. + + The parser implementation sets this property to the content type according + to which the document was parsed. + + [Metadata.TITLE] The title of the document. + + The parser implementation sets this property if the document format + contains an explicit title field. + + [Metadata.AUTHOR] The name of the author of the document. + + The parser implementation sets this property if the document format + contains an explicit author field. + + [] + + Note that metadata handling is still being discussed by the Tika development + team, and it is likely that there will be some (backwards incompatible) + changes in metadata handling before Tika 1.0. + +* Parse context + + The final argument to the <<<parse>>> method is used to inject + context-specific information into the parsing process. This is useful + for example when dealing with locale-specific date and number formats + in Microsoft Excel spreadsheets. Another important use of the parse + context is passing in the delegate parser instance to be used by + two-phase parsers like the + {{{api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} subclasses. 
+ Some parser classes allow customization of the parsing process through + strategy objects in the parse context. + +* Parser implementations + + Apache Tika comes with a number of parser classes for parsing + {{{formats.html}various document formats}}. You can also extend Tika + with your own parsers, and of course any contributions to Tika are + warmly welcome. + + The goal of Tika is to reuse existing parser libraries like + {{{http://www.pdfbox.org/}PDFBox}} or + {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most + of the parser classes in Tika are adapters to such external libraries. + + Tika also contains some general purpose parser implementations that are + not targeted at any specific document formats. The most notable of these + is the {{{apidocs/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}} + class that encapsulates all Tika functionality into a single parser that + can handle any type of document. This parser will automatically determine + the type of the incoming document based on various heuristics and will then + parse the document accordingly.
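Editorial sketch (not part of the commit above): the "XHTML SAX events" section of parser.apt describes the event stream a client receives and mentions that BodyContentHandler extracts just the body part. The following is a minimal illustration of that idea using only the JDK's SAX classes, not Tika itself. The names BodyTextSketch, BodyTextCollector and extractBodyText are hypothetical, the input is a hand-written XHTML string standing in for parser output, and Tika's real BodyContentHandler may behave differently.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class BodyTextSketch {

    /** Hypothetical handler: keeps only character data found inside <body>. */
    static class BodyTextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // > 0 while we are inside the body element

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(localName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Replays an XHTML string as SAX events and returns the body text. */
    static String extractBodyText(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is populated
        BodyTextCollector collector = new BodyTextCollector();
        factory.newSAXParser()
               .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Demo</title></head>"
                + "<body><p>Hello, Tika</p></body></html>";
        System.out.println(extractBodyText(xhtml)); // prints "Hello, Tika"
    }
}
```

With a real Tika parser, a handler of this kind would be passed as the second argument of the parse method instead of being fed from a string; the title text is skipped because it lies outside the body element.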
Added: tika/site/src/site/apt/1.4/parser_guide.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.4/parser_guide.apt?rev=1499614&view=auto ============================================================================== --- tika/site/src/site/apt/1.4/parser_guide.apt (added) +++ tika/site/src/site/apt/1.4/parser_guide.apt Thu Jul 4 01:40:44 2013 @@ -0,0 +1,135 @@ + -------------------------------------------- + Get Tika parsing up and running in 5 minutes + -------------------------------------------- + Arturo Beltran + -------------------------------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Get Tika parsing up and running in 5 minutes + + This page is a quick start guide showing how to add a new parser to Apache Tika. + By following the simple steps listed below, you can have your new parser running in only 5 minutes. + +%{toc|section=1|fromDepth=1} + +* {Getting Started} + + The {{{gettingstarted.html}Getting Started}} document describes how to + build Apache Tika from sources and how to start using Tika in an application. Pay close attention + and follow the instructions in the "Getting and building the sources" section. 
+ + +* {Add your MIME-Type} + + You first need to modify {{{http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}} + so that Tika can map the file extension to its MIME-Type. You should add something like this: + +--- + <mime-type type="application/hello"> + <glob pattern="*.hi"/> + </mime-type> +--- + +* {Create your Parser class} + + Now, you need to create your new parser. This is a class that must implement the Parser interface + offered by Tika. A very simple Tika Parser looks like this: + +--- +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + * @Author: Arturo Beltran + */ +package org.apache.tika.parser.hello; + +import java.io.IOException; +import java.io.InputStream; +import java.util.Collections; +import java.util.Set; + +import org.apache.tika.exception.TikaException; +import org.apache.tika.metadata.Metadata; +import org.apache.tika.mime.MediaType; +import org.apache.tika.parser.ParseContext; +import org.apache.tika.parser.Parser; +import org.apache.tika.sax.XHTMLContentHandler; +import org.xml.sax.ContentHandler; +import org.xml.sax.SAXException; + +public class HelloParser implements Parser { + + private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("hello")); + public static final String HELLO_MIME_TYPE = "application/hello"; + + public Set<MediaType> getSupportedTypes(ParseContext context) { + return SUPPORTED_TYPES; + } + + public void parse( + InputStream stream, ContentHandler handler, + Metadata metadata, ParseContext context) + throws IOException, SAXException, TikaException { + + metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE); + metadata.set("Hello", "World"); + + XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); + xhtml.startDocument(); + xhtml.endDocument(); + } + + /** + * @deprecated This method will be removed in Apache Tika 1.0. + */ + public void parse( + InputStream stream, ContentHandler handler, Metadata metadata) + throws IOException, SAXException, TikaException { + parse(stream, handler, metadata, new ParseContext()); + } +} +--- + + Pay special attention to the definition of the SUPPORTED_TYPES static class + field in the parser class that defines what MIME-Types it supports. + + The "parse" method is where you will do all your work: that is, extract + the information from the resource and then set the metadata. + +* {List the new parser} + + Finally, you should explicitly tell the AutoDetectParser to include your new + parser. 
This step is only needed if you want to use the AutoDetectParser functionality. + If you figure out the correct parser in a different way, it isn't needed. + + List your new parser in: + {{{http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}} + + Modified: tika/site/src/site/apt/download.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/download.apt?rev=1499614&r1=1499613&r2=1499614&view=diff ============================================================================== --- tika/site/src/site/apt/download.apt (original) +++ tika/site/src/site/apt/download.apt Thu Jul 4 01:40:44 2013 @@ -19,19 +19,19 @@ Download Apache Tika - Apache Tika 1.3 is now available. - See the {{{http://www.apache.org/dist/tika/CHANGES-1.3.txt}CHANGES.txt}} + Apache Tika 1.4 is now available. + See the {{{http://www.apache.org/dist/tika/CHANGES-1.4.txt}CHANGES.txt}} file for more information on the list of updates in this initial release. 
- * {{{http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.3-src.zip}apache-tika-1.3-src.zip}} - (source archive, {{{http://www.apache.org/dist/tika/apache-tika-1.3-src.zip.asc}PGP signature}})\ - SHA1: <<<a80e45d1976e655381d6e93b50b9c7b118e9d6fc>>>\ - MD5: <<<ce6cf28866e64201775261e0b558f84e>>> - - * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.3.jar}tika-app-1.3.jar}} - (runnable jar, {{{http://www.apache.org/dist/tika/tika-app-1.3.jar.asc}PGP signature}})\ - SHA1: <<<fb5786dfe4fa19a651c9f6d9417336127b34ddc2>>>\ - MD5: <<<783dd0f77b2b2fe39fe957657d3c5005>>> + * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-1.4-src.zip}apache-tika-1.4-src.zip}} + (source archive, {{{http://www.apache.org/dist/tika/tika-1.4-src.zip.asc}PGP signature}})\ + SHA1: <<<84ce9ebc104ca348a3cd8e95ec31a96169548c13>>>\ + MD5: <<<6daa446b1dfb08888169d558263416d7>>> + + * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.4.jar}tika-app-1.4.jar}} + (runnable jar, {{{http://www.apache.org/dist/tika/tika-app-1.4.jar.asc}PGP signature}})\ + SHA1: <<<e91c758149ce9ce799fff184e9bf3aabda394abc>>> + MD5: <<<53936b30a84a933389ea959a36dd963e>>> [] Modified: tika/site/src/site/apt/index.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/index.apt?rev=1499614&r1=1499613&r2=1499614&view=diff ============================================================================== --- tika/site/src/site/apt/index.apt (original) +++ tika/site/src/site/apt/index.apt Thu Jul 4 01:40:44 2013 @@ -23,7 +23,7 @@ Apache Tika - a content analysis toolkit structured text content from various documents using existing parser libraries. You can find the latest release on the {{{./download.html}download page}}. See the - {{{./1.2/gettingstarted.html}Getting Started}} guide for instructions on + {{{./1.4/gettingstarted.html}Getting Started}} guide for instructions on how to start using Tika. 
Tika is a project of the @@ -32,6 +32,12 @@ Apache Tika - a content analysis toolkit Latest News + [3 July 2013: Apache Tika Release] + Apache Tika 1.4 has been released! This release includes several important bugfixes + and new features. Please see the {{{http://www.apache.org/dist/tika/CHANGES-1.4.txt}CHANGES.txt}} + file for a full list of changes in this release, and have a look at the download + page for more information on how to obtain Apache Tika 1.4. + [22 January 2013: Apache Tika Release] Apache Tika 1.3 has been released! This release includes several important bugfixes and new features. Please see the {{{http://www.apache.org/dist/tika/CHANGES-1.3.txt}CHANGES.txt}} Modified: tika/site/src/site/site.xml URL: http://svn.apache.org/viewvc/tika/site/src/site/site.xml?rev=1499614&r1=1499613&r2=1499614&view=diff ============================================================================== --- tika/site/src/site/site.xml (original) +++ tika/site/src/site/site.xml Thu Jul 4 01:40:44 2013 @@ -39,7 +39,15 @@ <item name="Issue Tracker" href="https://issues.apache.org/jira/browse/TIKA"/> </menu> <menu name="Documentation"> - <item name="Apache Tika 1.3" href="1.3/index.html"> + <item name="Apache Tika 1.4" href="1.4/index.html"> + <item name="Getting Started" href="1.4/gettingstarted.html"/> + <item name="Supported Formats" href="1.4/formats.html"/> + <item name="Parser API" href="1.4/parser.html"/> + <item name="Parser 5min Quick Start Guide" href="1.4/parser_guide.html"/> + <item name="Content and Language Detection" href="1.4/detection.html"/> + <item name="API Documentation" href="1.4/api/"/> + </item> + <item name="Apache Tika 1.3" href="1.3/index.html" collapse="true"> <item name="Getting Started" href="1.3/gettingstarted.html"/> <item name="Supported Formats" href="1.3/formats.html"/> <item name="Parser API" href="1.3/parser.html"/> @@ -71,14 +79,6 @@ <item name="Content and Language Detection" href="1.0/detection.html"/> <item name="API Documentation" 
href="1.0/api/"/> </item> - <item name="Apache Tika 0.10" href="0.10/index.html" collapse="true"> - <item name="Getting Started" href="0.10/gettingstarted.html"/> - <item name="Supported Formats" href="0.10/formats.html"/> - <item name="Parser API" href="0.10/parser.html"/> - <item name="Parser 5min Quick Start Guide" href="0.10/parser_guide.html"/> - <item name="Content and Language Detection" href="0.10/detection.html"/> - <item name="API Documentation" href="0.10/api/"/> - </item> </menu> <menu name="The Apache Software Foundation"> <item name="About" href="http://www.apache.org/foundation/"/>
