nick Mon, 04 May 2015 23:02:31 -0700

Author: nick
Date: Tue May  5 06:01:45 2015
New Revision: 1677744

URL: http://svn.apache.org/r1677744
Log:
Start the formats and examples pages for 1.9, and refer to the Tika in Action 
examples in summary


Added:
    tika/site/src/site/apt/1.9/
    tika/site/src/site/apt/1.9/examples.apt
    tika/site/src/site/apt/1.9/formats.apt

Added: tika/site/src/site/apt/1.9/examples.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/1.9/examples.apt?rev=1677744&view=auto
==============================================================================
--- tika/site/src/site/apt/1.9/examples.apt (added)
+++ tika/site/src/site/apt/1.9/examples.apt Tue May  5 06:01:45 2015
@@ -0,0 +1,148 @@
+                       -----------------------
+                       Tika API Usage Examples
+                       -----------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Apache Tika API Usage Examples
+
+   This page provides a number of examples on how to use the various
+   Tika APIs. All of the examples shown are also available in the
+   {{{https://svn.apache.org/repos/asf/tika/trunk/tika-example}Tika Example
+    module}} in SVN.
+
+%{toc|section=1|fromDepth=1}
+
+
+* {Parsing}
+
+   Tika provides a number of different ways to parse a file. These provide 
+   different levels of control, flexibility, and complexity.
+
+** {Parsing using the Tika Facade}
+
+   The {{{./apidocs/org/apache/tika/Tika.html}Tika facade}}
+   provides a number of quick and easy ways to have your content
+   parsed by Tika and to get back the resulting plain text.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ParsingExample.java|snippet=aj:..parseToStringExample()|show-gutter=false}
+
+** {Parsing using the Auto-Detect Parser}
+
+   For more control, you can call the
+   {{{./apidocs/org/apache/tika/parser/Parser.html}Tika Parsers}}
+   directly. Most likely, you'll want to start out using the 
+   {{{./apidocs/org/apache/tika/parser/AutoDetectParser.html}Auto-Detect Parser}},
+   which automatically figures out what kind of content you have, then calls the appropriate
+   parser for you.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ParsingExample.java|snippet=aj:..parseExample()|show-gutter=false}
+
+
+* {Picking different output formats}
+
+   With Tika, you can get the textual content of your files returned
+   in a number of different formats. These include plain text, HTML, XHTML,
+   and the XHTML of just one part of the file. The format is determined by the
+   {{{http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
+   you supply to the Parser.
+
+** {Parsing to Plain Text}
+
+   By using the
+   {{{./apidocs/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}},
+   you can request that Tika return only the content of the document's body as
+   a plain-text string.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToPlainText()|show-gutter=false}
+
+** {Parsing to XHTML}
+
+   By using the
+   {{{./apidocs/org/apache/tika/sax/ToXMLContentHandler.html}ToXMLContentHandler}},
+   you can get the XHTML content of the whole document as a string.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToHTML()|show-gutter=false}
+
+   If you just want the body of the XHTML document, without the header, you
+   can chain together a
+   {{{./apidocs/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
+   and a
+   {{{./apidocs/org/apache/tika/sax/ToXMLContentHandler.html}ToXMLContentHandler}}
+   as shown:
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseBodyToHTML()|show-gutter=false}
+
+** {Fetching just certain bits of the XHTML}
+
+   It is possible to execute XPath queries on the parse results, to fetch
+   only certain parts of the XHTML.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseOnePartToHTML()|show-gutter=false}
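+
+   As a rough illustration of the idea only, the sketch below selects parts of
+   an XHTML string with an XPath expression. It deliberately uses the JDK's own
+   javax.xml.xpath API on an already-produced XHTML string, not Tika's XPath
+   support, so it is not the parseOnePartToHTML() example referenced above.
+   The class name XhtmlXPathSketch, the method name select and the sample
+   markup are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: JDK-only stand-in, not part of Tika.
public class XhtmlXPathSketch {

    /** Returns the text content of every node matched by the XPath expression. */
    public static List<String> select(String xhtml, String expression) throws Exception {
        // Build a DOM tree from the XHTML string.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Evaluate the expression and ask for the full set of matching nodes.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Simplified sample markup without a namespace declaration.
        String xhtml = "<html><body>"
                + "<div class=\"a\">first</div>"
                + "<div class=\"b\">second</div>"
                + "</body></html>";
        System.out.println(select(xhtml, "//div[@class='b']"));
    }
}
```

+   This sketch builds a whole DOM tree in memory, which is simple but may not
+   suit very large documents.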
+
+
+* {Custom Content Handlers}
+
+   The textual output of parsing a file with Tika is returned via the SAX
+   {{{http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
+   you pass to the parse method. You can customise your parsing by supplying your
+   own ContentHandler that performs additional processing.
+
+** {Extract Phone Numbers from Content into the Metadata}
+
+   By using the
+   {{{./apidocs/org/apache/tika/sax/PhoneExtractingContentHandler.html}PhoneExtractingContentHandler}},
+   you can have any phone numbers found in the textual content of the document extracted and placed
+   into the Metadata object for you.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/GrabPhoneNumbersExample.java|snippet=aj:..process(..File)|show-gutter=false}
+
+** {Streaming the plain text in chunks}
+
+   Sometimes you want to split the resulting text into chunks, perhaps to
+   write it out as you go and minimise memory use, or to write it to HDFS
+   files. With a small custom content handler, you can do that.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToPlainTextChunks()|show-gutter=false}
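+
+   To make the chunking idea concrete, here is a minimal sketch of a SAX
+   handler that cuts character data into pieces of at most a fixed size. It is
+   not the parseToPlainTextChunks() example referenced above: it relies only on
+   the JDK's SAX classes and is driven by the JDK's XML parser, not a Tika
+   parser. The class name ChunkingHandlerSketch and the helper method chunk are
+   hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: JDK-only stand-in, not part of Tika.
public class ChunkingHandlerSketch extends DefaultHandler {
    private final int maxChunkSize;
    private final List<String> chunks = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    public ChunkingHandlerSketch(int maxChunkSize) {
        if (maxChunkSize <= 0) {
            throw new IllegalArgumentException("maxChunkSize must be positive");
        }
        this.maxChunkSize = maxChunkSize;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        int offset = start;
        int remaining = length;
        while (remaining > 0) {
            // Copy only as much as still fits into the current chunk.
            int room = maxChunkSize - current.length();
            int n = Math.min(room, remaining);
            current.append(ch, offset, n);
            offset += n;
            remaining -= n;
            if (current.length() == maxChunkSize) {
                flush();
            }
        }
    }

    @Override
    public void endDocument() {
        // Emit whatever is left over as the final, possibly shorter, chunk.
        flush();
    }

    private void flush() {
        if (current.length() > 0) {
            // A real handler could write the chunk to a file or stream here.
            chunks.add(current.toString());
            current.setLength(0);
        }
    }

    public List<String> getChunks() {
        return chunks;
    }

    /** Parses the XML string with the JDK SAX parser and returns the text chunks. */
    public static List<String> chunk(String xml, int maxChunkSize) throws Exception {
        ChunkingHandlerSketch handler = new ChunkingHandlerSketch(maxChunkSize);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getChunks();
    }
}
```

+   The sketch keeps the finished chunks in a list to stay short. To actually
+   limit memory use, flush() would hand each chunk to its destination instead
+   of storing it.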
+
+
+* {Translation}
+
+   Tika provides a pluggable Translation system, which allows you to send the results of
+   parsing off to an external system or program to have the text translated into another
+   language.
+
+** {Translation using the Microsoft Translation API}
+
+   In order to use the Microsoft Translation API, you need to sign up for a Microsoft account,
+   get an API key, then pass the key to Tika before translating.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/TranslatorExample.java|snippet=aj:..microsoftTranslateToFrench(..String)|show-gutter=false}
+
+
+* {Language Identification}
+
+   Tika provides support for identifying the language of text, through the
+   {{{./apidocs/org/apache/tika/language/LanguageIdentifier.html}LanguageIdentifier}} class.
+   
+%{include|source=src/examples-src/main/java/org/apache/tika/example/LanguageIdentifierExample.java|snippet=aj:..identifyLanguage(..String)|show-gutter=false}
+
+* {Additional Examples}
+
+   A number of other examples are also available, including all of the examples
+   from the {{{http://manning.com/mattmann/}Tika In Action book}}. These can all
+   be found in the
+   {{{https://svn.apache.org/repos/asf/tika/trunk/tika-example}Tika Example
+    module}} in SVN.

Added: tika/site/src/site/apt/1.9/formats.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/1.9/formats.apt?rev=1677744&view=auto
==============================================================================
--- tika/site/src/site/apt/1.9/formats.apt (added)
+++ tika/site/src/site/apt/1.9/formats.apt Tue May  5 06:01:45 2015
@@ -0,0 +1,254 @@
+                       --------------------------
+                       Supported Document Formats
+                       --------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Supported Document Formats
+
+   This page lists all the document formats supported by Apache Tika 1.9.
+   Follow the links to the various parser class javadocs for more detailed
+   information about each document format and how it is parsed by Tika.
+
+%{toc|section=1|fromDepth=1}
+
+* {HyperText Markup Language}
+
+   The HyperText Markup Language (HTML) is the lingua franca of the web.
+   Tika uses the {{{http://home.ccil.org/~cowan/XML/tagsoup/}TagSoup}}
+   library to support virtually any kind of HTML found on the web.
+   The output from the
+   {{{./api/org/apache/tika/parser/html/HtmlParser.html}HtmlParser}} class
+   is guaranteed to be well-formed and valid XHTML, and various heuristics
+   are used to prevent things like inline scripts from cluttering the
+   extracted text content.
+
+* {XML and derived formats}
+
+   The Extensible Markup Language (XML) format is a generic format that can
+   be used for all kinds of content. Tika has custom parsers for some widely
+   used XML vocabularies like XHTML, OOXML and ODF, but the default
+   {{{./api/org/apache/tika/parser/xml/DcXMLParser.html}DcXMLParser}}
+   class simply extracts the text content of the document and ignores any XML
+   structure. The only exceptions to this rule are Dublin Core metadata
+   elements, which are used for the document metadata.
+
+* {Microsoft Office document formats}
+
+   Microsoft Office and some related applications produce documents in the
+   generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The
+   older OLE 2 format was introduced in Microsoft Office version 97 and was
+   the default format until Office version 2007 and the new XML-based
+   OOXML format. The
+   {{{./api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}}
+   and
+   {{{./api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html}OOXMLParser}}
+   classes use {{{http://poi.apache.org/}Apache POI}} libraries to support
+   text and metadata extraction from both OLE2 and OOXML documents.
+
+* {OpenDocument Format}
+
+   The OpenDocument format (ODF) is used most notably as the default format
+   of the OpenOffice.org office suite. The
+   {{{./api/org/apache/tika/parser/odf/OpenDocumentParser.html}OpenDocumentParser}}
+   class supports this format and the earlier OpenOffice 1.0 format on which
+   ODF is based.
+
+* {iWork document formats}
+
+   The various iWork document formats (Numbers, Pages, Keynote) are supported
+   by the
+   {{{./api/org/apache/tika/parser/iwork/IWorkPackageParser.html}IWorkPackageParser}}
+   class, which extracts text and metadata.
+
+* {Portable Document Format}
+
+   The {{{./api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class
+   parses Portable Document Format (PDF) documents using the
+   {{{http://pdfbox.apache.org/}Apache PDFBox}} library.
+
+* {Electronic Publication Format}
+
+   The {{{./api/org/apache/tika/parser/epub/EpubParser.html}EpubParser}} class
+   supports the Electronic Publication Format (EPUB) used for many digital
+   books.
+
+   The {{{./api/org/apache/tika/parser/xml/FictionBookParser.html}FictionBookParser}} class
+   supports the XML-based Fiction Book publishing format.
+
+* {Rich Text Format}
+
+   The {{{./api/org/apache/tika/parser/rtf/RTFParser.html}RTFParser}} class
+   uses the standard javax.swing.text.rtf feature to extract text content
+   from Rich Text Format (RTF) documents.
+
+* {Compression and packaging formats}
+
+   Tika uses the {{{http://commons.apache.org/compress/}Commons Compress}}
+   library to support various compression and packaging formats. The
+   {{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}}
+   class and its subclasses first parse the top level compression or
+   packaging format and then pass the unpacked document streams to a
+   second parsing stage using the parser instance specified in the
+   parse context. Formats supported include Tar, RAR, CPIO, Zip and 7Zip.
+
+* {Text formats}
+
+   Extracting text content from plain text files seems like a simple task
+   until you start thinking of all the possible character encodings. The
+   {{{./api/org/apache/tika/parser/txt/TXTParser.html}TXTParser}} class uses
+   encoding detection code from the {{{http://site.icu-project.org/}ICU}}
+   project to automatically detect the character encoding of a text document.
+
+* {Feed and Syndication formats}
+
+   The {{{./api/org/apache/tika/parser/feed/FeedParser.html}FeedParser}} class
+   supports the RSS and Atom feed syndication formats.
+
+   The {{{./api/org/apache/tika/parser/iptc/IptcAnpaParser.html}IptcAnpaParser}} class
+   supports the IPTC ANPA News Wire feed format.
+
+* {Help formats}
+
+   The {{{./api/org/apache/tika/parser/chm/ChmParser.html}ChmParser}} class
+   supports the CHM Help format.
+
+* {Audio formats}
+
+   Tika can detect several common audio formats and extract metadata
+   from them. Even text extraction is supported for some audio files that
+   contain lyrics or other textual content. Extracted metadata includes
+   sampling rates, channels, format information, artists, titles etc. The
+   {{{./api/org/apache/tika/parser/audio/AudioParser.html}AudioParser}}
+   and {{{./api/org/apache/tika/parser/audio/MidiParser.html}MidiParser}}
+   classes use standard javax.sound features to process simple audio
+   formats. The
+   {{{./api/org/apache/tika/parser/mp3/Mp3Parser.html}Mp3Parser}} class
+   adds support for the widely used MP3 format, and the
+   {{{./api/org/apache/tika/parser/mp4/MP4Parser.html}MP4Parser}} class
+   provides it for MP4 audio. The Ogg family of audio formats (Vorbis,
+   Speex, Opus, Flac etc) are supported by the
+   {{{./api/org/gagravarr/tika/VorbisParser.html}VorbisParser}},
+   {{{./api/org/gagravarr/tika/OpusParser.html}OpusParser}},
+   {{{./api/org/gagravarr/tika/SpeexParser.html}SpeexParser}} and
+   {{{./api/org/gagravarr/tika/FlacParser.html}FlacParser}}
+   classes.
+
+* {Image formats}
+
+   The {{{./api/org/apache/tika/parser/image/ImageParser.html}ImageParser}}
+   class uses the standard javax.imageio feature to extract simple metadata
+   from image formats supported by the Java platform, such as PNG, GIF
+   and BMP. More complex image metadata is available through the
+   {{{./api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} and
+   {{{./api/org/apache/tika/parser/image/TiffParser.html}TiffParser}} classes,
+   which use the metadata-extractor library to support Exif metadata
+   extraction from JPEG and TIFF images. The
+   {{{./api/org/apache/tika/parser/image/PSDParser.html}PSDParser}} class
+   extracts metadata from PSD images. The
+   {{{./api/org/apache/tika/parser/image/BPGParser.html}BPGParser}} class
+   extracts simple metadata from BPG (Better Portable Graphics) images.
+
+   When extracting from images, it is also possible to chain in Tesseract via
+   the {{{./api/org/apache/tika/parser/ocr/TesseractOCRParser.html}TesseractOCRParser}}
+   to have OCR performed on the contents of the image.
+
+* {Video formats}
+
+   Tika supports the Flash video format using a simple parsing algorithm 
+   implemented in the
+   {{{./api/org/apache/tika/parser/flv/FLVParser.html}FLVParser}} class.
+
+   The MP4 family of video formats (MP4, QuickTime, 3GPP etc) is supported
+   by the {{{./api/org/apache/tika/parser/mp4/MP4Parser.html}MP4Parser}} class,
+   which extracts metadata on the video, along with the audio stream
+   (if present).
+
+   For the Ogg family of video formats, a limited amount of metadata is
+   extracted by the 
+   {{{./api/org/gagravarr/tika/OggParser.html}OggParser}} class.
+
+* {Java class files and archives}
+
+   The {{{./api/org/apache/tika/parser/asm/ClassParser.html}ClassParser}} class
+   extracts class names and method signatures from Java class files, and
+   the {{{./api/org/apache/tika/parser/pkg/ZipParser.html}ZipParser}} class
+   also supports jar archives.
+
+* {Source code}
+
+   The {{{./api/org/apache/tika/parser/code/SourceCodeParser.html}SourceCodeParser}} class
+   handles a number of source code formats, including Java, C, C++ and Groovy.
+   It provides a formatted form of the code, along with some simple metadata.
+
+* {Mail formats}
+
+   The {{{./api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can
+   extract email messages from the mbox format used by many email archives
+   and Unix-style mailboxes.
+
+   The {{{./api/org/apache/tika/parser/mail/RFC822Parser.html}RFC822Parser}} can
+   process single email messages in the RFC 822 format used by many email clients
+   in their archives / exports.
+
+   The {{{./api/org/apache/tika/parser/mbox/PSTParser.html}PSTParser}} can
+   extract email messages from the Microsoft Outlook PST email format.
+
+* {CAD formats}
+
+   The {{{./api/org/apache/tika/parser/dwg/DWGParser.html}DWGParser}} can
+   extract simple metadata from the DWG CAD format.
+
+* {Font formats}
+
+   The {{{./api/org/apache/tika/parser/font/TrueTypeParser.html}TrueTypeParser}}
+   class can extract simple metadata from the TrueType font format.
+   The {{{./api/org/apache/tika/parser/font/AdobeFontMetricParser.html}AdobeFontMetricParser}}
+   class does something similar for Adobe Font Metrics files.
+
+* {Scientific formats}
+
+   The {{{./api/org/apache/tika/parser/hdf/HDFParser.html}HDFParser}}
+   is able to extract attribute metadata from the HDF scientific file format.
+
+   The {{{./api/org/apache/tika/parser/netcdf/NetCDFParser.html}NetCDFParser}}
+   is able to extract attribute metadata from the NetCDF scientific file format.
+
+   The {{{./api/org/apache/tika/parser/mat/MatParser.html}MatParser}}
+   is able to extract attribute metadata from the Matlab scientific file format.
+
+   The {{{./api/org/apache/tika/parser/gdal/GDALParser.html}GDALParser}}
+   is able to extract attribute metadata from the GDAL scientific file format.
+
+   The {{{./api/org/apache/tika/parser/grib/GribParser.html}GribParser}}
+   is able to extract attribute metadata from the Grib scientific file format.
+
+* {Executable programs and libraries}
+
+   The {{{./api/org/apache/tika/parser/executable/ExecutableParser.html}ExecutableParser}} can
+   extract metadata information on platforms, architectures and types from a range
+   of executable formats and libraries, such as Windows Executables and Linux / BSD
+   programs and libraries.
+
+* {Crypto formats}
+
+   The {{{./api/org/apache/tika/parser/crypto/Pkcs7Parser.html}Pkcs7Parser}} is able to
+   parse the contents of PKCS7 signed messages, but doesn't include any information from
+   the outer PKCS7 wrapper.
+
+Full list of supported formats:
+
+   TODO Populate this at release time

svn commit: r1677744 - in /tika/site/src/site/apt/1.9: ./ examples.apt formats.apt