Author: nick
Date: Tue May 5 06:17:55 2015
New Revision: 1677745
URL: http://svn.apache.org/r1677745
Log:
List parsers which are new in 1.9, along with fixing a few older entries
spotted at the same time
Modified:
tika/site/src/site/apt/1.9/formats.apt
Modified: tika/site/src/site/apt/1.9/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.9/formats.apt?rev=1677745&r1=1677744&r2=1677745&view=diff
==============================================================================
--- tika/site/src/site/apt/1.9/formats.apt (original)
+++ tika/site/src/site/apt/1.9/formats.apt Tue May 5 06:17:55 2015
@@ -59,6 +59,9 @@ Supported Document Formats
classes use {{{http://poi.apache.org/}Apache POI}} libraries to support
text and metadata extraction from both OLE2 and OOXML documents.
+ Old, pre-OLE2 Excel files (Excel 2, 3 and 4) are handled by the
+ {{{./api/org/apache/tika/parser/microsoft/OldExcelParser.html}OldExcelParser}}.
+
* {OpenDocument Format}
The OpenDocument format (ODF) is used most notably as the default format
@@ -99,11 +102,13 @@ Supported Document Formats
Tika uses the {{{http://commons.apache.org/compress/}Commons Compress}}
library to support various compression and packaging formats. The
+ {{{./api/org/apache/tika/parser/pkg/CompressorParser.html}CompressorParser}}
+ class handles parsing of the top level compression formats, then
{{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}}
- class and its subclasses first parse the top level compression or
- packaging format and then pass the unpacked document streams to a
- second parsing stage using the parser instance specified in the
- parse context. Formats supported include Tar, RAR, CPIO, Zip and 7Zip.
+ class and its subclasses parse the packaging formats and then pass the
+ unpacked document streams to a second parsing stage using the parser
+ instance specified in the parse context. Formats supported include Tar,
+ RAR, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200.
* {Text formats}
@@ -161,6 +166,8 @@ Supported Document Formats
extracts metadata from PSD images. The
{{{./api/org/apache/tika/parser/image/BPGParser.html}BPGParser}} class
extracts simple metadata from BPG (Better Portable Graphics) images.
+ The {{{./api/org/apache/tika/parser/image/WebPParser.html}WebPParser}}
+ class extracts simple metadata from the WebP image format.
When extracting from images, it is also possible to chain in Tesseract via the
{{{./api/org/apache/tika/parser/ocr/TesseractOCRParser.html}TesseractOCRParser}}
@@ -170,7 +177,7 @@ Supported Document Formats
Tika supports the Flash video format using a simple parsing algorithm
implemented in the
- {{{./api/org/apache/tika/parser/flv/FLVParser}FLVParser}} class.
+ {{{./api/org/apache/tika/parser/video/FLVParser}FLVParser}} class.
The MP4 family of video formats (MP4, Quicktime, 3GPP etc) is supported
by the {{{./api/org/apache/tika/parser/mp4/MP4Parser}MP4Parser}} class,
@@ -204,9 +211,13 @@ Supported Document Formats
process single email messages in the RFC 822 format used by many email clients
in their archives / exports.
- The {{{./api/org/apache/tika/parser/mbox/PSTParser.html}PSDParser}} can
+ The {{{./api/org/apache/tika/parser/mbox/OutlookPSTParser.html}OutlookPSTParser}} can
extract email messages from the Microsoft Outlook PST email format.
+ The {{{./api/org/apache/tika/parser/microsoft/TNEFParser.html}TNEFParser}} can
+ extract email attachments from the Microsoft TNEF (Transport Neutral Encoding
+ Format, aka Winmail.dat) used with some Microsoft email clients.
+
* {CAD formats}
The {{{./api/org/apache/tika/parser/dwg/DWGParser.html}DWGParser}} can
@@ -221,21 +232,33 @@ Supported Document Formats
* {Scientific formats}
+ The {{{./api/org/apache/tika/parser/dif/DIFParser.html}DIFParser}}
+ is able to extract attribute metadata from the GCMD Directory
+ Interchange Format (DIF) scientific file format.
+
+ The {{{./api/org/apache/tika/parser/gdal/GDALParser.html}GDALParser}}
+ is able to extract attribute metadata from the GDAL scientific file format.
+
+ The {{{./api/org/apache/tika/parser/geoinfo/GeographicInformationParser.html}GeographicInformationParser}}
+ is able to extract attribute metadata from the ISO-19139 geographic
+ information file format.
+
+ The {{{./api/org/apache/tika/parser/grib/GribParser.html}GribParser}}
+ is able to extract attribute metadata from the Grib scientific file format.
+
The {{{./api/org/apache/tika/parser/hdf/HDFParser.html}HDFParser}}
is able to extract attribute metadata from the HDF scientific file format.
+ The {{{./api/org/apache/tika/parser/isatab/ISArchiveParser.html}ISArchiveParser}}
+ is able to extract attribute metadata from the ISA-Tab (ISA Tools) family of
+ scientific file formats.
+
The {{{./api/org/apache/tika/parser/netcdf/NetCDFParser.html}NetCDFParser}}
is able to extract attribute metadata from the NetCDF scientific file format.
The {{{./api/org/apache/tika/parser/mat/MatParser.html}MatParser}}
is able to extract attribute metadata from the Matlab scientific file format.
- The {{{./api/org/apache/tika/parser/gdal/GDALParser.html}GDALParser}}
- is able to extract attribute metadata from the GDAL scientific file format.
-
- The {{{./api/org/apache/tika/parser/grib/GribParser.html}GribParser}}
- is able to extract attribute metadata from the Grib scientific file format.
-
* {Executable programs and libraries}
The {{{./api/org/apache/tika/parser/executable/ExecutableParser.html}ExecutableParser}} can