Author: nick
Date: Wed Mar 30 11:52:57 2011
New Revision: 1086912
URL: http://svn.apache.org/viewvc?rev=1086912&view=rev
Log:
TIKA-624 - Update supported formats for 0.8 and 0.9
Modified:
tika/site/src/site/apt/0.8/formats.apt
tika/site/src/site/apt/0.9/formats.apt
Modified: tika/site/src/site/apt/0.8/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.8/formats.apt?rev=1086912&r1=1086911&r2=1086912&view=diff
==============================================================================
--- tika/site/src/site/apt/0.8/formats.apt (original)
+++ tika/site/src/site/apt/0.8/formats.apt Wed Mar 30 11:52:57 2011
@@ -19,7 +19,7 @@
Supported Document Formats
- This page lists all the document formats supported by Apache Tika 0.6.
+ This page lists all the document formats supported by Apache Tika 0.8.
Follow the links to the various parser class javadocs for more detailed
information about each document format and how it is parsed by Tika.
@@ -46,6 +46,11 @@ Supported Document Formats
structure. The only exception to this rule are Dublin Core metadata
elements that are used for the document metadata.
+ Tika also includes
+ {{{api/org/apache/tika/parser/feed/FeedParser.html}FeedParser}} which
+ is able to extract metadata and content from XML based feeds such as
+ RSS and Atom.
+
* {Microsoft Office document formats}
Microsoft Office and some related applications produce documents in the
@@ -59,6 +64,10 @@ Supported Document Formats
classes use {{{http://poi.apache.org/}Apache POI}} libraries to support
text and metadata extraction from both OLE2 and OOXML documents.
+ In addition to office documents, the
+ {{{api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}}
+ is also able to extract text and metadata from Outlook .msg emails.
+
* {OpenDocument Format}
The OpenDocument format (ODF) is used most notably as the default format
@@ -67,6 +76,12 @@ Supported Document Formats
class supports this format and the earlier OpenOffice 1.0 format on which
ODF is based.
+* {Apple iWorks Formats}
+
+ The iWorks formats of Numbers, Pages and Keynote are used by Apple's iWork
+ office suite. The {{{api/org/apache/tika/parser/iwork/IWorkParser.html}IWorkParser}}
+ is able to extract text and metadata from these files.
+
* {Portable Document Format}
The {{{api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class
@@ -121,9 +136,10 @@ Supported Document Formats
class uses the standard javax.imageio feature to extract simple metadata
from image formats supported by the Java platform. More complex image
metadata is available through the
- {{{api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} class
+ {{{api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} and
+ {{{api/org/apache/tika/parser/tiff/TiffParser.html}TiffParser}} classes
that uses the metadata-extractor library to supports Exif metadata
- extraction from Jpeg images.
+ extraction from Jpeg and Tiff images.
* {Video formats}
@@ -143,3 +159,22 @@ Supported Document Formats
The {{{api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can
extract email messages from the mbox format used by many email archives
and Unix-style mailboxes.
+
+* {The DWG (AutoCAD) format}
+
+ The {{{api/org/apache/tika/parser/dwg/DWGParser.html}DWGParser}} can
+ extract metadata (but not textual contents) from the DWG format that
+ is used by AutoCAD.
+
+* {Font formats}
+
+ The {{{api/org/apache/tika/parser/font/TrueTypeParser.html}TrueTypeParser}}
+ can extract limited metadata from TrueType fonts.
+
+* {Scientific formats}
+
+ The {{{api/org/apache/tika/parser/hdf/HDFParser.html}HDFParser}}
+ is able to extract attribute metadata from the HDF scientific file format.
+
+ The {{{api/org/apache/tika/parser/netcdf/NetCDFParser.html}NetCDFParser}}
+ is able to extract attribute metadata from the NetCDF scientific file format.
Modified: tika/site/src/site/apt/0.9/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.9/formats.apt?rev=1086912&r1=1086911&r2=1086912&view=diff
==============================================================================
--- tika/site/src/site/apt/0.9/formats.apt (original)
+++ tika/site/src/site/apt/0.9/formats.apt Wed Mar 30 11:52:57 2011
@@ -19,7 +19,7 @@
Supported Document Formats
- This page lists all the document formats supported by Apache Tika 0.6.
+ This page lists all the document formats supported by Apache Tika 0.9.
Follow the links to the various parser class javadocs for more detailed
information about each document format and how it is parsed by Tika.
@@ -46,6 +46,11 @@ Supported Document Formats
structure. The only exception to this rule are Dublin Core metadata
elements that are used for the document metadata.
+ Tika also includes
+ {{{api/org/apache/tika/parser/feed/FeedParser.html}FeedParser}} which
+ is able to extract metadata and content from XML based feeds such as
+ RSS and Atom.
+
* {Microsoft Office document formats}
Microsoft Office and some related applications produce documents in the
@@ -67,6 +72,12 @@ Supported Document Formats
class supports this format and the earlier OpenOffice 1.0 format on which
ODF is based.
+* {Apple iWorks Formats}
+
+ The iWorks formats of Numbers, Pages and Keynote are used by Apple's iWork
+ office suite. The {{{api/org/apache/tika/parser/iwork/IWorkParser.html}IWorkParser}}
+ is able to extract text and metadata from these files.
+
* {Portable Document Format}
The {{{api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class
@@ -121,9 +132,10 @@ Supported Document Formats
class uses the standard javax.imageio feature to extract simple metadata
from image formats supported by the Java platform. More complex image
metadata is available through the
- {{{api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} class
+ {{{api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} and
+ {{{api/org/apache/tika/parser/tiff/TiffParser.html}TiffParser}} classes
that uses the metadata-extractor library to supports Exif metadata
- extraction from Jpeg images.
+ extraction from Jpeg and Tiff images.
* {Video formats}
@@ -138,8 +150,34 @@ Supported Document Formats
the {{{api/org/apache/tika/parser/pkg/ZipParser.html}ZipParser}} class
supports also jar archives.
-* {The mbox format}
+* {Mail formats}
The {{{api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can
extract email messages from the mbox format used by many email archives
and Unix-style mailboxes.
+
+ The {{{api/org/apache/tika/parser/rfc822/RFC822Parser.html}RFC822Parser}} can
+ extract email messages from the RFC822 format of email messages.
+
+ In addition to office documents, the
+ {{{api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}}
+ is also able to extract text and metadata from Outlook .msg emails.
+
+* {The DWG (AutoCAD) format}
+
+ The {{{api/org/apache/tika/parser/dwg/DWGParser.html}DWGParser}} can
+ extract metadata (but not textual contents) from the DWG format that
+ is used by AutoCAD.
+
+* {Font formats}
+
+ The {{{api/org/apache/tika/parser/font/TrueTypeParser.html}TrueTypeParser}}
+ can extract limited metadata from TrueType fonts.
+
+* {Scientific formats}
+
+ The {{{api/org/apache/tika/parser/hdf/HDFParser.html}HDFParser}}
+ is able to extract attribute metadata from the HDF scientific file format.
+
+ The {{{api/org/apache/tika/parser/netcdf/NetCDFParser.html}NetCDFParser}}
+ is able to extract attribute metadata from the NetCDF scientific file format.