svn commit: r1905121 [1/2] - in /tika/site: ./ src/site/ src/site/apt/ src/site/apt/2.6.0/ src/site/resources/

tallison Mon, 07 Nov 2022 03:40:56 -0800

Author: tallison
Date: Mon Nov  7 11:40:42 2022
New Revision: 1905121

URL: http://svn.apache.org/viewvc?rev=1905121&view=rev
Log:
Update website for 2.6.0 release


Added:
    tika/site/src/site/apt/2.6.0/
    tika/site/src/site/apt/2.6.0/configuring.apt
    tika/site/src/site/apt/2.6.0/detection.apt
    tika/site/src/site/apt/2.6.0/examples.apt
    tika/site/src/site/apt/2.6.0/formats.apt
    tika/site/src/site/apt/2.6.0/gettingstarted.apt
    tika/site/src/site/apt/2.6.0/index.apt
    tika/site/src/site/apt/2.6.0/parser.apt
    tika/site/src/site/apt/2.6.0/parser_guide.apt
Modified:
    tika/site/pom.xml
    tika/site/src/site/apt/index.apt.vm
    tika/site/src/site/resources/doap.rdf
    tika/site/src/site/site.xml

Modified: tika/site/pom.xml
URL: http://svn.apache.org/viewvc/tika/site/pom.xml?rev=1905121&r1=1905120&r2=1905121&view=diff
==============================================================================
--- tika/site/pom.xml (original)
+++ tika/site/pom.xml Mon Nov  7 11:40:42 2022
@@ -28,7 +28,7 @@
   <parent>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-parent</artifactId>
-    <version>2.5.0</version>
+    <version>2.6.0</version>
   </parent>
 
   <artifactId>tika-site</artifactId>

Added: tika/site/src/site/apt/2.6.0/configuring.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/configuring.apt?rev=1905121&view=auto
==============================================================================
--- tika/site/src/site/apt/2.6.0/configuring.apt (added)
+++ tika/site/src/site/apt/2.6.0/configuring.apt Mon Nov  7 11:40:42 2022
@@ -0,0 +1,223 @@
+                          ----------------
+                          Configuring Tika
+                          ----------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Configuring Tika
+
+   Out of the box, Apache Tika will attempt to start with all available
+   Detectors and Parsers, running with sensible defaults. For most users,
+   this default configuration will work well.
+
+   This page gives you information on how to configure the various
+   components of Apache Tika, such as Parsers and Detectors, if you need
+   fine-grained control over ordering, exclusions and the like.
+
+%{toc|section=1|fromDepth=1}
+
+* {Configuring Parsers}
+
+    Through the Tika Config xml, it is possible to have a high degree of control
+    over which parsers are or aren't used, in what order of preferences etc. It
+    is also possible to override just certain parts, to (for example) have "default
+    except for PDF".
+
+    Currently, it is only possible to have a single parser run against a document.
+    There is on-going discussion around fallback parsers and combining the output
+    of multiple parsers running on a document, but none of these are available yet.
+
+    To override certain default parser behaviours, include the <<< DefaultParser >>>
+    in your configuration, with excludes, then add other parser definitions in.
+    To prevent the <<< DefaultParser >>> (with its auto-discovery) being used,
+    simply omit it from your config, and list all other parsers you want instead.
+
+    To override just some default behaviour, you can use a Tika Config something
+    like this:
+
+---
+<?xml version="1.0" encoding="UTF-8"?>
+<properties>
+  <parsers>
+    <!-- Default Parser for most things, except for 2 mime types, and never
+         use the Executable Parser -->
+    <parser class="org.apache.tika.parser.DefaultParser">
+      <mime-exclude>image/jpeg</mime-exclude>
+      <mime-exclude>application/pdf</mime-exclude>
+      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
+    </parser>
+    <!-- Use a different parser for PDF -->
+    <parser class="org.apache.tika.parser.EmptyParser">
+      <mime>application/pdf</mime>
+    </parser>
+  </parsers>
+</properties>
+---
+
+    To configure things in code, the key classes to use to build up your own custom
+    parser hierarchy are
+    {{{./api/org/apache/tika/parser/DefaultParser.html}org.apache.tika.parser.DefaultParser}},
+    {{{./api/org/apache/tika/parser/CompositeParser.html}org.apache.tika.parser.CompositeParser}}
+    and
+    {{{./api/org/apache/tika/parser/ParserDecorator.html}org.apache.tika.parser.ParserDecorator}}.
+
+* {Configuring Detectors}
+
+    Through the Tika Config xml, it is possible to have a high degree of control
+    over which detectors are or aren't used, in what order of preferences etc. It
+    is also possible to override just certain parts, to (for example) have "default,
+    but without POIFS Container Detection".
+
+    To override certain default detector behaviours, include the
+    <<< DefaultDetector >>>, with any <<< detector-exclude >>> entries you need,
+    in your configuration, then add other detector definitions in. To prevent
+    the <<< DefaultDetector >>> (with its auto-discovery) being used, simply omit it
+    from your config, and list all other detectors you want instead.
+
+    To override just some default behaviour, you can use a Tika Config something
+    like this:
+
+---
+<?xml version="1.0" encoding="UTF-8"?>
+<properties>
+  <detectors>
+    <!-- All detectors except built-in container ones -->
+    <detector class="org.apache.tika.detect.DefaultDetector">
+      <detector-exclude class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
+      <detector-exclude class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
+    </detector>
+  </detectors>
+</properties>
+---
+
+    Or, to use only certain detectors, you can use a Tika Config something
+    like this:
+
+---
+<?xml version="1.0" encoding="UTF-8"?>
+<properties>
+  <detectors>
+    <!-- Only use these two detectors, and ignore all others -->
+    <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
+    <detector class="org.apache.tika.mime.MimeTypes"/>
+  </detectors>
+</properties>
+---
+
+    In code, the key classes to use to build up your own custom detector
+    hierarchy are
+    {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}}
+    and
+    {{{./api/org/apache/tika/detect/CompositeDetector.html}org.apache.tika.detect.CompositeDetector}}.
+
+* {Configuring Mime Types}
+
+    TODO Mention non-standard paths, and custom mime type files
+
+* {Configuring Language Identifiers}
+
+    At this time, there is no unified way to configure language identifiers.
+    While the work on that is ongoing, for now you will need to review the
+    {{{./api/}Tika Javadocs}} to see how individual identifiers are configured.
+
+* {Configuring Translators}
+
+    At this time, there is no unified way to configure Translators.
+    While the work on that is ongoing, for now you will need to review the
+    {{{./api/}Tika Javadocs}} to see how individual Translators are configured.
+    
+~~ When Translators can have their parameters configured, mention here about
+~~ specifying which single one to use in the Tika Config XML
+
+* {Configuring the Service Loader}
+
+    Tika has a number of service provider types such as parsers, detectors, and translators.
+    The {{{./api/org/apache/tika/config/ServiceLoader.html}org.apache.tika.config.ServiceLoader}}
+    class provides a registry of each type of provider.  This allows Tika to create
+    implementations such as
+    {{{./api/org/apache/tika/parser/DefaultParser.html}org.apache.tika.parser.DefaultParser}},
+    {{{./api/org/apache/tika/language/translate/DefaultTranslator.html}org.apache.tika.language.translate.DefaultTranslator}}, and
+    {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}}
+    that can match the appropriate provider to an incoming piece of content.
+
+    The ServiceLoader's registry can be populated either statically or dynamically.
+    
+** Static
+
+    Static loading is the default and requires no configuration.  This configuration option is used in
+    Tika deployments where the Tika JAR files reside together in the same classloader hierarchy.  The service
+    providers are loaded from provider configuration files located within the tika-parsers JAR file at META-INF/services.
+    
+** Dynamic
+
+    Dynamic loading may be required if the Tika service providers will reside in different classloaders, such as
+    in OSGi.  To allow a provider created in tika-config.xml to utilize dynamically loaded services, you need to
+    configure the ServiceLoader to be dynamic with the following configuration:
+    
+---
+<properties>
+  <service-loader dynamic="true"/>
+  ....
+</properties>
+---
+
+** Load Error Handling
+
+    The ServiceLoader can contain a handler to deal with errors that occur during provider initialization.  For example,
+    if a class fails to initialize, the LoadErrorHandler deals with the exception that is thrown.
+    This handler can be configured to:
+    
+    * <<< IGNORE >>> - (Default) Do nothing when providers fail to initialize.
+
+    * <<< WARN   >>> - Log a warning when providers fail to initialize.
+
+    * <<< THROW  >>> - Throw an exception when providers fail to initialize.
+
+    []
+
+    For example, to set the LoadErrorHandler to WARN, use the following configuration:
+
+---
+<properties>
+  <service-loader loadErrorHandler="WARN"/>
+  ....
+</properties>
+---
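+
+    The three policies can be pictured with a small, standalone sketch. This is
+    not Tika's LoadErrorHandler code: the class <<<LoadErrorSketch>>>, its
+    <<<Policy>>> enum and the <<<loadAll>>> method are hypothetical names, and the
+    sketch assumes that provider failures surface as runtime exceptions.
+

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names come from Tika.
public class LoadErrorSketch {

    public enum Policy { IGNORE, WARN, THROW }

    // Tries every factory in turn and applies the chosen policy to failures.
    // Warnings are collected in a list rather than sent to a real logger.
    public static <T> List<T> loadAll(List<Supplier<T>> factories,
                                      Policy policy,
                                      List<String> warnings) {
        List<T> loaded = new ArrayList<>();
        for (Supplier<T> factory : factories) {
            try {
                loaded.add(factory.get());
            } catch (RuntimeException e) {
                switch (policy) {
                    case IGNORE:
                        break; // silently skip the broken provider
                    case WARN:
                        warnings.add("Could not load provider: " + e.getMessage());
                        break;
                    case THROW:
                        throw e; // abort loading on the first failure
                }
            }
        }
        return loaded;
    }
}
```

+
+    With <<<IGNORE>>> or <<<WARN>>> the remaining providers still load; with
+    <<<THROW>>> the first failure stops the whole load.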
+
+* {Using a Tika Configuration XML file}
+
+    However you call Tika, the System Property of <<< tika.config >>> is
+    checked first, and the Environment Variable of <<< TIKA_CONFIG >>> is
+    tried next. Setting one of those will cause Tika to use your given
+    Tika Config XML file.
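+
+    The lookup order described above can be sketched as follows. This is an
+    illustration using only the JDK, not Tika's own code: the class
+    <<<TikaConfigLocation>>> and its <<<resolve>>> method are hypothetical, and
+    only the names <<<tika.config>>> and <<<TIKA_CONFIG>>> come from this page.
+

```java
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper; it only mirrors the documented lookup order.
public class TikaConfigLocation {

    // System property "tika.config" wins; the "TIKA_CONFIG" environment
    // variable is consulted next. An empty result means neither is set.
    public static Optional<String> resolve(Properties systemProps,
                                           Map<String, String> env) {
        String fromProperty = systemProps.getProperty("tika.config");
        if (fromProperty != null) {
            return Optional.of(fromProperty);
        }
        return Optional.ofNullable(env.get("TIKA_CONFIG"));
    }

    public static void main(String[] args) {
        Optional<String> location =
                resolve(System.getProperties(), System.getenv());
        System.out.println(location.orElse("(no override set)"));
    }
}
```

+
+    Passing the properties and environment in as arguments keeps the lookup
+    order easy to exercise without changing the real process environment.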
+
+    If you are calling Tika from your own code, then you can pass in the
+    location of your Tika Config XML file when you construct your 
+    <<<TikaConfig>>> instance. From that, you can fetch your configured
+    parser, detectors etc.
+
+---
+TikaConfig config = new TikaConfig("/path/to/tika-config.xml");
+Detector detector = config.getDetector();
+Parser autoDetectParser = new AutoDetectParser(config);
+---
+
+    For users of the Tika App, in addition to the system property and the
+    environment variable, you can also use the
+    <<< --config=[tika-config.xml] >>> option to select a different
+    Tika Config XML file to use.
+
+    For users of the Tika Server, in addition to the system property and the
+    environment variable, you can also use the <<< -c [tika-config.xml] >>> or
+    <<< --config [tika-config.xml] >>> options to select a different
+    Tika Config XML file to use.

Added: tika/site/src/site/apt/2.6.0/detection.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/detection.apt?rev=1905121&view=auto
==============================================================================
--- tika/site/src/site/apt/2.6.0/detection.apt (added)
+++ tika/site/src/site/apt/2.6.0/detection.apt Mon Nov  7 11:40:42 2022
@@ -0,0 +1,223 @@
+                          -----------------
+                          Content Detection
+                          -----------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Content Detection
+
+   This page gives you information on how content and language detection
+   works with Apache Tika, and how to tune the behaviour of Tika.
+
+%{toc|section=1|fromDepth=1}
+
+* {The Detector Interface}
+
+  The
+  {{{./api/org/apache/tika/detect/Detector.html}org.apache.tika.detect.Detector}}
+  interface is the basis for most of the content type detection in Apache
+  Tika. All the different ways of detecting content all implement the
+  same common method:
+
+---
+MediaType detect(java.io.InputStream input,
+                 Metadata metadata) throws java.io.IOException
+---
+
+   The <<<detect>>> method takes the stream to inspect, and a 
+   <<<Metadata>>> object that holds any additional information on
+   the content. The detector will return a 
+   {{{./api/org/apache/tika/mime/MediaType.html}MediaType}} object describing
+   its best guess as to the type of the file.
+
+   In general, three keys on the Metadata object are used by Detectors.
+   These are <<<TikaCoreProperties.RESOURCE_NAME_KEY>>>, which should hold the name
+   of the file (where known); <<<Metadata.CONTENT_TYPE>>>, which should
+   hold the advertised content type of the file (eg from a webserver or
+   a content repository); and
+   <<<TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE>>>, which users may set to
+   override automatic detection.
+
+
+* {Mime Magic Detection}
+
+  By looking for special ("magic") patterns of bytes near the start of
+  the file, it is often possible to detect the type of the file. For
+  some file types, this is a simple process. For others, typically
+  container based formats, the magic detection may not be enough. (More
+  detail on detecting container formats below)
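+
+  As a rough illustration of the idea (not Tika's MagicDetector, which is
+  driven by the mime types file rather than a hard-coded table), the sketch
+  below compares the first bytes of a stream against three well-known
+  signatures. The class <<<MagicSketch>>> and its table are hypothetical, and
+  it assumes Java 9 or later for <<<readNBytes>>>.
+

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of prefix-based magic matching.
public class MagicSketch {

    public static final String UNKNOWN = "application/octet-stream";

    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/zip", new byte[] {'P', 'K', 3, 4});
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
    }

    // Peeks at the first bytes, then rewinds so a parser can read from the start.
    public static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[8];
        in.mark(head.length);
        int read = in.readNBytes(head, 0, head.length);
        in.reset();
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            byte[] magic = entry.getValue();
            if (read >= magic.length
                    && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                return entry.getKey();
            }
        }
        return UNKNOWN;
    }
}
```

+
+  Note that a Zip-based office document matches only the generic
+  <<<application/zip>>> entry here, which is the limitation of magic detection
+  for container formats mentioned above.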
+
+  Tika is able to make use of a mime magic info file, in the
+  {{{http://www.freedesktop.org/standards/shared-mime-info}Freedesktop MIME-info}}
+  format to perform mime magic detection. (Note that Tika supports a few
+  more match types than Freedesktop does)
+
+  This is provided within Tika by
+  {{{./api/org/apache/tika/detect/MagicDetector.html}org.apache.tika.detect.MagicDetector}}.
+  It is most commonly accessed via
+  {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}},
+  normally sourced from the <<<tika-mimetypes.xml>>> and <<<custom-mimetypes.xml>>>
+  files. For more information on defining your own custom mimetypes, see
+  {{{./parser_guide.html#Add_your_MIME-Type}the new parser guide}}.
+   
+
+* {Resource Name Based Detection}
+
+  Where the name of the file is known, it is sometimes possible to guess 
+  the file type from the name or extension. Within the 
+  <<<tika-mimetypes.xml>>> file is a list of patterns which are used to
+  identify the type from the filename.
+
+  However, because files may be renamed, this method of detection is quick
+  but not always accurate.
+
+  This is provided within Tika by
+  
{{{./api/org/apache/tika/detect/NameDetector.html}org.apache.tika.detect.NameDetector}}.
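+
+  A minimal sketch of the idea, assuming a tiny hard-coded extension table
+  instead of the patterns in <<<tika-mimetypes.xml>>>. The class
+  <<<NameSketch>>> is hypothetical and is not how NameDetector is implemented.
+

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of guessing a type from the file name alone.
public class NameSketch {

    public static final String UNKNOWN = "application/octet-stream";

    private static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    static {
        BY_EXTENSION.put(".pdf", "application/pdf");
        BY_EXTENSION.put(".csv", "text/csv");
        BY_EXTENSION.put(".txt", "text/plain");
    }

    // Case-insensitive suffix match; returns UNKNOWN when nothing matches.
    public static String detect(String resourceName) {
        if (resourceName == null) {
            return UNKNOWN;
        }
        String lower = resourceName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : BY_EXTENSION.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return UNKNOWN;
    }
}
```

+
+  The lookup never opens the file, which is why it is fast, and also why a
+  renamed file yields the wrong answer.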
+
+
+* {Known Content Type "Detection"}
+
+  Sometimes, the mime type for a file is already known, such as when
+  downloading from a webserver, or when retrieving from a content store.
+  This information can be used by detectors, such as
+  {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}.
+
+
+* {The default Mime Types Detector}
+
+  By default, the mime type detection in Tika is provided by
+  {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}.
+  This detector makes use of <<<tika-mimetypes.xml>>> to power
+  magic based and filename based detection.
+
+  Firstly, magic based detection is used on the start of the file.
+  If the file is an XML file, then the start of the XML is processed
+  to look for root elements. Next, if available, the filename 
+  (from <<<TikaCoreProperties.RESOURCE_NAME_KEY>>>) is
+  then used to improve the detail of the detection, such as when magic
+  detects a text file, and the filename hints it's really a CSV. Finally,
+  if available, the supplied content type (from <<<Metadata.CONTENT_TYPE>>>)
+  is used to further refine the type.
+
+
+* {Container Aware Detection}
+
+  Several common file formats are actually held within a common container
+  format. One example is the PowerPoint .ppt and Word .doc formats, which
+  are both held within an OLE2 container. Another is Apple iWork formats,
+  which are actually a series of XML files within a Zip file.
+
+  Using magic detection, it is easy to spot that a given file is an OLE2
+  document, or a Zip file. Using magic detection alone, it is very difficult
+  (and often impossible) to tell what kind of file lives inside the container.
+
+  For some use cases, speed is important, so having a quick way to know the
+  container type is sufficient. For other cases however, you don't mind 
+  spending a bit of time (and memory!) processing the container to get a 
+  more accurate answer on its contents. For these cases, the additional
+  container aware detectors contained in the <<<Tika Parsers>>> jar should
+  be used.
+
+  Tika provides a wrapping detector in the form of 
+  {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}}.
+  This uses the service loader to discover all available detectors, including
+  any available container aware ones, and tries them in turn. For container
+  aware detection, include the <<<Tika Parsers>>> jar and its dependencies
+  in your project, then use DefaultDetector along with a <<<TikaInputStream>>>.
+
+  Because these container detectors need to read the whole file to open and
+  inspect the container, they must be used with a
+  {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}.
+  If called with a regular <<<InputStream>>>, then all work will be done
+  by the default Mime Magic detection only.
+
+  For more information on container formats and Tika, see
+  {{{http://wiki.apache.org/tika/MetadataDiscussion}}}
+
+
+* {The default Tika Detector}
+
+  Just as with Parsers, Tika provides a special detector
+  {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}}
+  which auto-detects (based on service files) the available detectors at 
+  runtime, and tries these in turn to identify the file type.
+
+  If only <<<Tika Core>>> is available, the Default Detector will work only
+  with Mime Magic and Resource Name detection. However, if <<<Tika Parsers>>>
+  (and its dependencies!) are available, additional detectors which know about
+  containers (such as zip and ole2) will be used as appropriate, provided that
+  detection is being performed with a
+  {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}.
+  Custom detectors can also be used as desired; they simply need to be listed
+  in a service file much as is done for
+  {{{./parser_guide.html#List_the_new_parser}custom parsers}}.
+
+
+* {Ways of triggering Detection}
+
+  The simplest way to detect is through the
+  {{{./api/org/apache/tika/Tika.html}Tika Facade class}}, which provides methods to
+  detect based on
+  {{{./api/org/apache/tika/Tika.html##detect(java.io.File)}File}},
+  {{{./api/org/apache/tika/Tika.html##detect(java.io.InputStream)}InputStream}},
+  {{{./api/org/apache/tika/Tika.html##detect(java.io.InputStream, java.lang.String)}InputStream and Filename}},
+  {{{./api/org/apache/tika/Tika.html##detect(java.lang.String)}Filename}} or a few others.
+  It works best with a File or
+  {{{./api/org/apache/tika/io/TikaInputStream.html}TikaInputStream}}.
+
+  Alternately, detection can be performed on a specific Detector, or using
+  <<<DefaultDetector>>> to have all available Detectors used. A typical pattern
+  would be something like:
+
+---
+TikaConfig tika = new TikaConfig();
+
+for (File f : myListOfFiles) {
+   Metadata metadata = new Metadata();
+   //TikaInputStream sets the TikaCoreProperties.RESOURCE_NAME_KEY
+   //when initialized with a file or path
+   MediaType mimetype = tika.getDetector().detect(
+      TikaInputStream.get(f, metadata), metadata);
+   System.out.println("File " + f + " is " + mimetype);
+}
+for (InputStream is : myListOfStreams) {
+   Metadata metadata = new Metadata();
+   //if you know the file name, it is a good idea to
+   //set it in the metadata, e.g.
+   //metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "somefile.pdf");
+   MediaType mimetype = tika.getDetector().detect(
+        TikaInputStream.get(is), metadata);
+   System.out.println("Stream " + is + " is " + mimetype);
+}
+---
+
+* {Language Detection}
+
+  Tika is able to help identify the language of a piece of text, which
+  is useful when extracting text from document formats which do not include
+  language information in their metadata.
+
+  The language detection is provided by extensions of the
+  {{{./api/org/apache/tika/language/detect/LanguageDetector.html}org.apache.tika.language.detect.LanguageDetector}}.
+  This provides choice for developers looking to compare and contrast differing
+  language detection implementations.
+
+  Some Java code examples of language detection can be found at
+  {{{https://github.com/apache/tika/blob/main/tika-example/src/main/java/org/apache/tika/example/LanguageDetectorExample.java}LanguageDetectorExample.java}},
+  {{{https://github.com/apache/tika/blob/main/tika-example/src/main/java/org/apache/tika/example/LanguageDetectingParser.java}LanguageDetectingParser.java}}
+  and
+  {{{https://github.com/apache/tika/blob/main/tika-example/src/main/java/org/apache/tika/example/Language.java}Language.java}}.
+
+* {More Examples}
+
+  For more examples of Detection using Apache Tika, please take a look at
+  the {{{./examples.html}Tika Examples page}}.

Added: tika/site/src/site/apt/2.6.0/examples.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/examples.apt?rev=1905121&view=auto
==============================================================================
--- tika/site/src/site/apt/2.6.0/examples.apt (added)
+++ tika/site/src/site/apt/2.6.0/examples.apt Mon Nov  7 11:40:42 2022
@@ -0,0 +1,148 @@
+                       -----------------------
+                       Tika API Usage Examples
+                       -----------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Apache Tika API Usage Examples
+
+   This page provides a number of examples on how to use the various
+   Tika APIs. All of the examples shown are also available in the
+   {{{https://svn.apache.org/repos/asf/tika/trunk/tika-example}Tika Example
+    module}} in SVN.
+
+%{toc|section=1|fromDepth=1}
+
+
+* {Parsing}
+
+   Tika provides a number of different ways to parse a file. These provide 
+   different levels of control, flexibility, and complexity.
+
+** {Parsing using the Tika Facade}
+
+   The {{{./api/org/apache/tika/Tika.html}Tika facade}}
+   provides a number of very quick and easy ways to have your content
+   parsed by Tika, and return the resulting plain text.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ParsingExample.java|snippet=aj:..parseToStringExample()|show-gutter=false}
+
+** {Parsing using the Auto-Detect Parser}
+
+   For more control, you can call the
+   {{{./api/org/apache/tika/parser/Parser.html}Tika Parsers}}
+   directly. Most likely, you'll want to start out using the
+   {{{./api/org/apache/tika/parser/AutoDetectParser.html}Auto-Detect Parser}},
+   which automatically figures out what kind of content you have, then calls the appropriate
+   parser for you.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ParsingExample.java|snippet=aj:..parseExample()|show-gutter=false}
+
+
+* {Picking different output formats}
+
+   With Tika, you can get the textual content of your files returned
+   in a number of different formats. These can be plain text, html, xhtml,
+   xhtml of one part of the file etc. This is controlled based on the
+   {{{http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
+   you supply to the Parser.
+
+** {Parsing to Plain Text}
+
+   By using the 
+   {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}},
+   you can request that Tika return only the content of the document's body as
+   a plain-text string.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToPlainText()|show-gutter=false}
+
+** {Parsing to XHTML}
+
+   By using the 
+   {{{./api/org/apache/tika/sax/ToXMLContentHandler.html}ToXMLContentHandler}},
+   you can get the XHTML content of the whole document as a string.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToHTML()|show-gutter=false}
+
+   If you just want the body of the xhtml document, without the header, you
+   can chain together a
+   {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
+   and a
+   {{{./api/org/apache/tika/sax/ToXMLContentHandler.html}ToXMLContentHandler}}
+   as shown:
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseBodyToHTML()|show-gutter=false}
+
+** {Fetching just certain bits of the XHTML}
+
+   It is possible to execute XPath queries on the parse results, to fetch
+   only certain bits of the XHTML.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseOnePartToHTML()|show-gutter=false}
+
+
+* {Custom Content Handlers}
+
+   The textual output of parsing a file with Tika is returned via the SAX
+   {{{http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
+   you pass to the parse method. It is possible to customise your parsing by supplying your
+   own ContentHandler which does special things.
+
+** {Extract Phone Numbers from Content into the Metadata}
+
+   By using the
+   {{{./api/org/apache/tika/sax/PhoneExtractingContentHandler.html}PhoneExtractingContentHandler}},
+   you can have any phone numbers found in the textual content of the document extracted and placed
+   into the Metadata object for you.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/GrabPhoneNumbersExample.java|snippet=aj:..process(..File)|show-gutter=false}
+
+** {Streaming the plain text in chunks}
+
+   Sometimes, you want to chunk the resulting text up, perhaps to output
+   as you go to minimise memory use, perhaps to output to HDFS files, or
+   for any other reason. With a small custom content handler, you can do that.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToPlainTextChunks()|show-gutter=false}
+
+
+* {Translation}
+
+   Tika provides a pluggable Translation system, which allows you to send the results of
+   parsing off to an external system or program to have the text translated into another
+   language.
+
+** {Translation using the Microsoft Translation API}
+
+   In order to use the Microsoft Translation API, you need to sign up for a Microsoft account,
+   get an API key, then pass the key to Tika before translating.
+
+%{include|source=src/examples-src/main/java/org/apache/tika/example/TranslatorExample.java|snippet=aj:..microsoftTranslateToFrench(..String)|show-gutter=false}
+
+
+* {Language Identification}
+
+   Tika provides support for identifying the language of text, through the
+   {{{./api/org/apache/tika/language/LanguageIdentifier.html}LanguageIdentifier}} class.
+   
+%{include|source=src/examples-src/main/java/org/apache/tika/example/LanguageIdentifierExample.java|snippet=aj:..identifyLanguage(..String)|show-gutter=false}
+
+* {Additional Examples}
+
+   A number of other examples are also available, including all of the examples
+   from the {{{http://manning.com/mattmann/}Tika In Action book}}. These can all
+   be found in the
+   {{{https://svn.apache.org/repos/asf/tika/trunk/tika-example}Tika Example
+    module}} in SVN.

Added: tika/site/src/site/apt/2.6.0/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/formats.apt?rev=1905121&view=auto
==============================================================================
--- tika/site/src/site/apt/2.6.0/formats.apt (added)
+++ tika/site/src/site/apt/2.6.0/formats.apt Mon Nov  7 11:40:42 2022
@@ -0,0 +1,1066 @@
+                       --------------------------
+                       Supported Document Formats
+                       --------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Supported Document Formats
+
+   This page lists all the document formats supported by the parsers in
+   Apache Tika 2.6.0. Follow the links to the various parser class javadocs
+   for more detailed information about each document format and how it is 
+   parsed by Tika.
+
+   <<Please note>> that Apache Tika is able to detect a much wider range of
+   formats than those listed below; this page only documents those formats
+   from which Tika is able to extract metadata and/or textual content.
+
+%{toc|fromDepth=1}
+
+* {HyperText Markup Language}
+
+   The HyperText Markup Language (HTML) is the lingua franca of the web.
+   Tika uses the {{{http://home.ccil.org/~cowan/XML/tagsoup/}TagSoup}}
+   library to support virtually any kind of HTML found on the web.
+   The output from the
+   {{{./api/org/apache/tika/parser/html/HtmlParser.html}HtmlParser}} class
+   is guaranteed to be well-formed and valid XHTML, and various heuristics
+   are used to prevent things like inline scripts from cluttering the
+   extracted text content.
+
+* {XML and derived formats}
+
+   The Extensible Markup Language (XML) format is a generic format that can
+   be used for all kinds of content. Tika has custom parsers for some widely
+   used XML vocabularies like XHTML, OOXML and ODF, but the default
+   {{{./api/org/apache/tika/parser/xml/DcXMLParser.html}DcXMLParser}}
+   class simply extracts the text content of the document and ignores any XML
+   structure. The only exceptions to this rule are Dublin Core metadata
+   elements, which are used for the document metadata.
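+
+   The behaviour described above can be approximated with the JDK's own SAX
+   API. The following is a minimal sketch, not the DcXMLParser
+   implementation: the class name, the <<<dc:>>> key prefix and the choice to
+   keep Dublin Core values out of the body text are assumptions made for
+   illustration.
+

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects plain text and Dublin Core metadata from XML.
final class DublinCoreTextSketch extends DefaultHandler {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    final StringBuilder text = new StringBuilder();
    final Map<String, String> metadata = new LinkedHashMap<>();
    private StringBuilder dcValue; // non-null while inside a Dublin Core element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (DC_NS.equals(uri)) {
            dcValue = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Dublin Core values go to the metadata buffer, everything else to the text.
        if (dcValue != null) {
            dcValue.append(ch, start, length);
        } else {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (DC_NS.equals(uri) && dcValue != null) {
            metadata.put("dc:" + localName, dcValue.toString().trim());
            dcValue = null;
        }
    }

    static DublinCoreTextSketch parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to see the Dublin Core namespace URI
        DublinCoreTextSketch handler = new DublinCoreTextSketch();
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler;
    }
}
```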
+
+* {Microsoft Office document formats}
+
+   Microsoft Office and some related applications produce documents in the
+   generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The
+   older OLE 2 format was introduced in Microsoft Office version 97 and was
+   the default format until Office version 2007 introduced the new XML-based
+   OOXML format. The
+   {{{./api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}}
+   and
+   {{{./api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html}OOXMLParser}}
+   classes use {{{http://poi.apache.org/}Apache POI}} libraries to support
+   text and metadata extraction from both OLE2 and OOXML documents.
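+
+   The two container formats can be told apart by their leading bytes: OLE 2
+   files start with a fixed eight-byte signature, while OOXML files are ZIP
+   packages. The sketch below is illustrative only and is not how Tika
+   performs detection; the class and method names are hypothetical, and a ZIP
+   signature alone does not prove that a file is OOXML.
+

```java
import java.util.Arrays;

// Hypothetical helper: classifies a file header by its container signature.
final class OfficeContainerSketch {
    // OLE 2 Compound Document signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature ("PK\3\4"); OOXML packages are ZIP files.
    private static final byte[] ZIP_MAGIC = {'P', 'K', 0x03, 0x04};

    static String containerOf(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "ZIP"; // possibly OOXML; the package contents decide
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```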
+
+   Old, pre-OLE2 Excel files (Excel 2, 3 and 4) are handled by the
+   {{{./api/org/apache/tika/parser/microsoft/OldExcelParser.html}OldExcelParser}}.
+
+   The older, pre-OOXML, pure-XML Office file formats are handled by
+   {{{./api/org/apache/tika/parser/microsoft/xml/SpreadsheetMLParser.html}SpreadsheetMLParser}},
+   {{{./api/org/apache/tika/parser/microsoft/xml/WordMLParser.html}WordMLParser}}
+   and
+   {{{./api/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParser.html}Word2006MLParser}}.
+
+   Temporary Office lock files (owner files) are supported for basic metadata
+   extraction by
+   {{{./api/org/apache/tika/parser/microsoft/MSOwnerFileParser.html}MSOwnerFileParser}}.
+
+* {OpenDocument Format}
+
+   The OpenDocument format (ODF) is used most notably as the default format
+   of the OpenOffice.org office suite. The
+   {{{./api/org/apache/tika/parser/odf/OpenDocumentParser.html}OpenDocumentParser}}
+   class supports this format and the earlier OpenOffice 1.0 format on which
+   ODF is based.
+
+* {iWorks document formats}
+
+   The various iWork document formats (Numbers, Pages, Keynote) are supported
+   by the
+   {{{./api/org/apache/tika/parser/iwork/IWorkPackageParser.html}IWorkPackageParser}}
+   class, which extracts text and metadata.
+
+* {WordPerfect document formats}
+
+   The Corel WordPerfect Office Suite formats are supported by
+   {{{./api/org/apache/tika/parser/wordperfect/WordPerfectParser.html}WordPerfectParser}},
+   supporting WordPerfect WP6+ files, and
+   {{{./api/org/apache/tika/parser/wordperfect/QuattroProParser.html}QuattroProParser}},
+   supporting QuattroPro QPW v9+ files.
+
+* {Portable Document Format}
+
+   The {{{./api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class
+   parses Portable Document Format (PDF) documents using the
+   {{{http://pdfbox.apache.org/}Apache PDFBox}} library.
+
+* {Electronic Publication Format}
+
+   The {{{./api/org/apache/tika/parser/epub/EpubParser.html}EpubParser}} class
+   supports the Electronic Publication Format (EPUB) used for many digital
+   books.
+
+   The {{{./api/org/apache/tika/parser/xml/FictionBookParser.html}FictionBookParser}} class
+   supports the XML-based Fiction Book publishing format.
+
+* {Rich Text Format}
+
+   The {{{./api/org/apache/tika/parser/rtf/RTFParser.html}RTFParser}} class
+   uses the standard javax.swing.text.rtf feature to extract text content
+   from Rich Text Format (RTF) documents.
+
+* {Compression and packaging formats}
+
+   Tika uses the {{{http://commons.apache.org/compress/}Commons Compress}}
+   library to support various compression and packaging formats. The
+   {{{./api/org/apache/tika/parser/pkg/CompressorParser.html}CompressorParser}}
+   class handles parsing of the top level compression formats. The
+   {{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}}
+   class and its subclasses then parse the packaging formats and pass the
+   unpacked document streams to a second parsing stage using the parser
+   instance specified in the parse context. Formats supported include Tar,
+   AR, ARJ, CPIO, Dump, Zip, 7Zip, Gzip, BZip2, XZ, LZMA, Z and Pack200.
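+
+   The staged approach (decompress, unpack, then hand each entry to a further
+   parsing stage) can be sketched with the JDK's <<<java.util.zip>>> classes.
+   This is only an illustration for a gzip-compressed zip archive; it uses
+   neither Commons Compress nor Tika's parser API, and <<<EntryHandler>>> is a
+   hypothetical stand-in for the second-stage parser.
+

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: two-stage unpacking of a gzip-compressed zip archive.
final class TwoStageUnpackSketch {
    /** Stand-in for the second parsing stage; must not close the stream. */
    interface EntryHandler {
        void handle(String name, InputStream content) throws IOException;
    }

    static void unpack(InputStream gzippedZip, EntryHandler handler) throws IOException {
        // Stage 1: strip the compression layer. Stage 2: walk the archive entries.
        try (ZipInputStream zip = new ZipInputStream(new GZIPInputStream(gzippedZip))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // The stream is positioned at this entry and ends where the entry ends.
                    handler.handle(entry.getName(), zip);
                }
            }
        }
    }

    /** Example second stage: read every entry as UTF-8 text. */
    static Map<String, String> readAllAsText(InputStream gzippedZip) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        unpack(gzippedZip, (name, content) ->
                result.put(name, new String(content.readAllBytes(), StandardCharsets.UTF_8)));
        return result;
    }
}
```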
+
+   Additionally, the
+   {{{./api/org/apache/tika/parser/pkg/RarParser.html}RarParser}} class
+   supports the RAR archive format, which isn't supported by Commons Compress.
+
+   The
+   {{{./api/org/apache/tika/parser/apple/AppleSingleFileParser.html}AppleSingleFileParser}}
+   class supports resources packaged within AppleSingle and AppleDouble
+   files.
+
+* {Text formats}
+
+   Extracting text content from plain text files seems like a simple task
+   until you start thinking of all the possible character encodings. The
+   {{{./api/org/apache/tika/parser/txt/TXTParser.html}TXTParser}} class uses
+   encoding detection code from the {{{http://site.icu-project.org/}ICU}}
+   project to automatically detect the character encoding of a text document.
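+
+   To illustrate why encoding matters, the sketch below guesses a charset by
+   strict trial decoding against an ordered list of candidates. This is a much
+   simpler technique than the statistical detection provided by ICU, and it is
+   not what TXTParser does; the class and method names are hypothetical.
+

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: picks the first candidate charset that decodes cleanly.
final class CharsetGuessSketch {
    static Optional<Charset> guess(byte[] data, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            // REPORT makes the decoder throw instead of silently substituting characters.
            CharsetDecoder decoder = candidate.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(data));
                return Optional.of(candidate);
            } catch (CharacterCodingException e) {
                // Not valid in this charset; try the next candidate.
            }
        }
        return Optional.empty();
    }
}
```

+
+   Candidate order matters in this sketch: single-byte charsets such as
+   ISO-8859-1 accept any byte sequence, so they only make sense as a last
+   resort.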
+
+* {Feed and Syndication formats}
+
+   The {{{./api/org/apache/tika/parser/feed/FeedParser.html}FeedParser}} class
+   supports the RSS and Atom feed syndication formats.
+
+   The {{{./api/org/apache/tika/parser/iptc/IptcAnpaParser.html}IptcAnpaParser}} class
+   supports the IPTC ANPA News Wire feed format.
+
+* {Help formats}
+
+   The {{{./api/org/apache/tika/parser/chm/ChmParser.html}ChmParser}} class
+   supports the CHM Help format.
+
+* {Audio formats}
+
+   Tika can detect several common audio formats and extract metadata
+   from them. Even text extraction is supported for some audio files that
+   contain lyrics or other textual content. Extracted metadata includes
+   sampling rates, channels, format information, artists, titles etc. The
+   {{{./api/org/apache/tika/parser/audio/AudioParser.html}AudioParser}}
+   and {{{./api/org/apache/tika/parser/audio/MidiParser.html}MidiParser}}
+   classes use standard javax.sound features to process simple audio
+   formats. The
+   {{{./api/org/apache/tika/parser/mp3/Mp3Parser.html}Mp3Parser}} class
+   adds support for the widely used MP3 format, and the
+   {{{./api/org/apache/tika/parser/mp4/MP4Parser.html}MP4Parser}} class
+   provides it for MP4 audio. The Ogg family of audio formats (Vorbis,
+   Speex, Opus, Flac etc.) is supported by the
+   {{{./api/org/gagravarr/tika/VorbisParser.html}VorbisParser}},
+   {{{./api/org/gagravarr/tika/OpusParser.html}OpusParser}},
+   {{{./api/org/gagravarr/tika/SpeexParser.html}SpeexParser}} and
+   {{{./api/org/gagravarr/tika/FlacParser.html}FlacParser}}
+   classes.
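+
+   The kind of metadata available through the standard <<<javax.sound>>>
+   API can be seen in the following sketch. It is not the AudioParser
+   implementation; the class name and the metadata keys are hypothetical, and
+   <<<silentWav>>> exists only to produce sample input.
+

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

// Hypothetical helper: reads basic audio metadata using javax.sound.sampled.
final class AudioMetadataSketch {
    static Map<String, String> describe(InputStream in)
            throws IOException, UnsupportedAudioFileException {
        // AudioSystem needs mark/reset support to probe the file header.
        InputStream probe = in.markSupported() ? in : new BufferedInputStream(in);
        AudioFileFormat fileFormat = AudioSystem.getAudioFileFormat(probe);
        AudioFormat format = fileFormat.getFormat();

        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("type", fileFormat.getType().toString());
        metadata.put("sampleRate", Integer.toString((int) format.getSampleRate()));
        metadata.put("channels", Integer.toString(format.getChannels()));
        metadata.put("bits", Integer.toString(format.getSampleSizeInBits()));
        return metadata;
    }

    /** Builds an in-memory WAV file of silence (16-bit signed little-endian PCM). */
    static byte[] silentWav(float sampleRate, int channels, int frames) throws IOException {
        AudioFormat format = new AudioFormat(sampleRate, 16, channels, true, false);
        byte[] pcm = new byte[frames * format.getFrameSize()];
        AudioInputStream audio =
                new AudioInputStream(new ByteArrayInputStream(pcm), format, frames);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        AudioSystem.write(audio, AudioFileFormat.Type.WAVE, out);
        return out.toByteArray();
    }
}
```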
+
+* {Image formats}
+
+   The {{{./api/org/apache/tika/parser/image/ImageParser.html}ImageParser}}
+   class uses the standard javax.imageio feature to extract simple metadata
+   from image formats supported by the Java platform, such as PNG, GIF
+   and BMP. More complex image metadata is available through the
+   {{{./api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} and
+   {{{./api/org/apache/tika/parser/image/TiffParser.html}TiffParser}} classes,
+   which use the metadata-extractor library to support Exif metadata
+   extraction from JPEG and TIFF images. The
+   {{{./api/org/apache/tika/parser/image/PSDParser.html}PSDParser}} class
+   extracts metadata from PSD images. The
+   {{{./api/org/apache/tika/parser/image/BPGParser.html}BPGParser}} class
+   extracts simple metadata from BPG (Better Portable Graphics) images.
+   The {{{./api/org/apache/tika/parser/image/WebPParser.html}WebPParser}}
+   class extracts simple metadata from the WebP image format.
+   The {{{./api/org/apache/tika/parser/image/ICNSParser.html}ICNSParser}} 
+   class extracts simple metadata from the Apple ICNS icon image format.
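+
+   As a rough idea of the "simple metadata" that the standard
+   <<<javax.imageio>>> API exposes, the sketch below reads the format name and
+   pixel dimensions without decoding the pixel data. It is not the ImageParser
+   implementation; the class name and metadata keys are hypothetical.
+

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

// Hypothetical helper: reads basic image metadata using javax.imageio.
final class ImageMetadataSketch {
    static Map<String, String> describe(InputStream in) throws IOException {
        // An in-memory image stream avoids ImageIO's on-disk cache.
        try (ImageInputStream stream = new MemoryCacheImageInputStream(in)) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(stream);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader available for this stream");
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(stream, true);
                Map<String, String> metadata = new LinkedHashMap<>();
                metadata.put("format", reader.getFormatName());
                metadata.put("width", Integer.toString(reader.getWidth(0)));
                metadata.put("height", Integer.toString(reader.getHeight(0)));
                return metadata;
            } finally {
                reader.dispose();
            }
        }
    }
}
```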
+
+   When extracting from images, it is also possible to chain in Tesseract, via
+   the {{{./api/org/apache/tika/parser/ocr/TesseractOCRParser.html}TesseractOCRParser}},
+   to have OCR performed on the contents of the image.
+
+   The {{{./api/org/apache/tika/parser/microsoft/WMFParser.html}WMFParser}}
+   class extracts simple text from Microsoft WMF drawings.
+   The {{{./api/org/apache/tika/parser/microsoft/EMFParser.html}EMFParser}}
+   class extracts simple text from Microsoft EMF drawings and exposes any
+   other embedded resources or files.
+
+* {Video formats}
+
+   Tika supports the Flash video format using a simple parsing algorithm 
+   implemented in the
+   {{{./api/org/apache/tika/parser/video/FLVParser}FLVParser}} class.
+
+   The MP4 family of video formats (MP4, Quicktime, 3GPP etc) is supported 
+   by the {{{./api/org/apache/tika/parser/mp4/MP4Parser}MP4Parser}} class,
+   which extracts metadata on the video, along with the audio stream
+   (if present).
+
+   For the Ogg family of video formats, a limited amount of metadata is
+   extracted by the 
+   {{{./api/org/gagravarr/tika/OggParser.html}OggParser}} class. There is
+   also an experimental
+   {{{./api/org/gagravarr/tika/TheoraParser.html}TheoraParser}} class which
+   extracts only limited metadata, pending a consensus on the "right" way
+   to return metadata for audio streams along with the video metadata.
+
+   As an alternative to the metadata-focused parsers above, the
+   {{{./api/org/apache/tika/parser/pot/PooledTimeSeriesParser}PooledTimeSeriesParser}}
+   can be used (if the required tool is installed) to generate a numeric
+   representation of the video suitable for similarity searches. More details
+   on this approach, and setup instructions for the parser + tool, can be
+   found on {{{https://wiki.apache.org/tika/PooledTimeSeriesParser}the Tika
+   wiki page for the parser}}.
+
+* {Java class files and archives}
+
+   The {{{./api/org/apache/tika/parser/asm/ClassParser}ClassParser}} class
+   extracts class names and method signatures from Java class files, and
+   the {{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} class
+   also supports jar archives.
+
+* {Source code}
+
+   The {{{./api/org/apache/tika/parser/code/SourceCodeParser}SourceCodeParser}} class
+   handles a number of source code formats, including Java, C, C++ and Groovy.
+   It provides a formatted form of the code, along with some simple metadata.
+
+* {Mail formats}
+
+   The {{{./api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can
+   extract email messages from the mbox format used by many email archives
+   and Unix-style mailboxes.
+
+   The {{{./api/org/apache/tika/parser/mail/RFC822Parser.html}RFC822Parser}} can
+   process single email messages in the RFC 822 format used by many email clients
+   in their archives / exports.
+
+   The {{{./api/org/apache/tika/parser/mbox/OutlookPSTParser.html}OutlookPSTParser}} can
+   extract email messages from the Microsoft Outlook PST email format.
+
+   The {{{./api/org/apache/tika/parser/microsoft/OutlookExtractor.html}OutlookExtractor}} (part of
+   {{{./api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}})
+   is able to extract email messages from the Microsoft Outlook MSG email
+   format.
+
+   The {{{./api/org/apache/tika/parser/microsoft/TNEFParser.html}TNEFParser}} can
+   extract email attachments from the Microsoft TNEF (Transport Neutral Encoding
+   Format, aka Winmail.dat) used with some Microsoft email clients.
+
+* {CAD formats}
+
+   The {{{./api/org/apache/tika/parser/dwg/DWGParser.html}DWGParser}} can
+   extract simple metadata from the DWG CAD format.
+
+* {Font formats}
+
+   The {{{./api/org/apache/tika/parser/font/TrueTypeParser.html}TrueTypeParser}}
+   class can extract simple metadata from the TrueType font format.
+   The {{{./api/org/apache/tika/parser/font/AdobeFontMetricParser.html}AdobeFontMetricParser}}
+   class does something similar for Adobe Font Metrics files.
+
+* {Scientific formats}
+
+   The {{{./api/org/apache/tika/parser/dif/DIFParser.html}DIFParser}}
+   is able to extract attribute metadata from the GCMD Directory 
+   Interchange Format (DIF) scientific file format.
+
+   The {{{./api/org/apache/tika/parser/gdal/GDALParser.html}GDALParser}}
+   is able to extract attribute metadata from the GDAL scientific file format.
+
+   The {{{./api/org/apache/tika/parser/geoinfo/GeographicInformationParser.html}GeographicInformationParser}}
+   is able to extract attribute metadata from the ISO-19139 geographic
+   information file format.
+
+   The {{{./api/org/apache/tika/parser/geo/topic/GeoParser.html}GeoParser}}
+   makes use of a pre-built geographic gazetteer to resolve geographic
+   entities and add their positions to the metadata.
+
+   The {{{./api/org/apache/tika/parser/grib/GribParser.html}GribParser}}
+   is able to extract attribute metadata from the Grib scientific file format.
+
+   The {{{./api/org/apache/tika/parser/hdf/HDFParser.html}HDFParser}}
+   is able to extract attribute metadata from the HDF scientific file format.
+
+   The {{{./api/org/apache/tika/parser/isatab/ISArchiveParser.html}ISArchiveParser}}
+   is able to extract attribute metadata from the ISA-Tab (ISA Tools) family of
+   scientific file formats.
+
+   The {{{./api/org/apache/tika/parser/netcdf/NetCDFParser.html}NetCDFParser}}
+   is able to extract attribute metadata from the NetCDF scientific file format.
+
+   The {{{./api/org/apache/tika/parser/mat/MatParser.html}MatParser}}
+   is able to extract attribute metadata from the Matlab scientific file format.
+
+* {Executable programs and libraries}
+
+   The {{{./api/org/apache/tika/parser/executable/ExecutableParser.html}ExecutableParser}} can
+   extract metadata information on platforms, architectures and types from a range
+   of executable formats and libraries, such as Windows Executables and Linux / BSD
+   programs and libraries.
+
+* {Crypto formats}
+
+   The {{{./api/org/apache/tika/parser/crypto/Pkcs7Parser.html}Pkcs7Parser}} is able to
+   parse the contents of PKCS7 signed messages, but doesn't include any information from
+   the outer PKCS7 wrapper.
+
+   The {{{./api/org/apache/tika/parser/crypto/TSDParser.html}TSDParser}} class
+   processes metadata from Time Stamped Data Envelope files, as well as exposing the
+   contents stored within the TSD wrapper.
+
+* {Database formats}
+
+   The {{{./api/org/apache/tika/parser/jdbc/SQLite3Parser.html}SQLite3Parser}} is able to
+   extract content from SQLite3 files, in a tabular form. However, it requires that the
+   {{{http://xerial.org/software/}org.xerial sqlite-jdbc jar}} is manually added to
+   the classpath first, as that binary jar isn't shipped as standard.
+
+   The {{{./api/org/apache/tika/parser/microsoft/JackcessParser.html}JackcessParser}} is
+   able to extract metadata and content in a tabular form, from Microsoft Access
+   database files.
+
+   The {{{./api/org/apache/tika/parser/dbf/DBFParser.html}DBFParser}} currently
+   supports versions of dBase files (dbf) before version 7. dBase formats are 
+   used in many legacy database systems, including
+   dBase, FoxBASE, FoxPRO and in ESRI's Shapefile format.  See
+   {{{http://www.digitalpreservation.gov/formats/fdd/fdd000325.shtml} digitalpreservation.gov}}
+   for background on this format.
+
+* {Natural Language Processing}
+
+   Tika supports calling out to a number of Natural Language Processing and
+   Named Entity Recognition frameworks, tools and libraries. 
+
+   These can be used to support additional formats, or to gain extra information on
+   existing formats. In many cases, additional tools or REST services or training
+   datasets are required to enable or power this support.
+
+   Details on the requirements and setup steps are generally given either in
+   the parser's javadocs, or on the {{{https://wiki.apache.org/tika/}Tika wiki}}.
+
+   The {{{./api/org/apache/tika/parser/sentiment/analysis/SentimentParser.html}SentimentParser}}
+   class classifies documents based on their sentiment, powered by Apache
+   OpenNLP's Maximum Entropy Classifier.
+
+   {{{./api/org/apache/tika/parser/journal/JournalParser.html}JournalParser}} uses
+   Grobid (via a RESTful server) to extract additional metadata from the text of
+   journal publications. A number of other NLP and NER parsers are available in the
+   {{{./api/org/apache/tika/parser/ner/}ner package}}.
+
+* {Image and Video object recognition}
+
+   Tika supports calling out to a number of Object Recognition frameworks to
+   analyse the contents of images and videos. Large training datasets and/or
+   frameworks are generally required, often accessed via REST services. The
+   {{{./api/org/apache/tika/parser/recognition/}recognition package}} contains
+   most of these. Details on the requirements and setup steps are generally given
+   on the {{{https://wiki.apache.org/tika/}Tika wiki}}.
+
+
+Full list of Supported Formats in "standard" artifacts
+
+   * org.apache.tika.parser.apple.{{{./api/org/apache/tika/parser/apple/AppleSingleFileParser}AppleSingleFileParser}}
+
+      * application/applefile
+
+   * org.apache.tika.parser.apple.{{{./api/org/apache/tika/parser/apple/PListParser}PListParser}}
+
+      * application/x-plist
+
+      * application/x-bplist-itunes
+
+      * application/x-bplist
+
+      * application/x-bplist-memgraph
+
+      * application/x-bplist-webarchive
+
+   * org.apache.tika.parser.asm.{{{./api/org/apache/tika/parser/asm/ClassParser}ClassParser}}
+
+      * application/java-vm
+
+   * org.apache.tika.parser.audio.{{{./api/org/apache/tika/parser/audio/AudioParser}AudioParser}}
+
+      * audio/vnd.wave
+
+      * audio/x-wav
+
+      * audio/basic
+
+      * audio/x-aiff
+
+   * org.apache.tika.parser.audio.{{{./api/org/apache/tika/parser/audio/MidiParser}MidiParser}}
+
+      * application/x-midi
+
+      * audio/midi
+
+   * org.apache.tika.parser.code.{{{./api/org/apache/tika/parser/code/SourceCodeParser}SourceCodeParser}}
+
+      * text/x-c++src
+
+      * text/x-groovy
+
+      * text/x-java-source
+
+   * org.apache.tika.parser.crypto.{{{./api/org/apache/tika/parser/crypto/Pkcs7Parser}Pkcs7Parser}}
+
+      * application/pkcs7-signature
+
+      * application/pkcs7-mime
+
+   * org.apache.tika.parser.crypto.{{{./api/org/apache/tika/parser/crypto/TSDParser}TSDParser}}
+
+      * application/timestamped-data
+
+   * org.apache.tika.parser.csv.{{{./api/org/apache/tika/parser/csv/TextAndCSVParser}TextAndCSVParser}}
+
+      * text/csv
+
+      * text/tsv
+
+      * text/plain
+
+   * org.apache.tika.parser.dbf.{{{./api/org/apache/tika/parser/dbf/DBFParser}DBFParser}}
+
+      * application/x-dbf
+
+   * org.apache.tika.parser.dgn.{{{./api/org/apache/tika/parser/dgn/DGN8Parser}DGN8Parser}}
+
+      * image/vnd.dgn; version=8
+
+   * org.apache.tika.parser.dif.{{{./api/org/apache/tika/parser/dif/DIFParser}DIFParser}}
+
+      * application/dif+xml
+
+   * org.apache.tika.parser.dwg.{{{./api/org/apache/tika/parser/dwg/DWGParser}DWGParser}}
+
+      * image/vnd.dwg
+
+   * org.apache.tika.parser.epub.{{{./api/org/apache/tika/parser/epub/EpubParser}EpubParser}}
+
+      * application/x-ibooks+zip
+
+      * application/epub+zip
+
+   * org.apache.tika.parser.executable.{{{./api/org/apache/tika/parser/executable/ExecutableParser}ExecutableParser}}
+
+      * application/x-msdownload
+
+      * application/x-sharedlib
+
+      * application/x-elf
+
+      * application/x-object
+
+      * application/x-executable
+
+      * application/x-coredump
+
+   * org.apache.tika.parser.feed.{{{./api/org/apache/tika/parser/feed/FeedParser}FeedParser}}
+
+      * application/atom+xml
+
+      * application/rss+xml
+
+   * org.apache.tika.parser.font.{{{./api/org/apache/tika/parser/font/AdobeFontMetricParser}AdobeFontMetricParser}}
+
+      * application/x-font-adobe-metric
+
+   * org.apache.tika.parser.font.{{{./api/org/apache/tika/parser/font/TrueTypeParser}TrueTypeParser}}
+
+      * application/x-font-ttf
+
+   * org.apache.tika.parser.html.{{{./api/org/apache/tika/parser/html/HtmlParser}HtmlParser}}
+
+      * text/html
+
+      * application/vnd.wap.xhtml+xml
+
+      * application/x-asp
+
+      * application/xhtml+xml
+
+   * org.apache.tika.parser.http.{{{./api/org/apache/tika/parser/http/HttpParser}HttpParser}}
+
+      * application/x-httpresponse
+
+   * org.apache.tika.parser.hwp.{{{./api/org/apache/tika/parser/hwp/HwpV5Parser}HwpV5Parser}}
+
+      * application/x-hwp-v5
+
+   * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/BPGParser}BPGParser}}
+
+      * image/bpg
+
+      * image/x-bpg
+
+   * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/HeifParser}HeifParser}}
+
+      * image/heic-sequence
+
+      * image/heif
+
+      * image/heic
+
+      * image/heif-sequence
+
+   * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/ICNSParser}ICNSParser}}
+
+      * image/icns
+
+   * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/ImageParser}ImageParser}}
+
+      * image/png
+
+      * image/vnd.wap.wbmp
+
+      * image/x-jbig2
+
+      * image/bmp
+
+      * image/x-xcf
+
+      * image/gif
+
+      * image/x-icon
+
+      * image/x-ms-bmp
+
+   * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/JXLParser}JXLParser}}
+
+      * image/jxl
+
+   * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/JpegParser}JpegParser}}
+
+      * image/jpeg
+
+   * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/PSDParser}PSDParser}}
+
+      * image/vnd.adobe.photoshop
+
+   * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/TiffParser}TiffParser}}
+
+      * image/tiff
+
+   * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/WebPParser}WebPParser}}
+
+      * image/webp
+
+   * org.apache.tika.parser.indesign.{{{./api/org/apache/tika/parser/indesign/IDMLParser}IDMLParser}}
+
+      * application/vnd.adobe.indesign-idml-package
+
+   * org.apache.tika.parser.iptc.{{{./api/org/apache/tika/parser/iptc/IptcAnpaParser}IptcAnpaParser}}
+
+      * text/vnd.iptc.anpa
+
+   * org.apache.tika.parser.iwork.{{{./api/org/apache/tika/parser/iwork/IWorkPackageParser}IWorkPackageParser}}
+
+      * application/vnd.apple.keynote
+
+      * application/vnd.apple.iwork
+
+      * application/vnd.apple.numbers
+
+      * application/vnd.apple.pages
+
+   * org.apache.tika.parser.iwork.iwana.{{{./api/org/apache/tika/parser/iwork/iwana/IWork13PackageParser}IWork13PackageParser}}
+
+      * application/vnd.apple.numbers.13
+
+      * application/vnd.apple.unknown.13
+
+      * application/vnd.apple.pages.13
+
+      * application/vnd.apple.keynote.13
+
+   * org.apache.tika.parser.iwork.iwana.{{{./api/org/apache/tika/parser/iwork/iwana/IWork18PackageParser}IWork18PackageParser}}
+
+      * application/vnd.apple.pages.18
+
+      * application/vnd.apple.keynote.18
+
+      * application/vnd.apple.numbers.18
+
+   * org.apache.tika.parser.mail.{{{./api/org/apache/tika/parser/mail/RFC822Parser}RFC822Parser}}
+
+      * message/rfc822
+
+   * org.apache.tika.parser.mat.{{{./api/org/apache/tika/parser/mat/MatParser}MatParser}}
+
+      * application/x-matlab-data
+
+   * org.apache.tika.parser.mbox.{{{./api/org/apache/tika/parser/mbox/MboxParser}MboxParser}}
+
+      * application/mbox
+
+   * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/EMFParser}EMFParser}}
+
+      * image/emf
+
+   * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/JackcessParser}JackcessParser}}
+
+      * application/x-msaccess
+
+   * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/MSOwnerFileParser}MSOwnerFileParser}}
+
+      * application/x-ms-owner
+
+   * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/OfficeParser}OfficeParser}}
+
+      * application/x-tika-msoffice-embedded; format=ole10_native
+
+      * application/msword
+
+      * application/vnd.visio
+
+      * application/x-tika-ole-drm-encrypted
+
+      * application/vnd.ms-project
+
+      * application/x-tika-msworks-spreadsheet
+
+      * application/x-mspublisher
+
+      * application/vnd.ms-powerpoint
+
+      * application/x-tika-msoffice
+
+      * application/sldworks
+
+      * application/x-tika-ooxml-protected
+
+      * application/vnd.ms-excel
+
+      * application/vnd.ms-outlook
+
+   * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/OldExcelParser}OldExcelParser}}
+
+      * application/vnd.ms-excel.workspace.3
+
+      * application/vnd.ms-excel.workspace.4
+
+      * application/vnd.ms-excel.sheet.2
+
+      * application/vnd.ms-excel.sheet.3
+
+      * application/vnd.ms-excel.sheet.4
+
+   * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/TNEFParser}TNEFParser}}
+
+      * application/vnd.ms-tnef
+
+      * application/x-tnef
+
+      * application/ms-tnef
+
+   * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/WMFParser}WMFParser}}
+
+      * image/wmf
+
+   * org.apache.tika.parser.microsoft.chm.{{{./api/org/apache/tika/parser/microsoft/chm/ChmParser}ChmParser}}
+
+      * application/vnd.ms-htmlhelp
+
+      * application/x-chm
+
+      * application/chm
+
+   * org.apache.tika.parser.microsoft.onenote.{{{./api/org/apache/tika/parser/microsoft/onenote/OneNoteParser}OneNoteParser}}
+
+      * application/onenote; format=one
+
+   * org.apache.tika.parser.microsoft.ooxml.{{{./api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser}OOXMLParser}}
+
+      * application/vnd.ms-powerpoint.template.macroenabled.12
+
+      * application/vnd.ms-excel.addin.macroenabled.12
+
+      * application/vnd.openxmlformats-officedocument.wordprocessingml.template
+
+      * application/vnd.ms-excel.sheet.binary.macroenabled.12
+
+      * application/vnd.openxmlformats-officedocument.wordprocessingml.document
+
+      * application/vnd.ms-powerpoint.slide.macroenabled.12
+
+      * application/vnd.ms-visio.drawing
+
+      * application/vnd.ms-powerpoint.slideshow.macroenabled.12
+
+      * application/vnd.ms-powerpoint.presentation.macroenabled.12
+
+      * application/vnd.openxmlformats-officedocument.presentationml.slide
+
+      * application/vnd.ms-excel.sheet.macroenabled.12
+
+      * application/vnd.ms-word.template.macroenabled.12
+
+      * application/vnd.ms-word.document.macroenabled.12
+
+      * application/vnd.ms-powerpoint.addin.macroenabled.12
+
+      * application/vnd.openxmlformats-officedocument.spreadsheetml.template
+
+      * application/vnd.ms-xpsdocument
+
+      * application/vnd.ms-visio.drawing.macroenabled.12
+
+      * application/vnd.ms-visio.template.macroenabled.12
+
+      * model/vnd.dwfx+xps
+
+      * application/vnd.openxmlformats-officedocument.presentationml.template
+
+      * application/vnd.openxmlformats-officedocument.presentationml.presentation
+
+      * application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
+
+      * application/vnd.ms-visio.stencil
+
+      * application/vnd.ms-visio.template
+
+      * application/vnd.openxmlformats-officedocument.presentationml.slideshow
+
+      * application/vnd.ms-visio.stencil.macroenabled.12
+
+      * application/vnd.ms-excel.template.macroenabled.12
+
+   * org.apache.tika.parser.microsoft.ooxml.xwpf.ml2006.{{{./api/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParser}Word2006MLParser}}
+
+      * application/vnd.ms-word2006ml
+
+   * org.apache.tika.parser.microsoft.pst.{{{./api/org/apache/tika/parser/microsoft/pst/OutlookPSTParser}OutlookPSTParser}}
+
+      * application/vnd.ms-outlook-pst
+
+   * org.apache.tika.parser.microsoft.rtf.{{{./api/org/apache/tika/parser/microsoft/rtf/RTFParser}RTFParser}}
+
+      * application/rtf
+
+   * org.apache.tika.parser.microsoft.xml.{{{./api/org/apache/tika/parser/microsoft/xml/SpreadsheetMLParser}SpreadsheetMLParser}}
+
+      * application/vnd.ms-spreadsheetml
+
+   * org.apache.tika.parser.microsoft.xml.{{{./api/org/apache/tika/parser/microsoft/xml/WordMLParser}WordMLParser}}
+
+      * application/vnd.ms-wordml
+
+   * org.apache.tika.parser.mif.{{{./api/org/apache/tika/parser/mif/MIFParser}MIFParser}}
+
+      * application/x-mif
+
+      * application/vnd.mif
+
+      * application/x-maker
+
+   * org.apache.tika.parser.mp3.{{{./api/org/apache/tika/parser/mp3/Mp3Parser}Mp3Parser}}
+
+      * audio/mpeg
+
+   * org.apache.tika.parser.mp4.{{{./api/org/apache/tika/parser/mp4/MP4Parser}MP4Parser}}
+
+      * video/x-m4v
+
+      * application/mp4
+
+      * video/3gpp
+
+      * video/3gpp2
+
+      * video/quicktime
+
+      * audio/mp4
+
+      * video/mp4
+
+   * org.apache.tika.parser.ocr.{{{./api/org/apache/tika/parser/ocr/TesseractOCRParser}TesseractOCRParser}}
+
+      * image/ocr-x-portable-pixmap
+
+      * image/ocr-jpx
+
+      * image/x-portable-pixmap
+
+      * image/ocr-jpeg
+
+      * image/ocr-jp2
+
+      * image/jpx
+
+      * image/ocr-png
+
+      * image/ocr-tiff
+
+      * image/ocr-gif
+
+      * image/ocr-bmp
+
+      * image/jp2
+
+   * org.apache.tika.parser.odf.{{{./api/org/apache/tika/parser/odf/FlatOpenDocumentParser}FlatOpenDocumentParser}}
+
+      * application/vnd.oasis.opendocument.tika.flat.document
+
+      * application/vnd.oasis.opendocument.flat.presentation
+
+      * application/vnd.oasis.opendocument.flat.spreadsheet
+
+      * application/vnd.oasis.opendocument.flat.text
+
+   * org.apache.tika.parser.odf.{{{./api/org/apache/tika/parser/odf/OpenDocumentParser}OpenDocumentParser}}
+
+      * application/x-vnd.oasis.opendocument.presentation
+
+      * application/vnd.oasis.opendocument.chart
+
+      * application/x-vnd.oasis.opendocument.text-web
+
+      * application/x-vnd.oasis.opendocument.image
+
+      * application/vnd.oasis.opendocument.graphics-template
+
+      * application/vnd.oasis.opendocument.text-web
+
+      * application/x-vnd.oasis.opendocument.spreadsheet-template
+
+      * application/vnd.oasis.opendocument.spreadsheet-template
+
+      * application/vnd.sun.xml.writer
+
+      * application/x-vnd.oasis.opendocument.graphics-template
+
+      * application/vnd.oasis.opendocument.graphics
+
+      * application/vnd.oasis.opendocument.spreadsheet
+
+      * application/x-vnd.oasis.opendocument.chart
+
+      * application/x-vnd.oasis.opendocument.spreadsheet
+
+      * application/vnd.oasis.opendocument.image
+
+      * application/x-vnd.oasis.opendocument.text
+
+      * application/x-vnd.oasis.opendocument.text-template
+
+      * application/vnd.oasis.opendocument.formula-template
+
+      * application/x-vnd.oasis.opendocument.formula
+
+      * application/vnd.oasis.opendocument.image-template
+
+      * application/x-vnd.oasis.opendocument.image-template
+
+      * application/x-vnd.oasis.opendocument.presentation-template
+
+      * application/vnd.oasis.opendocument.presentation-template
+
+      * application/vnd.oasis.opendocument.text
+
+      * application/vnd.oasis.opendocument.text-template
+
+      * application/vnd.oasis.opendocument.chart-template
+
+      * application/x-vnd.oasis.opendocument.chart-template
+
+      * application/x-vnd.oasis.opendocument.formula-template
+
+      * application/x-vnd.oasis.opendocument.text-master
+
+      * application/vnd.oasis.opendocument.presentation
+
+      * application/x-vnd.oasis.opendocument.graphics
+
+      * application/vnd.oasis.opendocument.formula
+
+      * application/vnd.oasis.opendocument.text-master
+
+   * org.apache.tika.parser.pdf.{{{./api/org/apache/tika/parser/pdf/PDFParser}PDFParser}}
+
+      * application/pdf
+
+   * org.apache.tika.parser.pkg.{{{./api/org/apache/tika/parser/pkg/CompressorParser}CompressorParser}}
+
+      * application/zlib
+
+      * application/x-gzip
+
+      * application/x-bzip2
+
+      * application/x-compress
+
+      * application/x-java-pack200
+
+      * application/x-lzma
+
+      * application/deflate64
+
+      * application/x-lz4
+
+      * application/x-snappy
+
+      * application/x-brotli
+
+      * application/gzip
+
+      * application/x-bzip
+
+      * application/x-xz
+
+   * org.apache.tika.parser.pkg.{{{./api/org/apache/tika/parser/pkg/PackageParser}PackageParser}}
+
+      * application/x-tar
+
+      * application/java-archive
+
+      * application/x-arj
+
+      * application/x-archive
+
+      * application/zip
+
+      * application/x-cpio
+
+      * application/x-tika-unix-dump
+
+      * application/x-7z-compressed
+
+   * org.apache.tika.parser.pkg.{{{./api/org/apache/tika/parser/pkg/RarParser}RarParser}}
+
+      * application/x-rar-compressed
+
+   * org.apache.tika.parser.prt.{{{./api/org/apache/tika/parser/prt/PRTParser}PRTParser}}
+
+      * application/x-prt
+
+   * org.apache.tika.parser.sas.{{{./api/org/apache/tika/parser/sas/SAS7BDATParser}SAS7BDATParser}}
+
+      * application/x-sas-data
+
+   * org.apache.tika.parser.tmx.{{{./api/org/apache/tika/parser/tmx/TMXParser}TMXParser}}
+
+      * application/x-tmx
+
+   * org.apache.tika.parser.video.{{{./api/org/apache/tika/parser/video/FLVParser}FLVParser}}
+
+      * video/x-flv
+
+   * org.apache.tika.parser.wacz.{{{./api/org/apache/tika/parser/wacz/WACZParser}WACZParser}}
+
+      * application/x-wacz
+
+   * org.apache.tika.parser.warc.{{{./api/org/apache/tika/parser/warc/WARCParser}WARCParser}}
+
+      * application/warc
+
+   * org.apache.tika.parser.wordperfect.{{{./api/org/apache/tika/parser/wordperfect/QuattroProParser}QuattroProParser}}
+
+      * application/x-quattro-pro; version=9
+
+   * org.apache.tika.parser.wordperfect.{{{./api/org/apache/tika/parser/wordperfect/WordPerfectParser}WordPerfectParser}}
+
+      * application/vnd.wordperfect; version=5.1
+
+      * application/vnd.wordperfect; version=5.0
+
+      * application/vnd.wordperfect; version=6.x
+
+   * org.apache.tika.parser.xliff.{{{./api/org/apache/tika/parser/xliff/XLIFF12Parser}XLIFF12Parser}}
+
+      * application/x-xliff+xml
+
+   * org.apache.tika.parser.xliff.{{{./api/org/apache/tika/parser/xliff/XLZParser}XLZParser}}
+
+      * application/x-xliff+zip
+
+   * org.apache.tika.parser.xml.{{{./api/org/apache/tika/parser/xml/DcXMLParser}DcXMLParser}}
+
+      * application/xml
+
+      * image/svg+xml
+
+   * org.apache.tika.parser.xml.{{{./api/org/apache/tika/parser/xml/FictionBookParser}FictionBookParser}}
+
+      * application/x-fictionbook+xml
+
+   * org.gagravarr.tika.{{{./api/org/gagravarr/tika/FlacParser}FlacParser}}
+
+      * audio/x-oggflac
+
+      * audio/x-flac
+
+   * org.gagravarr.tika.{{{./api/org/gagravarr/tika/OggParser}OggParser}}
+
+      * audio/ogg
+
+      * application/kate
+
+      * application/ogg
+
+      * video/daala
+
+      * video/x-ogguvs
+
+      * video/x-ogm
+
+      * audio/x-oggpcm
+
+      * video/ogg
+
+      * video/x-dirac
+
+      * video/x-oggrgb
+
+      * video/x-oggyuv
+
+   * org.gagravarr.tika.{{{./api/org/gagravarr/tika/OpusParser}OpusParser}}
+
+      * audio/opus
+
+      * audio/ogg; codecs=opus
+
+   * org.gagravarr.tika.{{{./api/org/gagravarr/tika/SpeexParser}SpeexParser}}
+
+      * audio/ogg; codecs=speex
+
+      * audio/speex
+
+   * org.gagravarr.tika.{{{./api/org/gagravarr/tika/TheoraParser}TheoraParser}}
+
+      * video/theora
+
+   * org.gagravarr.tika.{{{./api/org/gagravarr/tika/VorbisParser}VorbisParser}}
+
+      * audio/vorbis
+

Added: tika/site/src/site/apt/2.6.0/gettingstarted.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/gettingstarted.apt?rev=1905121&view=auto
==============================================================================
--- tika/site/src/site/apt/2.6.0/gettingstarted.apt (added)
+++ tika/site/src/site/apt/2.6.0/gettingstarted.apt Mon Nov  7 11:40:42 2022
@@ -0,0 +1,324 @@
+                     --------------------------------
+                     Getting Started with Apache Tika
+                     --------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Getting Started with Apache Tika
+
+ This document describes how to build Apache Tika from sources and
+ how to start using Tika in an application.
+
+Getting and building the sources
+
+ To build Tika from sources you first need to either
+ {{{../download.html}download}} a source release or
+ {{{../contribute.html#Source_Code}checkout}} the latest sources from
+ version control.
+
+ Once you have the sources, you can build them using the
+ {{{http://maven.apache.org/}Maven 3}} build system. Executing the
+ following command in the base directory will build the sources
+ and install the resulting artifacts in your local Maven repository.
+
+---
+mvn install
+---
+
+ If you want to build only the app or the server with the standard parsers,
+ you can save time with:
+
+---
+mvn install -am -pl :tika-app
+---
+ Or:
+
+---
+mvn install -am -pl :tika-server-standard
+---
+
+ See the Maven documentation for more information about the available
+ build options.
+
+ Note that you need Java 8 or higher to build Tika. For a full build, you'll
+ also need to have Docker installed.
+
+Build artifacts
+
+ The Tika build consists of a number of components and produces
+ the following main binaries:
+
+ [tika-core/target/tika-core-*.jar]
+  Tika core library. Contains the core interfaces and classes of Tika,
+  but none of the parser implementations.
+
+ [tika-parsers/tika-parsers-standard/tika-parsers-standard-package/target/tika-parsers-standard-package-*.jar]
+  Tika parsers. Collection of classes that implement the Tika Parser
+  interface based on various external parser libraries. This includes
+  the most commonly used parsers. Users may want to add
+  <<<tika-parser-sqlite3-package>>> and <<<tika-parser-scientific-package>>>
+  or other parser modules.
+
+ [tika-app/target/tika-app-*.jar]
+  Tika application. Combines the above components and the standard
+  parser libraries into a single runnable jar with a GUI and a command
+  line interface.
+
+ [tika-server/tika-server-standard/target/tika-server-standard-*.jar]
+  Tika JAX-RS REST application. This is a Jetty web server running Tika
+  REST services with the parsers in tika-parsers-standard-package
+  as described in
+  {{{https://cwiki.apache.org/confluence/display/TIKA/TikaServer}this page}}.
+
+ [tika-bundles/tika-bundle-standard/target/tika-bundle-standard-*.jar]
+  Tika bundle. An OSGi bundle that combines tika-parsers with non-OSGified
+  parser libraries to make them easy to deploy in an OSGi environment.
+
+ [tika-eval/tika-eval-app/target/tika-eval-app-*.jar]
+  Tika eval module. Commandline tool to assess the output of Tika
+  or compare the output of two different versions of Tika or
+  other text extraction packages.
+
+
+
+Using Tika as a Maven dependency
+
+ The core library, <<<tika-core>>>, contains the key interfaces and classes
+ of Tika and can be used by itself if you don't need the full set of parsers 
+ from the <<< tika-parsers >>> component. The tika-core dependency looks like 
+ this:
+
+---
+  <dependency>
+    <groupId>org.apache.tika</groupId>
+    <artifactId>tika-core</artifactId>
+    <version>2.6.0</version>
+  </dependency>
+---
+
+ If you want to use Tika to parse documents (instead of simply detecting
+ document types, etc.), you'll want to add a dependency on at least
+ <<< tika-parsers-standard-package >>>:
+
+---
+  <dependency>
+    <groupId>org.apache.tika</groupId>
+    <artifactId>tika-parsers-standard-package</artifactId>
+    <version>2.6.0</version>
+  </dependency>
+---
+
+ Note that adding this dependency will introduce a number of
+ transitive dependencies to your project.
+ You need to make sure that these dependencies won't conflict with your
+ existing project dependencies. You can use the following command in
+ the tika-parsers-standard-package directory to get a full listing of all the
+ dependencies.
+
+---
+$ mvn dependency:tree | grep :compile
+---
+
+ You may also want to add one or more of the following dependencies:
+
+---
+  <dependency>
+    <groupId>org.apache.tika</groupId>
+    <artifactId>tika-parser-sqlite3-package</artifactId>
+    <version>2.6.0</version>
+  </dependency>
+  <dependency>
+    <groupId>org.apache.tika</groupId>
+    <artifactId>tika-parser-scientific-package</artifactId>
+    <version>2.6.0</version>
+  </dependency>
+---
+
+ You may also consider adding dependencies on modules under the
+ <<<tika-parsers-ml>>> module.
+
+Using Tika in a Gradle-built project
+
+ To add a dependency on Apache Tika to your Gradle-built project,
+ including the full set of parsers, you should depend on the
+ <<< tika-core >>> artifact and the
+ <<< tika-parsers-standard-package >>> artifact:
+
+---
+dependencies {
+    implementation 'org.apache.tika:tika-core:2.6.0'
+    implementation 'org.apache.tika:tika-parsers-standard-package:2.6.0'
+}
+---
+
+Using Tika in an Ant project
+
+ If you are using {{{http://ant.apache.org/ivy/}Apache Ivy}} as your
+ dependency manager tool with Ant, then to include Tika with the full set
+ of parsers, you should depend on the <<< tika-core >>> and
+ <<< tika-parsers-standard-package >>> artifacts like this:
+
+---
+    <dependencies>
+        <dependency org="org.apache.tika" name="tika-core" rev="2.6.0"/>
+        <dependency org="org.apache.tika" name="tika-parsers-standard-package" rev="2.6.0"/>
+    </dependencies>
+---
+
+ Otherwise, probably the easiest way to use Tika is to include the full
+ <<< tika-app >>> jar on your classpath. For just the core functionality, you
+ can add the <<< tika-core >>> jar instead. Be aware that the full set of
+ parsers has a large number of dependencies, all of which must be included,
+ and that is very fiddly to do by hand with Ant. To include Tika in your Ant
+ project, you should do something like:
+
+---
+<classpath>
+  ... <!-- your other classpath entries -->
+
+  <!-- either: Tika Core only, no parsers -->
+  <pathelement location="path/to/tika-core-2.6.0.jar"/>
+  <!-- or: Tika with all Parsers-->
+  <pathelement location="path/to/tika-app-2.6.0.jar"/>
+
+</classpath>
+---
+
+Using Tika as a command line utility
+
+ The Tika application jar (tika-app-*.jar) can be used as a command
+ line utility for extracting text content and metadata from all sorts of
+ files. This runnable jar contains all the dependencies it needs, so
+ you don't need to worry about classpath settings to run it.
+
+ The usage instructions are shown below.
+
+---
+usage: java -jar tika-app.jar [option...] [file|port...]
+
+Options:
+    -?  or --help          Print this usage message
+    -v  or --verbose       Print debug level messages
+    -V  or --version       Print the Apache Tika version number
+
+    -g  or --gui           Start the Apache Tika GUI
+    -s  or --server        Start the Apache Tika server
+    -f  or --fork          Use Fork Mode for out-of-process extraction
+
+    --config=<tika-config.xml>
+        TikaConfig file. Must be specified before -g, -s, -f or the
+        dump-x-config !
+    --dump-minimal-config  Print minimal TikaConfig
+    --dump-current-config  Print current TikaConfig
+    --dump-static-config   Print static config
+    --dump-static-full-config  Print static explicit config
+
+    -x  or --xml           Output XHTML content (default)
+    -h  or --html          Output HTML content
+    -t  or --text          Output plain text content
+    -T  or --text-main     Output plain text content (main content only)
+    -m  or --metadata      Output only metadata
+    -j  or --json          Output metadata in JSON
+    -y  or --xmp           Output metadata in XMP
+    -J  or --jsonRecursive Output metadata and content from all
+                           embedded files (choose content type
+                           with -x, -h, -t or -m; default is -x)
+    -l  or --language      Output only language
+    -d  or --detect        Detect document type
+           --digest=X      Include digest X (md2, md5, sha1,
+                               sha256, sha384, sha512)
+    -eX or --encoding=X    Use output encoding X
+    -pX or --password=X    Use document password X
+    -z  or --extract       Extract all attachments into current directory
+    --extract-dir=<dir>    Specify target directory for -z
+    -r  or --pretty-print  For JSON, XML and XHTML outputs, adds newlines and
+                           whitespace, for better readability
+
+    --list-parsers
+         List the available document parsers
+    --list-parser-details
+         List the available document parsers and their supported mime types
+    --list-parser-details-apt
+         List the available document parsers and their supported mime types in
+         apt format.
+    --list-detectors
+         List the available document detectors
+    --list-met-models
+         List the available metadata models, and their supported keys
+    --list-supported-types
+         List all known media types and related information
+
+
+    --compare-file-magic=<dir>
+         Compares Tika's known media types to the File(1) tool's magic
+         directory
+
+Description:
+    Apache Tika will parse the file(s) specified on the
+    command line and output the extracted text content
+    or metadata to standard output.
+
+    Instead of a file name you can also specify the URL
+    of a document to be parsed.
+
+    If no file name or URL is specified (or the special
+    name "-" is used), then the standard input stream
+    is parsed. If no arguments were given and no input
+    data is available, the GUI is started instead.
+
+- GUI mode
+
+    Use the "--gui" (or "-g") option to start the
+    Apache Tika GUI. You can drag and drop files from
+    a normal file explorer to the GUI window to extract
+    text content and metadata from the files.
+
+- Batch mode
+
+    Simplest method.
+    Specify two directories as args with no other args:
+         java -jar tika-app.jar <inputDirectory> <outputDirectory>
+
+
+Batch Options:
+    -i  or --inputDir          Input directory
+    -o  or --outputDir         Output directory
+    -numConsumers              Number of processing threads
+    -bc                        Batch config file
+    -maxRestarts               Maximum number of times the
+                               watchdog process will restart the child process.
+    -timeoutThresholdMillis    Number of milliseconds allowed to a parse
+                               before the process is killed and restarted
+    -fileList                  List of files to process, with
+                               paths relative to the input directory
+    -includeFilePat            Regular expression to determine which
+                               files to process, e.g. "(?i)\.pdf"
+    -excludeFilePat            Regular expression to determine which
+                               files to avoid processing, e.g. "(?i)\.pdf"
+    -maxFileSizeBytes          Skip files longer than this value
+
+    Control the type of output with -x, -h, -t and/or -J.
+
+    To modify child process jvm args, prepend "J" as in:
+    -JXmx4g or -JDlog4j.configuration=file:log4j.xml.
+
+---
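+
+ The batch options above can be assembled programmatically. The following
+ is a minimal sketch, not part of Tika: the class <<<TikaBatchCommand>>> and
+ its methods are hypothetical helpers, and the jar name and directory names
+ are placeholders. It assumes that options taking a value are passed as two
+ separate arguments, and it applies the "prepend J" rule for child process
+ JVM args described in the usage text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Tika) that assembles a tika-app batch-mode command line. */
public class TikaBatchCommand {

    /** Turns a normal JVM argument such as "-Xmx4g" into the child-process form "-JXmx4g". */
    public static String toChildJvmArg(String jvmArg) {
        if (jvmArg.length() < 2 || !jvmArg.startsWith("-")) {
            throw new IllegalArgumentException("Expected a JVM argument like -Xmx4g: " + jvmArg);
        }
        // Insert "J" directly after the leading dash.
        return "-J" + jvmArg.substring(1);
    }

    /** Builds the argument list; nothing is executed here. */
    public static List<String> build(String jar, String inputDir, String outputDir,
                                     int numConsumers, List<String> childJvmArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jar);
        cmd.add("-i");
        cmd.add(inputDir);
        cmd.add("-o");
        cmd.add(outputDir);
        // Assumption: the value follows the option as a separate argument.
        cmd.add("-numConsumers");
        cmd.add(Integer.toString(numConsumers));
        cmd.add("-t"); // plain text output, one of -x, -h, -t, -J
        for (String arg : childJvmArgs) {
            cmd.add(toChildJvmArg(arg));
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("tika-app.jar", "input", "output", 4, List.of("-Xmx4g"));
        System.out.println(String.join(" ", cmd));
    }
}
```

+ The resulting list can be handed to <<<java.lang.ProcessBuilder>>> as is.
+ Building the arguments as a list rather than a single string avoids shell
+ quoting problems with directory names that contain spaces.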
+
+ You can also use the jar as a component in a Unix pipeline or
+ as an external tool in many scripting languages.
+
+---
+# Check if an Internet resource contains a specific keyword
+curl http://.../document.doc \
+  | java -jar tika-app.jar --text \
+  | grep -q keyword
+---
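+
+ The same keyword check can be driven from a JVM program that runs the jar
+ in a separate process. This is only a sketch under stated assumptions: the
+ class <<<TikaKeywordCheck>>> and its methods are hypothetical, the jar path
+ is a placeholder, and <<<java>>> is assumed to be on the PATH. It uses the
+ <<<--text>>> and <<<--encoding=X>>> options from the usage text so that the
+ output can be decoded as UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical helper (not part of Tika) mirroring the shell pipeline shown earlier. */
public class TikaKeywordCheck {

    /** Command line that prints the plain text of a file or URL as UTF-8. */
    public static List<String> textCommand(String jar, String fileOrUrl) {
        return List.of("java", "-jar", jar, "--text", "--encoding=UTF-8", fileOrUrl);
    }

    /** Returns true if any line of the text contains the keyword, like grep -q. */
    public static boolean containsKeyword(Reader text, String keyword) {
        boolean found = false;
        try (BufferedReader reader = new BufferedReader(text)) {
            String line;
            // Keep reading to the end so a child process never blocks on a full pipe.
            while ((line = reader.readLine()) != null) {
                if (line.contains(keyword)) {
                    found = true;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    /** Runs tika-app in a child process and scans its standard output. */
    public static boolean check(String jar, String fileOrUrl, String keyword)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(textCommand(jar, fileOrUrl))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        boolean found = containsKeyword(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8), keyword);
        process.waitFor();
        return found;
    }

    public static void main(String[] args) {
        // Exercises only the matching logic, so no jar is needed to run this.
        System.out.println(containsKeyword(new StringReader("Apache Tika\nextracts text"), "Tika"));
    }
}
```

+ Within a Java application it is usually simpler to add the Maven
+ dependencies described earlier; a child process is mainly useful when the
+ parse should be isolated from the calling JVM.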
+
+Wrappers
+
+  Several wrappers are available to use Tika in another programming language, 
+  such as {{{https://github.com/aviks/Taro.jl}Julia}} or
+  {{{https://github.com/chrismattmann/tika-python}Python}}.

Added: tika/site/src/site/apt/2.6.0/index.apt
URL: 
http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/index.apt?rev=1905121&view=auto
==============================================================================
--- tika/site/src/site/apt/2.6.0/index.apt (added)
+++ tika/site/src/site/apt/2.6.0/index.apt Mon Nov  7 11:40:42 2022
@@ -0,0 +1,53 @@
+                     -----------------
+                     Apache Tika 2.6.0
+                     -----------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+
+Apache Tika 2.6.0
+
+        The most notable changes in Tika 2.6.0 over the previous release are:
+
+        * Add optional Siegfried detector
+          ({{{http://issues.apache.org/jira/browse/TIKA-3901}TIKA-3901}}).
+
+        * Move OverrideDetector's functionality to the CompositeDetector
+          ({{{http://issues.apache.org/jira/browse/TIKA-3904}TIKA-3904}}).
+
+        * The FileCommandDetector has been refactored to have the same
+          behavior as the Siegfried detector; see setUseMime in the javadoc
+          ({{{http://issues.apache.org/jira/browse/TIKA-3902}TIKA-3902}}).
+
+        * Fix bug in OpenSearch emitter that prevented upserts on documents
+          with embedded files
+          ({{{http://issues.apache.org/jira/browse/TIKA-3882}TIKA-3882}}).
+
+        * Extract PDF actions and triggers into the file's metadata
+          ({{{http://issues.apache.org/jira/browse/TIKA-3887}TIKA-3887}}).
+
+        * Add a tika-async-cli module
+          ({{{http://issues.apache.org/jira/browse/TIKA-3885}TIKA-3885}}).
+
+
+  The following people have contributed to Tika 2.6.0 by submitting or
+           commenting on the issues resolved in this release:
+
+     * Dave Meikle
+
+     * Ethan Wilansky
+
+     * Luca Perico
+
+     * Tilman Hausherr
+
+     * Tim Allison
+
+     * Tong Wang
+
+   See {{https://s.apache.org/zrcax}} for more details on these contributions.
