Author: tallison Date: Mon Nov 7 11:40:42 2022 New Revision: 1905121 URL: http://svn.apache.org/viewvc?rev=1905121&view=rev Log: Update website for 2.6.0 release
Added: tika/site/src/site/apt/2.6.0/ tika/site/src/site/apt/2.6.0/configuring.apt tika/site/src/site/apt/2.6.0/detection.apt tika/site/src/site/apt/2.6.0/examples.apt tika/site/src/site/apt/2.6.0/formats.apt tika/site/src/site/apt/2.6.0/gettingstarted.apt tika/site/src/site/apt/2.6.0/index.apt tika/site/src/site/apt/2.6.0/parser.apt tika/site/src/site/apt/2.6.0/parser_guide.apt Modified: tika/site/pom.xml tika/site/src/site/apt/index.apt.vm tika/site/src/site/resources/doap.rdf tika/site/src/site/site.xml Modified: tika/site/pom.xml URL: http://svn.apache.org/viewvc/tika/site/pom.xml?rev=1905121&r1=1905120&r2=1905121&view=diff ============================================================================== --- tika/site/pom.xml (original) +++ tika/site/pom.xml Mon Nov 7 11:40:42 2022 @@ -28,7 +28,7 @@ <parent> <groupId>org.apache.tika</groupId> <artifactId>tika-parent</artifactId> - <version>2.5.0</version> + <version>2.6.0</version> </parent> <artifactId>tika-site</artifactId> Added: tika/site/src/site/apt/2.6.0/configuring.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/configuring.apt?rev=1905121&view=auto ============================================================================== --- tika/site/src/site/apt/2.6.0/configuring.apt (added) +++ tika/site/src/site/apt/2.6.0/configuring.apt Mon Nov 7 11:40:42 2022 @@ -0,0 +1,223 @@ + ---------------- + Configuring Tika + ---------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. 
 You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Configuring Tika + + Out of the box, Apache Tika will attempt to start with all available + Detectors and Parsers, running with sensible defaults. For most users, + this default configuration will work well. + + This page gives you information on how to configure the various + components of Apache Tika, such as Parsers and Detectors, if you need + fine-grained control over ordering, exclusions and the like. + +%{toc|section=1|fromDepth=1} + +* {Configuring Parsers} + + Through the Tika Config xml, it is possible to have a high degree of control + over which parsers are or aren't used, in what order of preferences etc. It + is also possible to override just certain parts, to (for example) have "default + except for PDF". + + Currently, it is only possible to have a single parser run against a document. + There is on-going discussion around fallback parsers and combining the output + of multiple parsers running on a document, but none of these are available yet. + + To override certain default parser behaviours, include the <<< DefaultParser >>> + in your configuration, with excludes, then add other parser definitions. + To prevent the <<< DefaultParser >>> (with its auto-discovery) being used, + simply omit it from your config, and list all other parsers you want instead. 
+ + To override just some default behaviour, you can use a Tika Config something + like this: + +--- +<?xml version="1.0" encoding="UTF-8"?> +<properties> + <parsers> + <!-- Default Parser for most things, except for 2 mime types, and never + use the Executable Parser --> + <parser class="org.apache.tika.parser.DefaultParser"> + <mime-exclude>image/jpeg</mime-exclude> + <mime-exclude>application/pdf</mime-exclude> + <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/> + </parser> + <!-- Use a different parser for PDF --> + <parser class="org.apache.tika.parser.EmptyParser"> + <mime>application/pdf</mime> + </parser> + </parsers> +</properties> +--- + + To configure things in code, the key classes to use to build up your own custom + parser hierarchy are + {{{./api/org/apache/tika/parser/DefaultParser.html}org.apache.tika.parser.DefaultParser}}, + {{{./api/org/apache/tika/parser/CompositeParser.html}org.apache.tika.parser.CompositeParser}} + and + {{{./api/org/apache/tika/parser/ParserDecorator.html}org.apache.tika.parser.ParserDecorator}}. + +* {Configuring Detectors} + + Through the Tika Config xml, it is possible to have a high degree of control + over which detectors are or aren't used, in what order of preferences etc. It + is also possible to override just certain parts, to (for example) have "default + except for no POIFS Container Detection". + + To override certain default detector behaviours, include the + <<< DefaultDetector >>>, with any <<< detector-exclude >>> entries you need, + in your configuration, then add other detector definitions. To prevent + the <<< DefaultDetector >>> (with its auto-discovery) being used, simply omit it + from your config, and list all other detectors you want instead. 
+ + To override just some default behaviour, you can use a Tika Config something + like this: + +--- +<?xml version="1.0" encoding="UTF-8"?> +<properties> + <detectors> + <!-- All detectors except built-in container ones --> + <detector class="org.apache.tika.detect.DefaultDetector"> + <detector-exclude class="org.apache.tika.parser.pkg.ZipContainerDetector"/> + <detector-exclude class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/> + </detector> + </detectors> +</properties> +--- + + Or, to use only certain detectors, you can use a Tika Config something + like this: + +--- +<?xml version="1.0" encoding="UTF-8"?> +<properties> + <detectors> + <!-- Only use these two detectors, and ignore all others --> + <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/> + <detector class="org.apache.tika.mime.MimeTypes"/> + </detectors> +</properties> +--- + + In code, the key classes to use to build up your own custom detector + hierarchy are + {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}} + and + {{{./api/org/apache/tika/detect/CompositeDetector.html}org.apache.tika.detect.CompositeDetector}}. + +* {Configuring Mime Types} + + TODO Mention non-standard paths, and custom mime type files + +* {Configuring Language Identifiers} + + At this time, there is no unified way to configure language identifiers. + While the work on that is ongoing, for now you will need to review the + {{{./api/}Tika Javadocs}} to see how individual identifiers are configured. + +* {Configuring Translators} + + At this time, there is no unified way to configure Translators. + While the work on that is ongoing, for now you will need to review the + {{{./api/}Tika Javadocs}} to see how individual Translators are configured. 
+ +~~ When Translators can have their parameters configured, mention here about +~~ specifying which single one to use in the Tika Config XML + +* {Configuring the Service Loader} + + Tika has a number of service provider types such as parsers, detectors, and translators. + The {{{./api/org/apache/tika/config/ServiceLoader.html}org.apache.tika.config.ServiceLoader}} class provides a registry of each type of provider. This allows Tika to create + implementations such as {{{./api/org/apache/tika/parser/DefaultParser.html}org.apache.tika.parser.DefaultParser}}, + {{{./api/org/apache/tika/language/translate/DefaultTranslator.html}org.apache.tika.language.translate.DefaultTranslator}}, and {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}} + that can match the appropriate provider to an incoming piece of content. + + The ServiceLoader's registry can be populated either statically or dynamically. + +** Static + + Static loading is the default and requires no configuration. This configuration option is used in + Tika deployments where the Tika JAR files reside together in the same classloader hierarchy. The service + providers are loaded from provider configuration files located within the tika-parsers JAR file at META-INF/services. + +** Dynamic + + Dynamic loading may be required if the Tika service providers will reside in different classloaders such as + in OSGi. To allow a provider created in tika-config.xml to utilize dynamically loaded services you need to + configure the ServiceLoader to be dynamic with the following configuration: + +--- +<properties> + <service-loader dynamic="true"/> + .... +</properties> +--- + +** Load Error Handling + + The ServiceLoader can contain a handler to deal with errors that occur during provider initialization. For example, + if a class fails to initialize, the LoadErrorHandler deals with the exception that is thrown. 
+ This handler can be configured to: + + * <<< IGNORE >>> - (Default) Do nothing when providers fail to initialize. + + * <<< WARN >>> - Log a warning when providers fail to initialize. + + * <<< THROW >>> - Throw an exception when providers fail to initialize. + + [] + + For example, to set the LoadErrorHandler to WARN, use the following configuration: + +--- +<properties> + <service-loader loadErrorHandler="WARN"/> + .... +</properties> +--- + +* {Using a Tika Configuration XML file} + + However you call Tika, the System Property of <<< tika.config >>> is + checked first, and the Environment Variable of <<< TIKA_CONFIG >>> is + tried next. Setting one of those will cause Tika to use your given + Tika Config XML file. + + If you are calling Tika from your own code, then you can pass in the + location of your Tika Config XML file when you construct your + <<<TikaConfig>>> instance. From that, you can fetch your configured + parser, detectors etc. + +--- +TikaConfig config = new TikaConfig("/path/to/tika-config.xml"); +Detector detector = config.getDetector(); +Parser autoDetectParser = new AutoDetectParser(config); +--- + + For users of the Tika App, in addition to the system property and the + environment variable, you can also use the + <<< --config=[tika-config.xml] >>> option to select a different + Tika Config XML file to use. + + For users of the Tika Server, in addition to the system property and the + environment variable, you can also use the <<< -c [tika-config.xml] >>> or + <<< --config [tika-config.xml] >>> options to select a different + Tika Config XML file to use. Added: tika/site/src/site/apt/2.6.0/detection.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/detection.apt?rev=1905121&view=auto ============================================================================== --- tika/site/src/site/apt/2.6.0/detection.apt (added) +++ tika/site/src/site/apt/2.6.0/detection.apt Mon Nov 7 11:40:42 2022 @@ -0,0 +1,223 @@ + ----------------- + 
Content Detection + ----------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Content Detection + + This page gives you information on how content and language detection + works with Apache Tika, and how to tune the behaviour of Tika. + +%{toc|section=1|fromDepth=1} + +* {The Detector Interface} + + The + {{{./api/org/apache/tika/detect/Detector.html}org.apache.tika.detect.Detector}} + interface is the basis for most of the content type detection in Apache + Tika. All the different ways of detecting content all implement the + same common method: + +--- +MediaType detect(java.io.InputStream input, + Metadata metadata) throws java.io.IOException +--- + + The <<<detect>>> method takes the stream to inspect, and a + <<<Metadata>>> object that holds any additional information on + the content. The detector will return a + {{{./api/org/apache/tika/mime/MediaType.html}MediaType}} object describing + its best guess as to the type of the file. + + In general, three keys on the Metadata object are used by Detectors. 
+ These are <<<TikaCoreProperties.RESOURCE_NAME_KEY>>> which should hold the name + of the file (where known), and <<<Metadata.CONTENT_TYPE>>> which should + hold the advertised content type of the file (eg from a webserver or + a content repository). Users may override automatic detection with the + <<<TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE>>> key. + + +* {Mime Magic Detection} + + By looking for special ("magic") patterns of bytes near the start of + the file, it is often possible to detect the type of the file. For + some file types, this is a simple process. For others, typically + container based formats, the magic detection may not be enough. (More + detail on detecting container formats below.) + + Tika is able to make use of a mime magic info file, in the + {{{http://www.freedesktop.org/standards/shared-mime-info}Freedesktop MIME-info}} + format to perform mime magic detection. (Note that Tika supports a few + more match types than Freedesktop does.) + + This is provided within Tika by + {{{./api/org/apache/tika/detect/MagicDetector.html}org.apache.tika.detect.MagicDetector}}. It is most commonly accessed via + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}, + normally sourced from the <<<tika-mimetypes.xml>>> and <<<custom-mimetypes.xml>>> + files. For more information on defining your own custom mimetypes, see + {{{./parser_guide.html#Add_your_MIME-Type}the new parser guide}}. + + +* {Resource Name Based Detection} + + Where the name of the file is known, it is sometimes possible to guess + the file type from the name or extension. Within the + <<<tika-mimetypes.xml>>> file is a list of patterns which are used to + identify the type from the filename. + + However, because files may be renamed, this method of detection is quick + but not always as accurate. + + This is provided within Tika by + {{{./api/org/apache/tika/detect/NameDetector.html}org.apache.tika.detect.NameDetector}}. 
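 The idea behind name based detection can be illustrated with a minimal, JDK-only sketch. This is not Tika's implementation: the class <<<NameBasedGuess>>>, its <<<guess>>> method and the small extension table are hypothetical names chosen for illustration, whereas Tika takes its patterns from <<<tika-mimetypes.xml>>>.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Tika): guesses a media type from a file name alone. */
final class NameBasedGuess {
    // Illustrative table only; Tika's real filename patterns live in tika-mimetypes.xml.
    private static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    static {
        BY_EXTENSION.put(".pdf", "application/pdf");
        BY_EXTENSION.put(".csv", "text/csv");
        BY_EXTENSION.put(".txt", "text/plain");
        BY_EXTENSION.put(".html", "text/html");
    }

    /** Returns the type mapped to the file extension, or a generic fallback. */
    static String guess(String fileName) {
        if (fileName == null) {
            return "application/octet-stream";
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : BY_EXTENSION.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(guess("report.PDF"));
        System.out.println(guess("notes"));
    }
}
```

 Note that the sketch never looks at the bytes of the file, so a renamed file gets whatever type its new name suggests. This is the accuracy trade-off described above.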
+ + +* {Known Content Type "Detection"} + + Sometimes, the mime type for a file is already known, such as when + downloading from a webserver, or when retrieving from a content store. + This information can be used by detectors, such as + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}. + + +* {The default Mime Types Detector} + + By default, the mime type detection in Tika is provided by + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}. + This detector makes use of <<<tika-mimetypes.xml>>> to power + magic based and filename based detection. + + Firstly, magic based detection is used on the start of the file. + If the file is an XML file, then the start of the XML is processed + to look for root elements. Next, if available, the filename + (from <<<TikaCoreProperties.RESOURCE_NAME_KEY>>>) is + then used to improve the detail of the detection, such as when magic + detects a text file, and the filename hints it's really a CSV. Finally, + if available, the supplied content type (from <<<Metadata.CONTENT_TYPE>>>) + is used to further refine the type. + + +* {Container Aware Detection} + + Several common file formats are actually held within a common container + format. One example is the PowerPoint .ppt and Word .doc formats, which + are both held within an OLE2 container. Another is Apple iWork formats, + which are actually a series of XML files within a Zip file. + + Using magic detection, it is easy to spot that a given file is an OLE2 + document, or a Zip file. Using magic detection alone, it is very difficult + (and often impossible) to tell what kind of file lives inside the container. + + For some use cases, speed is important, so having a quick way to know the + container type is sufficient. For other cases however, you don't mind + spending a bit of time (and memory!) processing the container to get a + more accurate answer on its contents. 
 For these cases, the additional + container aware detectors contained in the <<<Tika Parsers>>> jar should + be used. + + Tika provides a wrapping detector in the form of + {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}}. + This uses the service loader to discover all available detectors, including + any available container aware ones, and tries them in turn. For container + aware detection, include the <<<Tika Parsers>>> jar and its dependencies + in your project, then use DefaultDetector along with a <<<TikaInputStream>>>. + + Because these container detectors need to read the whole file to open and + inspect the container, they must be used with a + {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}. + If called with a regular <<<InputStream>>>, then all work will be done + by the default Mime Magic detection only. + + For more information on container formats and Tika, see + {{{http://wiki.apache.org/tika/MetadataDiscussion}}} + + +* {The default Tika Detector} + + Just as with Parsers, Tika provides a special detector + {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}} + which auto-detects (based on service files) the available detectors at + runtime, and tries these in turn to identify the file type. + + If only <<<Tika Core>>> is available, the Default Detector will work only + with Mime Magic and Resource Name detection. However, if <<<Tika Parsers>>> + (and its dependencies!) are available, additional detectors which know about + containers (such as zip and ole2) will be used as appropriate, provided that + detection is being performed with a + {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}. + Custom detectors can also be used as desired; they simply need to be listed + in a service file much as is done for + {{{./parser_guide.html#List_the_new_parser}custom parsers}}. 
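 To see why Mime Magic detection alone stops at the container, consider the following JDK-only sketch. It is an illustration rather than Tika's <<<MagicDetector>>>: the class <<<MagicSniffSketch>>> and its <<<sniff>>> method are hypothetical, only three well-known signatures are checked, and the returned type strings are illustrative labels.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Tika): matches a few magic byte prefixes. */
final class MagicSniffSketch {
    // Zip local file header signature: "PK" followed by 0x03 0x04.
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    // OLE2 compound document signature.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // PDF files start with "%PDF-".
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns a type label based only on the first bytes of the content. */
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        if (startsWith(head, OLE2)) {
            return "application/x-tika-msoffice";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

 Any Zip based format starts with the same four bytes, so a sketch like this gives the same answer for an iWork file as for a plain Zip archive. Telling them apart means opening the container, which is the job of the container aware detectors described above.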
+ + +* {Ways of triggering Detection} + + The simplest way to detect is through the + {{{./api/org/apache/tika/Tika.html}Tika Facade class}}, which provides methods to + detect based on + {{{./api/org/apache/tika/Tika.html##detect(java.io.File)}File}}, + {{{./api/org/apache/tika/Tika.html##detect(java.io.InputStream)}InputStream}}, + {{{./api/org/apache/tika/Tika.html##detect(java.io.InputStream, java.lang.String)}InputStream and Filename}}, + {{{./api/org/apache/tika/Tika.html##detect(java.lang.String)}Filename}} or a few others. + It works best with a File or + {{{./api/org/apache/tika/io/TikaInputStream.html}TikaInputStream}}. + + Alternately, detection can be performed on a specific Detector, or using + <<<DefaultDetector>>> to have all available Detectors used. A typical pattern + would be something like: + +--- +TikaConfig tika = new TikaConfig(); + +for (File f : myListOfFiles) { + Metadata metadata = new Metadata(); + //TikaInputStream sets the TikaCoreProperties.RESOURCE_NAME_KEY + //when initialized with a file or path + //detect() returns a MediaType, not a String + MediaType mimetype = tika.getDetector().detect( + TikaInputStream.get(f, metadata), metadata); + System.out.println("File " + f + " is " + mimetype); +} +for (InputStream is : myListOfStreams) { + Metadata metadata = new Metadata(); + //if you know the file name, it is a good idea to + //set it in the metadata, e.g. + //metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "somefile.pdf"); + MediaType mimetype = tika.getDetector().detect( + TikaInputStream.get(is), metadata); + System.out.println("Stream " + is + " is " + mimetype); +} +--- + +* {Language Detection} + + Tika is able to help identify the language of a piece of text, which + is useful when extracting text from document formats which do not include + language information in their metadata. + + The language detection is provided by extensions of the + {{{./api/org/apache/tika/language/detect/LanguageDetector.html}org.apache.tika.language.detect.LanguageDetector}}. 
+ This provides choice for developers looking to compare and contrast differing + language detection implementations. + + Some Java code examples of language detection can be found at {{{https://github.com/apache/tika/blob/main/tika-example/src/main/java/org/apache/tika/example/LanguageDetectorExample.java}LanguageDetectorExample.java}}, + {{{https://github.com/apache/tika/blob/main/tika-example/src/main/java/org/apache/tika/example/LanguageDetectingParser.java}LanguageDetectingParser.java}} + and {{{https://github.com/apache/tika/blob/main/tika-example/src/main/java/org/apache/tika/example/Language.java}Language.java}}. + +* {More Examples} + + For more examples of Detection using Apache Tika, please take a look at + the {{{./examples.html}Tika Examples page}}. Added: tika/site/src/site/apt/2.6.0/examples.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/examples.apt?rev=1905121&view=auto ============================================================================== --- tika/site/src/site/apt/2.6.0/examples.apt (added) +++ tika/site/src/site/apt/2.6.0/examples.apt Mon Nov 7 11:40:42 2022 @@ -0,0 +1,148 @@ + ----------------------- + Tika API Usage Examples + ----------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Apache Tika API Usage Examples + + This page provides a number of examples on how to use the various + Tika APIs. All of the examples shown are also available in the + {{{https://svn.apache.org/repos/asf/tika/trunk/tika-example}Tika Example + module}} in SVN. + +%{toc|section=1|fromDepth=1} + + +* {Parsing} + + Tika provides a number of different ways to parse a file. These provide + different levels of control, flexibility, and complexity. + +** {Parsing using the Tika Facade} + + The {{{./api/org/apache/tika/Tika.html}Tika facade}} + provides a number of very quick and easy ways to have your content + parsed by Tika, and return the resulting plain text. + +%{include|source=src/examples-src/main/java/org/apache/tika/example/ParsingExample.java|snippet=aj:..parseToStringExample()|show-gutter=false} + +** {Parsing using the Auto-Detect Parser} + + For more control, you can call the + {{{./api/org/apache/tika/parser/Parser.html}Tika Parsers}} + directly. Most likely, you'll want to start out using the + {{{./api/org/apache/tika/parser/AutoDetectParser.html}Auto-Detect Parser}}, + which automatically figures out what kind of content you have, then calls the appropriate + parser for you. + +%{include|source=src/examples-src/main/java/org/apache/tika/example/ParsingExample.java|snippet=aj:..parseExample()|show-gutter=false} + + +* {Picking different output formats} + + With Tika, you can get the textual content of your files returned + in a number of different formats. These can be plain text, html, xhtml, + xhtml of one part of the file etc. This is controlled based on the + {{{http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}} + you supply to the Parser. 
+ +** {Parsing to Plain Text} + + By using the + {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}, + you can request that Tika return only the content of the document's body as + a plain-text string. + +%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToPlainText()|show-gutter=false} + +** {Parsing to XHTML} + + By using the + {{{./api/org/apache/tika/sax/ToXMLContentHandler.html}ToXMLContentHandler}}, + you can get the XHTML content of the whole document as a string. + +%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToHTML()|show-gutter=false} + + If you just want the body of the xhtml document, without the header, you + can chain together a + {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}} + and a {{{./api/org/apache/tika/sax/ToXMLContentHandler.html}ToXMLContentHandler}} + as shown: + +%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseBodyToHTML()|show-gutter=false} + +** {Fetching just certain bits of the XHTML} + + It is possible to execute XPath queries on the parse results, to fetch + only certain bits of the XHTML. + +%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseOnePartToHTML()|show-gutter=false} + + +* {Custom Content Handlers} + + The textual output of parsing a file with Tika is returned via the SAX + {{{http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}} + you pass to the parse method. It is possible to customise your parsing by supplying your + own ContentHandler which does special things. 
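 As a minimal sketch of what "your own ContentHandler" can look like, the class below extends the JDK's <<<DefaultHandler>>> and upper-cases every piece of text it receives. The class <<<UpperCaseTextHandler>>> and its <<<run>>> helper are hypothetical and are not taken from the Tika examples. To keep the sketch free of dependencies, the handler is driven here by the JDK's own SAX parser over a small XHTML string; with Tika you would instead pass the handler to the parse method of a Parser.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical custom handler (not part of Tika): collects text in upper case. */
final class UpperCaseTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several chunks, so append each one as it arrives.
        text.append(new String(ch, start, length).toUpperCase(Locale.ROOT));
    }

    String text() {
        return text.toString();
    }

    /** Drives the handler with the JDK SAX parser, standing in for a Tika parser. */
    static String run(String xhtml) {
        UpperCaseTextHandler handler = new UpperCaseTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.text();
    }
}
```

 Because the handler only overrides <<<characters>>>, all element structure is ignored and only the text survives.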
+ +** {Extract Phone Numbers from Content into the Metadata} + + By using the + {{{./api/org/apache/tika/sax/PhoneExtractingContentHandler.html}PhoneExtractingContentHandler}}, + you can have any phone numbers found in the textual content of the document extracted and placed + into the Metadata object for you. + +%{include|source=src/examples-src/main/java/org/apache/tika/example/GrabPhoneNumbersExample.java|snippet=aj:..process(..File)|show-gutter=false} + +** {Streaming the plain text in chunks} + + Sometimes, you want to chunk the resulting text up, perhaps to output + as you go minimising memory use, perhaps to output to HDFS files, or + any other reason! With a small custom content handler, you can do that. + +%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToPlainTextChunks()|show-gutter=false} + + +* {Translation} + + Tika provides a pluggable Translation system, which allows you to send the results of + parsing off to an external system or program to have the text translated into another + language. + +** {Translation using the Microsoft Translation API} + + In order to use the Microsoft Translation API, you need to sign up for a Microsoft account, + get an API key, then pass the key to Tika before translating. + +%{include|source=src/examples-src/main/java/org/apache/tika/example/TranslatorExample.java|snippet=aj:..microsoftTranslateToFrench(..String)|show-gutter=false} + + +* {Language Identification} + + Tika provides support for identifying the language of text, through the + {{{./api/org/apache/tika/language/LanguageIdentifier.html}LanguageIdentifier}} class. + +%{include|source=src/examples-src/main/java/org/apache/tika/example/LanguageIdentifierExample.java|snippet=aj:..identifyLanguage(..String)|show-gutter=false} + +* {Additional Examples} + + A number of other examples are also available, including all of the examples + from the {{{http://manning.com/mattmann/}Tika In Action book}}. 
These can all + be found in the + {{{https://svn.apache.org/repos/asf/tika/trunk/tika-example}Tika Example + module}} in SVN. Added: tika/site/src/site/apt/2.6.0/formats.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/formats.apt?rev=1905121&view=auto ============================================================================== --- tika/site/src/site/apt/2.6.0/formats.apt (added) +++ tika/site/src/site/apt/2.6.0/formats.apt Mon Nov 7 11:40:42 2022 @@ -0,0 +1,1066 @@ + -------------------------- + Supported Document Formats + -------------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Supported Document Formats + + This page lists all the document formats supported by the parsers in + Apache Tika 2.6.0. Follow the links to the various parser class javadocs + for more detailed information about each document format and how it is + parsed by Tika. + + <<Please note>> that Apache Tika is able to detect a much wider range of + formats than those listed below, this page only documents those formats + from which Tika is able to extract metadata and/or textual content. 
+ +%{toc|fromDepth=1} + +* {HyperText Markup Language} + + The HyperText Markup Language (HTML) is the lingua franca of the web. + Tika uses the {{{http://home.ccil.org/~cowan/XML/tagsoup/}TagSoup}} + library to support virtually any kind of HTML found on the web. + The output from the + {{{./api/org/apache/tika/parser/html/HtmlParser.html}HtmlParser}} class + is guaranteed to be well-formed and valid XHTML, and various heuristics + are used to prevent things like inline scripts from cluttering the + extracted text content. + +* {XML and derived formats} + + The Extensible Markup Language (XML) format is a generic format that can + be used for all kinds of content. Tika has custom parsers for some widely + used XML vocabularies like XHTML, OOXML and ODF, but the default + {{{./api/org/apache/tika/parser/xml/DcXMLParser.html}DcXMLParser}} + class simply extracts the text content of the document and ignores any XML + structure. The only exception to this rule are Dublin Core metadata + elements that are used for the document metadata. + +* {Microsoft Office document formats} + + Microsoft Office and some related applications produce documents in the + generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The + older OLE 2 format was introduced in Microsoft Office version 97 and was + the default format until Office version 2007 and the new XML-based + OOXML format. The + {{{./api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}} + and + {{{./api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html}OOXMLParser}} + classes use {{{http://poi.apache.org/}Apache POI}} libraries to support + text and metadata extraction from both OLE2 and OOXML documents. + + Old, pre-OLE2 Excel files (Excel 2, 3 and 4) are handled by the + {{{./api/org/apache/tika/parser/microsoft/OldExcelParser.html}OldExcelParser}}. 
+ + The older, pre-OOXML, pure-XML office file formats are handled by + {{{./api/org/apache/tika/parser/microsoft/xml/SpreadsheetMLParser.html}SpreadsheetMLParser}}, + {{{./api/org/apache/tika/parser/microsoft/xml/WordMLParser.html}WordMLParser}} + and + {{{./api/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParser.html}Word2006MLParser}}. + + Temporary Office lock files (owner files) are supported for basic metadata + extraction by + {{{./api/org/apache/tika/parser/microsoft/MSOwnerFileParser.html}MSOwnerFileParser}}. + +* {OpenDocument Format} + + The OpenDocument format (ODF) is used most notably as the default format + of the OpenOffice.org office suite. The + {{{./api/org/apache/tika/parser/odf/OpenDocumentParser.html}OpenDocumentParser}} + class supports this format and the earlier OpenOffice 1.0 format on which + ODF is based. + +* {iWorks document formats} + + The various iWorks document formats (Numbers, Pages, Keynote) are supported + by the + {{{./api/org/apache/tika/parser/iwork/IWorkPackageParser.html}IWorkPackageParser}} + class, which extracts text and metadata. + +* {WordPerfect document formats} + + The Corel WordPerfect Office Suite formats are supported by + {{{./api/org/apache/tika/parser/wordperfect/WordPerfectParser.html}WordPerfectParser}}, + supporting WordPerfect WP6+ files, and + {{{./api/org/apache/tika/parser/wordperfect/QuattroProParser.html}QuattroProParser}}, + supporting QuattroPro QPW v9+ files. + +* {Portable Document Format} + + The {{{./api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class + parses Portable Document Format (PDF) documents using the + {{{http://pdfbox.apache.org/}Apache PDFBox}} library. + +* {Electronic Publication Format} + + The {{{./api/org/apache/tika/parser/epub/EpubParser.html}EpubParser}} class + supports the Electronic Publication Format (EPUB) used for many digital + books.
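For background on what an EPUB file looks like on disk: it is a ZIP package whose first entry is named "mimetype" and contains the text application/epub+zip, the same media type listed for EpubParser further down this page. The sketch below checks that convention using only the JDK's java.util.zip classes. It is an illustration of the container layout, not a description of how EpubParser works internally, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper, not part of Tika. */
final class EpubMimetypeCheck {

    private EpubMimetypeCheck() {
    }

    /** True if the stream is a ZIP whose first entry is "mimetype" declaring application/epub+zip. */
    static boolean looksLikeEpub(InputStream in) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry first = zip.getNextEntry();
            if (first == null || !"mimetype".equals(first.getName())) {
                return false;
            }
            // Read the content of the first entry only.
            ByteArrayOutputStream content = new ByteArrayOutputStream();
            byte[] buffer = new byte[64];
            int read;
            while ((read = zip.read(buffer)) != -1) {
                content.write(buffer, 0, read);
            }
            String declared = new String(content.toByteArray(), StandardCharsets.US_ASCII).trim();
            return "application/epub+zip".equals(declared);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny in-memory ZIP with a single entry, for demonstration only. */
    static byte[] zipWithSingleEntry(String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content.getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] epubLike = zipWithSingleEntry("mimetype", "application/epub+zip");
        System.out.println(looksLikeEpub(new ByteArrayInputStream(epubLike)));
    }
}
```

The EPUB container rules also require the "mimetype" entry to be stored uncompressed. The sketch does not enforce that, and the demonstration archive it builds is compressed, so treat it as a loose check only.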
+ + The {{{./api/org/apache/tika/parser/xml/FictionBookParser.html}FictionBookParser}} class + supports the XML-based Fiction Book publishing format. + +* {Rich Text Format} + + The {{{./api/org/apache/tika/parser/microsoft/rtf/RTFParser.html}RTFParser}} class + uses the standard javax.swing.text.rtf feature to extract text content + from Rich Text Format (RTF) documents. + +* {Compression and packaging formats} + + Tika uses the {{{http://commons.apache.org/compress/}Commons Compress}} + library to support various compression and packaging formats. The + {{{./api/org/apache/tika/parser/pkg/CompressorParser.html}CompressorParser}} + class handles parsing of the top level compression formats, then the + {{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} + class and its subclasses parse the packaging formats and pass the + unpacked document streams to a second parsing stage using the parser + instance specified in the parse context. Formats supported include Tar, + AR, ARJ, CPIO, Dump, Zip, 7Zip, Gzip, BZip2, XZ, LZMA, Z and Pack200. + + Additionally, the + {{{./api/org/apache/tika/parser/pkg/RarParser.html}RarParser}} class + supports the RAR archive format, which isn't supported by Commons Compress. + + The + {{{./api/org/apache/tika/parser/apple/AppleSingleFileParser.html}AppleSingleFileParser}} + class supports resources packaged within AppleSingle and AppleDouble + files. + +* {Text formats} + + Extracting text content from plain text files seems like a simple task + until you start thinking of all the possible character encodings. The + {{{./api/org/apache/tika/parser/txt/TXTParser.html}TXTParser}} class uses + encoding detection code from the {{{http://site.icu-project.org/}ICU}} + project to automatically detect the character encoding of a text document. + +* {Feed and Syndication formats} + + The {{{./api/org/apache/tika/parser/feed/FeedParser.html}FeedParser}} class + supports the RSS and Atom feed syndication formats.
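To make the two syndication formats concrete, here is a minimal sketch that pulls entry titles out of a feed using only the JDK's DOM parser. RSS wraps each entry in an "item" element, while Atom uses "entry" elements in the Atom namespace. This is a deliberately simplified stand-in written for illustration: it is not FeedParser's implementation, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper, not part of Tika and not how FeedParser is implemented. */
final class FeedTitles {

    private FeedTitles() {
    }

    /** Returns the titles of all RSS item and Atom entry elements in the given feed XML. */
    static List<String> entryTitles(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Ask the parser to apply its secure-processing limits to untrusted input.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(feedXml)));
            List<String> titles = new ArrayList<>();
            collect(doc.getElementsByTagNameNS("*", "item"), titles);   // RSS entries
            collect(doc.getElementsByTagNameNS("*", "entry"), titles);  // Atom entries
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable feed", e);
        }
    }

    // Adds the text of the first direct <title> child of each entry element.
    private static void collect(NodeList entries, List<String> titles) {
        for (int i = 0; i < entries.getLength(); i++) {
            NodeList children = entries.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && "title".equals(child.getLocalName())) {
                    titles.add(child.getTextContent().trim());
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel><title>Example feed</title>"
                + "<item><title>First post</title></item>"
                + "<item><title>Second post</title></item></channel></rss>";
        System.out.println(entryTitles(rss));
    }
}
```

Only titles nested directly inside an entry are collected, so the feed-level title is skipped.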
+ + The {{{./api/org/apache/tika/parser/iptc/IptcAnpaParser.html}IptcAnpaParser}} class + supports the IPTC ANPA News Wire feed format. + +* {Help formats} + + The {{{./api/org/apache/tika/parser/microsoft/chm/ChmParser.html}ChmParser}} class + supports the CHM Help format. + +* {Audio formats} + + Tika can detect several common audio formats and extract metadata + from them. Even text extraction is supported for some audio files that + contain lyrics or other textual content. Extracted metadata includes + sampling rates, channels, format information, artists, titles etc. The + {{{./api/org/apache/tika/parser/audio/AudioParser.html}AudioParser}} + and {{{./api/org/apache/tika/parser/audio/MidiParser.html}MidiParser}} + classes use standard javax.sound features to process simple audio + formats. The + {{{./api/org/apache/tika/parser/mp3/Mp3Parser.html}Mp3Parser}} class + adds support for the widely used MP3 format, and the + {{{./api/org/apache/tika/parser/mp4/MP4Parser.html}MP4Parser}} class + provides it for MP4 audio. The Ogg family of audio formats (Vorbis, + Speex, Opus, Flac etc) is supported by the + {{{./api/org/gagravarr/tika/VorbisParser.html}VorbisParser}}, + {{{./api/org/gagravarr/tika/OpusParser.html}OpusParser}}, + {{{./api/org/gagravarr/tika/SpeexParser.html}SpeexParser}} and + {{{./api/org/gagravarr/tika/FlacParser.html}FlacParser}} + classes. + +* {Image formats} + + The {{{./api/org/apache/tika/parser/image/ImageParser.html}ImageParser}} + class uses the standard javax.imageio feature to extract simple metadata + from image formats supported by the Java platform, such as PNG, GIF + and BMP. More complex image metadata is available through the + {{{./api/org/apache/tika/parser/image/JpegParser.html}JpegParser}} and + {{{./api/org/apache/tika/parser/image/TiffParser.html}TiffParser}} classes, + which use the metadata-extractor library to support Exif metadata + extraction from JPEG and TIFF images.
The + {{{./api/org/apache/tika/parser/image/PSDParser.html}PSDParser}} class + extracts metadata from PSD images. The + {{{./api/org/apache/tika/parser/image/BPGParser.html}BPGParser}} class + extracts simple metadata from BPG (Better Portable Graphics) images. + The {{{./api/org/apache/tika/parser/image/WebPParser.html}WebPParser}} + class extracts simple metadata from the WebP image format. + The {{{./api/org/apache/tika/parser/image/ICNSParser.html}ICNSParser}} + class extracts simple metadata from the Apple ICNS icon image format. + + When extracting from images, it is also possible to chain in Tesseract, via + the {{{./api/org/apache/tika/parser/ocr/TesseractOCRParser.html}TesseractOCRParser}}, + to have OCR performed on the contents of the image. + + The {{{./api/org/apache/tika/parser/microsoft/WMFParser.html}WMFParser}} + class extracts simple text from Microsoft WMF drawings. + The {{{./api/org/apache/tika/parser/microsoft/EMFParser.html}EMFParser}} + class extracts simple text from Microsoft EMF drawings, and also + exposes any other embedded resources / files. + +* {Video formats} + + Tika supports the Flash video format using a simple parsing algorithm + implemented in the + {{{./api/org/apache/tika/parser/video/FLVParser}FLVParser}} class. + + The MP4 family of video formats (MP4, Quicktime, 3GPP etc) is supported + by the {{{./api/org/apache/tika/parser/mp4/MP4Parser}MP4Parser}} class, + which extracts metadata on the video, along with the audio stream + (if present). + + For the Ogg family of video formats, a limited amount of metadata is + extracted by the + {{{./api/org/gagravarr/tika/OggParser.html}OggParser}} class. There is + also an experimental + {{{./api/org/gagravarr/tika/TheoraParser.html}TheoraParser}} class which + extracts only limited metadata, pending a consensus on the "right" way + to return metadata for audio streams along with the video metadata.
+ + As an alternative to the metadata-focused parsers above, the + {{{./api/org/apache/tika/parser/pot/PooledTimeSeriesParser}PooledTimeSeriesParser}} + can be used (if the required tool is installed) to generate a numeric + representation of the video suitable for similarity searches. More details + on this approach, and setup instructions for the parser + tool, can be + found on {{{https://wiki.apache.org/tika/PooledTimeSeriesParser}the Tika + wiki page for the parser}}. + +* {Java class files and archives} + + The {{{./api/org/apache/tika/parser/asm/ClassParser}ClassParser}} class + extracts class names and method signatures from Java class files, and + the {{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} class + also supports jar archives. + +* {Source code} + + The {{{./api/org/apache/tika/parser/code/SourceCodeParser}SourceCodeParser}} class + handles a number of source code formats, including Java, C, C++ and Groovy. + It provides a formatted form of the code, along with some simple metadata. + +* {Mail formats} + + The {{{./api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can + extract email messages from the mbox format used by many email archives + and Unix-style mailboxes. + + The {{{./api/org/apache/tika/parser/mail/RFC822Parser.html}RFC822Parser}} can + process single email messages in the RFC 822 format used by many email clients + in their archives / exports. + + The {{{./api/org/apache/tika/parser/microsoft/pst/OutlookPSTParser.html}OutlookPSTParser}} can + extract email messages from the Microsoft Outlook PST email format. + + The {{{./api/org/apache/tika/parser/microsoft/OutlookExtractor.html}OutlookExtractor}} (part of + {{{./api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}}) + is able to extract email messages from the Microsoft Outlook MSG email + format.
+ + The {{{./api/org/apache/tika/parser/microsoft/TNEFParser.html}TNEFParser}} can + extract email attachments from the Microsoft TNEF (Transport Neutral Encapsulation + Format, aka Winmail.dat) used with some Microsoft email clients. + +* {CAD formats} + + The {{{./api/org/apache/tika/parser/dwg/DWGParser.html}DWGParser}} can + extract simple metadata from the DWG CAD format. + +* {Font formats} + + The {{{./api/org/apache/tika/parser/font/TrueTypeParser.html}TrueTypeParser}} + class can extract simple metadata from the TrueType font format. + The {{{./api/org/apache/tika/parser/font/AdobeFontMetricParser.html}AdobeFontMetricParser}} + class does something similar for Adobe Font Metrics files. + +* {Scientific formats} + + The {{{./api/org/apache/tika/parser/dif/DIFParser.html}DIFParser}} + is able to extract attribute metadata from the GCMD Directory + Interchange Format (DIF) scientific file format. + + The {{{./api/org/apache/tika/parser/gdal/GDALParser.html}GDALParser}} + is able to extract attribute metadata from the GDAL scientific file format. + + The {{{./api/org/apache/tika/parser/geoinfo/GeographicInformationParser.html}GeographicInformationParser}} + is able to extract attribute metadata from the ISO-19139 geographic + information file format. + + The {{{./api/org/apache/tika/parser/geo/topic/GeoParser.html}GeoParser}} + makes use of a pre-built geographic gazetteer to resolve geographic + entities to their positions, which it adds to the metadata. + + The {{{./api/org/apache/tika/parser/grib/GribParser.html}GribParser}} + is able to extract attribute metadata from the Grib scientific file format. + + The {{{./api/org/apache/tika/parser/hdf/HDFParser.html}HDFParser}} + is able to extract attribute metadata from the HDF scientific file format. + + The {{{./api/org/apache/tika/parser/isatab/ISArchiveParser.html}ISArchiveParser}} + is able to extract attribute metadata from the ISA-Tab (ISA Tools) family of + scientific file formats.
+ + The {{{./api/org/apache/tika/parser/netcdf/NetCDFParser.html}NetCDFParser}} + is able to extract attribute metadata from the NetCDF scientific file format. + + The {{{./api/org/apache/tika/parser/mat/MatParser.html}MatParser}} + is able to extract attribute metadata from the Matlab scientific file format. + +* {Executable programs and libraries} + + The {{{./api/org/apache/tika/parser/executable/ExecutableParser.html}ExecutableParser}} can + extract metadata information on platforms, architectures and types from a range + of executable formats and libraries, such as Windows Executables and Linux / BSD + programs and libraries. + +* {Crypto formats} + + The {{{./api/org/apache/tika/parser/crypto/Pkcs7Parser.html}Pkcs7Parser}} is able to + parse the contents of PKCS7 signed messages, but doesn't include any information from + the outer PKCS7 wrapper. + + The {{{./api/org/apache/tika/parser/crypto/TSDParser.html}TSDParser}} class + processes metadata from Time Stamped Data Envelope files, as well as exposing the + contents stored within the TSD wrapper. + +* {Database formats} + + The {{{./api/org/apache/tika/parser/jdbc/SQLite3Parser.html}SQLite3Parser}} is able to + extract content from SQLite3 files, in a tabular form. However, it requires that the + {{{http://xerial.org/software/}org.xerial sqlite-jdbc jar}} is manually added to + the classpath first, as that binary jar isn't shipped as standard. + + The {{{./api/org/apache/tika/parser/microsoft/JackcessParser.html}JackcessParser}} is + able to extract metadata and content in a tabular form, from Microsoft Access + database files. + + The {{{./api/org/apache/tika/parser/dbf/DBFParser.html}DBFParser}} currently + supports versions of dBase files (dbf) before version 7. dBase formats are + used in many legacy database systems, including + dBase, FoxBASE, FoxPRO and in ESRI's Shapefile format. 
See + {{{http://www.digitalpreservation.gov/formats/fdd/fdd000325.shtml} digitalpreservation.gov}} + for background on this format. + +* {Natural Language Processing} + + Tika supports calling out to a number of Natural Language Processing and + Named Entity Recognition frameworks, tools and libraries. + + These can be used to support additional formats, or to gain extra information on + existing formats. In many cases, additional tools, REST services or training + datasets are required to enable or power this support. + + Details on the requirements and setup steps are generally given either in + the parser's javadocs, or on the {{{https://wiki.apache.org/tika/}Tika wiki}}. + + The {{{./api/org/apache/tika/parser/sentiment/analysis/SentimentParser.html}SentimentParser}} + class classifies documents by their sentiment, powered by Apache + OpenNLP's Maximum Entropy Classifier. + + {{{./api/org/apache/tika/parser/journal/JournalParser.html}JournalParser}} uses + Grobid (via a RESTful server) to extract additional metadata from the text of + journal publications. A number of other NLP and NER parsers are available in the + {{{./api/org/apache/tika/parser/ner/}ner package}}. + +* {Image and Video object recognition} + + Tika supports calling out to a number of Object Recognition frameworks to + analyse the contents of images and videos. Large training datasets and/or + frameworks are generally required, often accessed via REST services. The + {{{./api/org/apache/tika/parser/recognition/}recognition package}} contains + most of these. Details on the requirements and setup steps are generally given + on the {{{https://wiki.apache.org/tika/}Tika wiki}}.
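The full list of supported formats on this page pairs each parser class with the media types it handles, and some of those types carry parameters (for example "audio/ogg; codecs=opus"). As a reading aid, the sketch below models that pairing as a plain lookup table, trying the full media type first and then the type without its parameters. The five entries are copied from the list on this page. The class, its methods and the fallback rule are assumptions made for illustration, and Tika's real parser selection is more involved and is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical lookup table, not part of Tika; entries copied from the list on this page. */
final class ParserLookup {

    private static final Map<String, String> PARSERS = new LinkedHashMap<>();

    static {
        PARSERS.put("application/pdf", "org.apache.tika.parser.pdf.PDFParser");
        PARSERS.put("application/rtf", "org.apache.tika.parser.microsoft.rtf.RTFParser");
        PARSERS.put("image/jpeg", "org.apache.tika.parser.image.JpegParser");
        PARSERS.put("audio/ogg", "org.gagravarr.tika.OggParser");
        PARSERS.put("audio/ogg; codecs=opus", "org.gagravarr.tika.OpusParser");
    }

    private ParserLookup() {
    }

    /** Returns the parser class name for a media type, or null if the table has no match. */
    static String parserFor(String mediaType) {
        String normalized = normalize(mediaType);
        String parser = PARSERS.get(normalized);
        if (parser == null) {
            // Assumed fallback: retry with the parameters stripped off.
            int semicolon = normalized.indexOf(';');
            if (semicolon >= 0) {
                parser = PARSERS.get(normalized.substring(0, semicolon));
            }
        }
        return parser;
    }

    // Lower-cases the type and rewrites it as "type/subtype; name=value".
    private static String normalize(String mediaType) {
        String[] parts = mediaType.trim().toLowerCase(Locale.ROOT).split("\\s*;\\s*");
        return String.join("; ", parts);
    }

    public static void main(String[] args) {
        System.out.println(parserFor("audio/ogg;codecs=opus"));
        System.out.println(parserFor("application/pdf"));
    }
}
```

A type with an unlisted parameter, such as "audio/ogg; codecs=vorbis", falls back to the entry for "audio/ogg" in this sketch.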
+ + +Full list of Supported Formats in "standard" artifacts + + * org.apache.tika.parser.apple.{{{./api/org/apache/tika/parser/apple/AppleSingleFileParser}AppleSingleFileParser}} + + * application/applefile + + * org.apache.tika.parser.apple.{{{./api/org/apache/tika/parser/apple/PListParser}PListParser}} + + * application/x-plist + + * application/x-bplist-itunes + + * application/x-bplist + + * application/x-bplist-memgraph + + * application/x-bplist-webarchive + + * org.apache.tika.parser.asm.{{{./api/org/apache/tika/parser/asm/ClassParser}ClassParser}} + + * application/java-vm + + * org.apache.tika.parser.audio.{{{./api/org/apache/tika/parser/audio/AudioParser}AudioParser}} + + * audio/vnd.wave + + * audio/x-wav + + * audio/basic + + * audio/x-aiff + + * org.apache.tika.parser.audio.{{{./api/org/apache/tika/parser/audio/MidiParser}MidiParser}} + + * application/x-midi + + * audio/midi + + * org.apache.tika.parser.code.{{{./api/org/apache/tika/parser/code/SourceCodeParser}SourceCodeParser}} + + * text/x-c++src + + * text/x-groovy + + * text/x-java-source + + * org.apache.tika.parser.crypto.{{{./api/org/apache/tika/parser/crypto/Pkcs7Parser}Pkcs7Parser}} + + * application/pkcs7-signature + + * application/pkcs7-mime + + * org.apache.tika.parser.crypto.{{{./api/org/apache/tika/parser/crypto/TSDParser}TSDParser}} + + * application/timestamped-data + + * org.apache.tika.parser.csv.{{{./api/org/apache/tika/parser/csv/TextAndCSVParser}TextAndCSVParser}} + + * text/csv + + * text/tsv + + * text/plain + + * org.apache.tika.parser.dbf.{{{./api/org/apache/tika/parser/dbf/DBFParser}DBFParser}} + + * application/x-dbf + + * org.apache.tika.parser.dgn.{{{./api/org/apache/tika/parser/dgn/DGN8Parser}DGN8Parser}} + + * image/vnd.dgn; version=8 + + * org.apache.tika.parser.dif.{{{./api/org/apache/tika/parser/dif/DIFParser}DIFParser}} + + * application/dif+xml + + * org.apache.tika.parser.dwg.{{{./api/org/apache/tika/parser/dwg/DWGParser}DWGParser}} + + * image/vnd.dwg + + * 
org.apache.tika.parser.epub.{{{./api/org/apache/tika/parser/epub/EpubParser}EpubParser}} + + * application/x-ibooks+zip + + * application/epub+zip + + * org.apache.tika.parser.executable.{{{./api/org/apache/tika/parser/executable/ExecutableParser}ExecutableParser}} + + * application/x-msdownload + + * application/x-sharedlib + + * application/x-elf + + * application/x-object + + * application/x-executable + + * application/x-coredump + + * org.apache.tika.parser.feed.{{{./api/org/apache/tika/parser/feed/FeedParser}FeedParser}} + + * application/atom+xml + + * application/rss+xml + + * org.apache.tika.parser.font.{{{./api/org/apache/tika/parser/font/AdobeFontMetricParser}AdobeFontMetricParser}} + + * application/x-font-adobe-metric + + * org.apache.tika.parser.font.{{{./api/org/apache/tika/parser/font/TrueTypeParser}TrueTypeParser}} + + * application/x-font-ttf + + * org.apache.tika.parser.html.{{{./api/org/apache/tika/parser/html/HtmlParser}HtmlParser}} + + * text/html + + * application/vnd.wap.xhtml+xml + + * application/x-asp + + * application/xhtml+xml + + * org.apache.tika.parser.http.{{{./api/org/apache/tika/parser/http/HttpParser}HttpParser}} + + * application/x-httpresponse + + * org.apache.tika.parser.hwp.{{{./api/org/apache/tika/parser/hwp/HwpV5Parser}HwpV5Parser}} + + * application/x-hwp-v5 + + * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/BPGParser}BPGParser}} + + * image/bpg + + * image/x-bpg + + * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/HeifParser}HeifParser}} + + * image/heic-sequence + + * image/heif + + * image/heic + + * image/heif-sequence + + * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/ICNSParser}ICNSParser}} + + * image/icns + + * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/ImageParser}ImageParser}} + + * image/png + + * image/vnd.wap.wbmp + + * image/x-jbig2 + + * image/bmp + + * image/x-xcf + + * image/gif + + * image/x-icon + + * image/x-ms-bmp + 
+ * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/JXLParser}JXLParser}} + + * image/jxl + + * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/JpegParser}JpegParser}} + + * image/jpeg + + * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/PSDParser}PSDParser}} + + * image/vnd.adobe.photoshop + + * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/TiffParser}TiffParser}} + + * image/tiff + + * org.apache.tika.parser.image.{{{./api/org/apache/tika/parser/image/WebPParser}WebPParser}} + + * image/webp + + * org.apache.tika.parser.indesign.{{{./api/org/apache/tika/parser/indesign/IDMLParser}IDMLParser}} + + * application/vnd.adobe.indesign-idml-package + + * org.apache.tika.parser.iptc.{{{./api/org/apache/tika/parser/iptc/IptcAnpaParser}IptcAnpaParser}} + + * text/vnd.iptc.anpa + + * org.apache.tika.parser.iwork.{{{./api/org/apache/tika/parser/iwork/IWorkPackageParser}IWorkPackageParser}} + + * application/vnd.apple.keynote + + * application/vnd.apple.iwork + + * application/vnd.apple.numbers + + * application/vnd.apple.pages + + * org.apache.tika.parser.iwork.iwana.{{{./api/org/apache/tika/parser/iwork/iwana/IWork13PackageParser}IWork13PackageParser}} + + * application/vnd.apple.numbers.13 + + * application/vnd.apple.unknown.13 + + * application/vnd.apple.pages.13 + + * application/vnd.apple.keynote.13 + + * org.apache.tika.parser.iwork.iwana.{{{./api/org/apache/tika/parser/iwork/iwana/IWork18PackageParser}IWork18PackageParser}} + + * application/vnd.apple.pages.18 + + * application/vnd.apple.keynote.18 + + * application/vnd.apple.numbers.18 + + * org.apache.tika.parser.mail.{{{./api/org/apache/tika/parser/mail/RFC822Parser}RFC822Parser}} + + * message/rfc822 + + * org.apache.tika.parser.mat.{{{./api/org/apache/tika/parser/mat/MatParser}MatParser}} + + * application/x-matlab-data + + * org.apache.tika.parser.mbox.{{{./api/org/apache/tika/parser/mbox/MboxParser}MboxParser}} + + * application/mbox 
+ + * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/EMFParser}EMFParser}} + + * image/emf + + * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/JackcessParser}JackcessParser}} + + * application/x-msaccess + + * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/MSOwnerFileParser}MSOwnerFileParser}} + + * application/x-ms-owner + + * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/OfficeParser}OfficeParser}} + + * application/x-tika-msoffice-embedded; format=ole10_native + + * application/msword + + * application/vnd.visio + + * application/x-tika-ole-drm-encrypted + + * application/vnd.ms-project + + * application/x-tika-msworks-spreadsheet + + * application/x-mspublisher + + * application/vnd.ms-powerpoint + + * application/x-tika-msoffice + + * application/sldworks + + * application/x-tika-ooxml-protected + + * application/vnd.ms-excel + + * application/vnd.ms-outlook + + * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/OldExcelParser}OldExcelParser}} + + * application/vnd.ms-excel.workspace.3 + + * application/vnd.ms-excel.workspace.4 + + * application/vnd.ms-excel.sheet.2 + + * application/vnd.ms-excel.sheet.3 + + * application/vnd.ms-excel.sheet.4 + + * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/TNEFParser}TNEFParser}} + + * application/vnd.ms-tnef + + * application/x-tnef + + * application/ms-tnef + + * org.apache.tika.parser.microsoft.{{{./api/org/apache/tika/parser/microsoft/WMFParser}WMFParser}} + + * image/wmf + + * org.apache.tika.parser.microsoft.chm.{{{./api/org/apache/tika/parser/microsoft/chm/ChmParser}ChmParser}} + + * application/vnd.ms-htmlhelp + + * application/x-chm + + * application/chm + + * org.apache.tika.parser.microsoft.onenote.{{{./api/org/apache/tika/parser/microsoft/onenote/OneNoteParser}OneNoteParser}} + + * application/onenote; format=one + + * 
org.apache.tika.parser.microsoft.ooxml.{{{./api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser}OOXMLParser}} + + * application/vnd.ms-powerpoint.template.macroenabled.12 + + * application/vnd.ms-excel.addin.macroenabled.12 + + * application/vnd.openxmlformats-officedocument.wordprocessingml.template + + * application/vnd.ms-excel.sheet.binary.macroenabled.12 + + * application/vnd.openxmlformats-officedocument.wordprocessingml.document + + * application/vnd.ms-powerpoint.slide.macroenabled.12 + + * application/vnd.ms-visio.drawing + + * application/vnd.ms-powerpoint.slideshow.macroenabled.12 + + * application/vnd.ms-powerpoint.presentation.macroenabled.12 + + * application/vnd.openxmlformats-officedocument.presentationml.slide + + * application/vnd.ms-excel.sheet.macroenabled.12 + + * application/vnd.ms-word.template.macroenabled.12 + + * application/vnd.ms-word.document.macroenabled.12 + + * application/vnd.ms-powerpoint.addin.macroenabled.12 + + * application/vnd.openxmlformats-officedocument.spreadsheetml.template + + * application/vnd.ms-xpsdocument + + * application/vnd.ms-visio.drawing.macroenabled.12 + + * application/vnd.ms-visio.template.macroenabled.12 + + * model/vnd.dwfx+xps + + * application/vnd.openxmlformats-officedocument.presentationml.template + + * application/vnd.openxmlformats-officedocument.presentationml.presentation + + * application/vnd.openxmlformats-officedocument.spreadsheetml.sheet + + * application/vnd.ms-visio.stencil + + * application/vnd.ms-visio.template + + * application/vnd.openxmlformats-officedocument.presentationml.slideshow + + * application/vnd.ms-visio.stencil.macroenabled.12 + + * application/vnd.ms-excel.template.macroenabled.12 + + * org.apache.tika.parser.microsoft.ooxml.xwpf.ml2006.{{{./api/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParser}Word2006MLParser}} + + * application/vnd.ms-word2006ml + + * 
org.apache.tika.parser.microsoft.pst.{{{./api/org/apache/tika/parser/microsoft/pst/OutlookPSTParser}OutlookPSTParser}} + + * application/vnd.ms-outlook-pst + + * org.apache.tika.parser.microsoft.rtf.{{{./api/org/apache/tika/parser/microsoft/rtf/RTFParser}RTFParser}} + + * application/rtf + + * org.apache.tika.parser.microsoft.xml.{{{./api/org/apache/tika/parser/microsoft/xml/SpreadsheetMLParser}SpreadsheetMLParser}} + + * application/vnd.ms-spreadsheetml + + * org.apache.tika.parser.microsoft.xml.{{{./api/org/apache/tika/parser/microsoft/xml/WordMLParser}WordMLParser}} + + * application/vnd.ms-wordml + + * org.apache.tika.parser.mif.{{{./api/org/apache/tika/parser/mif/MIFParser}MIFParser}} + + * application/x-mif + + * application/vnd.mif + + * application/x-maker + + * org.apache.tika.parser.mp3.{{{./api/org/apache/tika/parser/mp3/Mp3Parser}Mp3Parser}} + + * audio/mpeg + + * org.apache.tika.parser.mp4.{{{./api/org/apache/tika/parser/mp4/MP4Parser}MP4Parser}} + + * video/x-m4v + + * application/mp4 + + * video/3gpp + + * video/3gpp2 + + * video/quicktime + + * audio/mp4 + + * video/mp4 + + * org.apache.tika.parser.ocr.{{{./api/org/apache/tika/parser/ocr/TesseractOCRParser}TesseractOCRParser}} + + * image/ocr-x-portable-pixmap + + * image/ocr-jpx + + * image/x-portable-pixmap + + * image/ocr-jpeg + + * image/ocr-jp2 + + * image/jpx + + * image/ocr-png + + * image/ocr-tiff + + * image/ocr-gif + + * image/ocr-bmp + + * image/jp2 + + * org.apache.tika.parser.odf.{{{./api/org/apache/tika/parser/odf/FlatOpenDocumentParser}FlatOpenDocumentParser}} + + * application/vnd.oasis.opendocument.tika.flat.document + + * application/vnd.oasis.opendocument.flat.presentation + + * application/vnd.oasis.opendocument.flat.spreadsheet + + * application/vnd.oasis.opendocument.flat.text + + * org.apache.tika.parser.odf.{{{./api/org/apache/tika/parser/odf/OpenDocumentParser}OpenDocumentParser}} + + * application/x-vnd.oasis.opendocument.presentation + + * 
application/vnd.oasis.opendocument.chart + + * application/x-vnd.oasis.opendocument.text-web + + * application/x-vnd.oasis.opendocument.image + + * application/vnd.oasis.opendocument.graphics-template + + * application/vnd.oasis.opendocument.text-web + + * application/x-vnd.oasis.opendocument.spreadsheet-template + + * application/vnd.oasis.opendocument.spreadsheet-template + + * application/vnd.sun.xml.writer + + * application/x-vnd.oasis.opendocument.graphics-template + + * application/vnd.oasis.opendocument.graphics + + * application/vnd.oasis.opendocument.spreadsheet + + * application/x-vnd.oasis.opendocument.chart + + * application/x-vnd.oasis.opendocument.spreadsheet + + * application/vnd.oasis.opendocument.image + + * application/x-vnd.oasis.opendocument.text + + * application/x-vnd.oasis.opendocument.text-template + + * application/vnd.oasis.opendocument.formula-template + + * application/x-vnd.oasis.opendocument.formula + + * application/vnd.oasis.opendocument.image-template + + * application/x-vnd.oasis.opendocument.image-template + + * application/x-vnd.oasis.opendocument.presentation-template + + * application/vnd.oasis.opendocument.presentation-template + + * application/vnd.oasis.opendocument.text + + * application/vnd.oasis.opendocument.text-template + + * application/vnd.oasis.opendocument.chart-template + + * application/x-vnd.oasis.opendocument.chart-template + + * application/x-vnd.oasis.opendocument.formula-template + + * application/x-vnd.oasis.opendocument.text-master + + * application/vnd.oasis.opendocument.presentation + + * application/x-vnd.oasis.opendocument.graphics + + * application/vnd.oasis.opendocument.formula + + * application/vnd.oasis.opendocument.text-master + + * org.apache.tika.parser.pdf.{{{./api/org/apache/tika/parser/pdf/PDFParser}PDFParser}} + + * application/pdf + + * org.apache.tika.parser.pkg.{{{./api/org/apache/tika/parser/pkg/CompressorParser}CompressorParser}} + + * application/zlib + + * application/x-gzip + + * 
application/x-bzip2 + + * application/x-compress + + * application/x-java-pack200 + + * application/x-lzma + + * application/deflate64 + + * application/x-lz4 + + * application/x-snappy + + * application/x-brotli + + * application/gzip + + * application/x-bzip + + * application/x-xz + + * org.apache.tika.parser.pkg.{{{./api/org/apache/tika/parser/pkg/PackageParser}PackageParser}} + + * application/x-tar + + * application/java-archive + + * application/x-arj + + * application/x-archive + + * application/zip + + * application/x-cpio + + * application/x-tika-unix-dump + + * application/x-7z-compressed + + * org.apache.tika.parser.pkg.{{{./api/org/apache/tika/parser/pkg/RarParser}RarParser}} + + * application/x-rar-compressed + + * org.apache.tika.parser.prt.{{{./api/org/apache/tika/parser/prt/PRTParser}PRTParser}} + + * application/x-prt + + * org.apache.tika.parser.sas.{{{./api/org/apache/tika/parser/sas/SAS7BDATParser}SAS7BDATParser}} + + * application/x-sas-data + + * org.apache.tika.parser.tmx.{{{./api/org/apache/tika/parser/tmx/TMXParser}TMXParser}} + + * application/x-tmx + + * org.apache.tika.parser.video.{{{./api/org/apache/tika/parser/video/FLVParser}FLVParser}} + + * video/x-flv + + * org.apache.tika.parser.wacz.{{{./api/org/apache/tika/parser/wacz/WACZParser}WACZParser}} + + * application/x-wacz + + * org.apache.tika.parser.warc.{{{./api/org/apache/tika/parser/warc/WARCParser}WARCParser}} + + * application/warc + + * org.apache.tika.parser.wordperfect.{{{./api/org/apache/tika/parser/wordperfect/QuattroProParser}QuattroProParser}} + + * application/x-quattro-pro; version=9 + + * org.apache.tika.parser.wordperfect.{{{./api/org/apache/tika/parser/wordperfect/WordPerfectParser}WordPerfectParser}} + + * application/vnd.wordperfect; version=5.1 + + * application/vnd.wordperfect; version=5.0 + + * application/vnd.wordperfect; version=6.x + + * org.apache.tika.parser.xliff.{{{./api/org/apache/tika/parser/xliff/XLIFF12Parser}XLIFF12Parser}} + + * 
application/x-xliff+xml + + * org.apache.tika.parser.xliff.{{{./api/org/apache/tika/parser/xliff/XLZParser}XLZParser}} + + * application/x-xliff+zip + + * org.apache.tika.parser.xml.{{{./api/org/apache/tika/parser/xml/DcXMLParser}DcXMLParser}} + + * application/xml + + * image/svg+xml + + * org.apache.tika.parser.xml.{{{./api/org/apache/tika/parser/xml/FictionBookParser}FictionBookParser}} + + * application/x-fictionbook+xml + + * org.gagravarr.tika.{{{./api/org/gagravarr/tika/FlacParser}FlacParser}} + + * audio/x-oggflac + + * audio/x-flac + + * org.gagravarr.tika.{{{./api/org/gagravarr/tika/OggParser}OggParser}} + + * audio/ogg + + * application/kate + + * application/ogg + + * video/daala + + * video/x-ogguvs + + * video/x-ogm + + * audio/x-oggpcm + + * video/ogg + + * video/x-dirac + + * video/x-oggrgb + + * video/x-oggyuv + + * org.gagravarr.tika.{{{./api/org/gagravarr/tika/OpusParser}OpusParser}} + + * audio/opus + + * audio/ogg; codecs=opus + + * org.gagravarr.tika.{{{./api/org/gagravarr/tika/SpeexParser}SpeexParser}} + + * audio/ogg; codecs=speex + + * audio/speex + + * org.gagravarr.tika.{{{./api/org/gagravarr/tika/TheoraParser}TheoraParser}} + + * video/theora + + * org.gagravarr.tika.{{{./api/org/gagravarr/tika/VorbisParser}VorbisParser}} + + * audio/vorbis + Added: tika/site/src/site/apt/2.6.0/gettingstarted.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/gettingstarted.apt?rev=1905121&view=auto ============================================================================== --- tika/site/src/site/apt/2.6.0/gettingstarted.apt (added) +++ tika/site/src/site/apt/2.6.0/gettingstarted.apt Mon Nov 7 11:40:42 2022 @@ -0,0 +1,324 @@ + -------------------------------- + Getting Started with Apache Tika + -------------------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. 
See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Getting Started with Apache Tika + + This document describes how to build Apache Tika from sources and + how to start using Tika in an application. + +Getting and building the sources + + To build Tika from sources you first need to either + {{{../download.html}download}} a source release or + {{{../contribute.html#Source_Code}check out}} the latest sources from + version control. + + Once you have the sources, you can build them using the + {{{http://maven.apache.org/}Maven}} build system. Executing the + following command in the base directory will build the sources + and install the resulting artifacts in your local Maven repository. + +--- +mvn install +--- + + If you want to build only the app or the server with the standard parsers, + you can save time with: + +--- +mvn install -am -pl :tika-app +--- + Or: + +--- +mvn install -am -pl :tika-server-standard +--- + + See the Maven documentation for more information about the available + build options. + + Note that you need Java 8 or higher to build Tika. For a full build, you'll also need to have Docker installed. + +Build artifacts + + The Tika build consists of a number of components and produces + the following main binaries: + + [tika-core/target/tika-core-*.jar] + Tika core library.
Contains the core interfaces and classes of Tika, + but none of the parser implementations. + + [tika-parsers/tika-parsers-standard/tika-parsers-standard-package/target/tika-parsers-standard-package-*.jar] + Tika parsers. Collection of classes that implement the Tika Parser + interface based on various external parser libraries. This includes + the most commonly used parsers. Users may want to add <<<tika-parser-sqlite3-package>>> + and <<<tika-parser-scientific-package>>> or other parser modules. + + [tika-app/target/tika-app-*.jar] + Tika application. Combines the above components and the standard + parser libraries into a single runnable jar with a GUI and a + command-line interface. + + [tika-server/tika-server-standard/target/tika-server-standard-*.jar] + Tika JAX-RS REST application. This is a Jetty web server running Tika + REST services with the parsers in tika-parsers-standard-package, + as described in {{{https://cwiki.apache.org/confluence/display/TIKA/TikaServer}this page}}. + + [tika-bundles/tika-bundle-standard/target/tika-bundle-standard-*.jar] + Tika bundle. An OSGi bundle that combines tika-parsers with non-OSGified + parser libraries to make them easy to deploy in an OSGi environment. + + [tika-eval/tika-eval-app/target/tika-eval-app-*.jar] + Tika eval module. Command-line tool to assess the output of Tika + or compare the output of two different versions of Tika or + other text extraction packages. + + + +Using Tika as a Maven dependency + + The core library, <<<tika-core>>>, contains the key interfaces and classes + of Tika and can be used by itself if you don't need the full set of parsers + from the <<< tika-parsers >>> component.
The tika-core dependency looks like + this: + +--- + <dependency> + <groupId>org.apache.tika</groupId> + <artifactId>tika-core</artifactId> + <version>2.6.0</version> + </dependency> +--- + + If you want to use Tika to parse documents (instead of simply detecting + document types, etc.), you'll want to add a dependency on at least + <<< tika-parsers-standard-package >>>: + +--- + <dependency> + <groupId>org.apache.tika</groupId> + <artifactId>tika-parsers-standard-package</artifactId> + <version>2.6.0</version> + </dependency> +--- + + Note that adding this dependency will introduce a number of + transitive dependencies to your project. + You need to make sure that these dependencies won't conflict with your + existing project dependencies. You can use the following command in + the tika-parsers-standard-package directory to get a full listing of all the dependencies. + +--- +$ mvn dependency:tree | grep :compile +--- + + You may also want to add one or more of the following dependencies: + +--- + <dependency> + <groupId>org.apache.tika</groupId> + <artifactId>tika-parser-sqlite3-package</artifactId> + <version>2.6.0</version> + </dependency> + <dependency> + <groupId>org.apache.tika</groupId> + <artifactId>tika-parser-scientific-package</artifactId> + <version>2.6.0</version> + </dependency> +--- + + You may also consider adding dependencies on modules under the <<<tika-parsers-ml>>> module. 
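 If the dependency listing shows a transitive dependency that clashes with one already in your project, standard Maven exclusions are one way to handle it. The fragment below is a minimal sketch and is not taken from the Tika documentation: the <<<com.example:conflicting-library>>> coordinates are hypothetical placeholders for whatever artifact conflicts in your own build.

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.6.0</version>
  <exclusions>
    <!-- Hypothetical coordinates: replace with the conflicting artifact
         reported by "mvn dependency:tree" in your own project. -->
    <exclusion>
      <groupId>com.example</groupId>
      <artifactId>conflicting-library</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

 Be aware that excluding a library that a parser actually needs may cause that parser to fail at runtime, so it is worth re-testing the affected document formats after adding an exclusion.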
+ +Using Tika in a Gradle-built project + + To add a dependency on Apache Tika to your Gradle-built project, + including the full set of parsers, you should depend on the + <<< tika-core >>> artifact and the + <<< tika-parsers-standard-package >>> artifact: + +--- +dependencies { + implementation 'org.apache.tika:tika-core:2.6.0' + implementation 'org.apache.tika:tika-parsers-standard-package:2.6.0' +} +--- + +Using Tika in an Ant project + + If you are using {{{http://ant.apache.org/ivy/}Apache Ivy}} as your + dependency manager tool with Ant, then to include Tika with the full set + of parsers, you should depend on the <<< tika-core >>> and + <<< tika-parsers-standard-package >>> artifacts like this: + +--- + <dependencies> + <dependency org="org.apache.tika" name="tika-core" rev="2.6.0"/> + <dependency org="org.apache.tika" name="tika-parsers-standard-package" rev="2.6.0"/> + </dependencies> +--- + + Otherwise, probably the easiest way to use Tika is to include the full + <<< tika-app >>> jar on your classpath. For just core functionality, you + can add the <<< tika-core >>> jar instead. Be aware that the full set of + parsers has a large number of dependencies, all of which must be included, + and this is very fiddly to do by hand with Ant. To include Tika in your Ant project, + you should do something like: + +--- +<classpath> + ... <!-- your other classpath entries --> + + <!-- either: Tika Core only, no parsers --> + <pathelement location="path/to/tika-core-2.6.0.jar"/> + <!-- or: Tika with all Parsers--> + <pathelement location="path/to/tika-app-2.6.0.jar"/> + +</classpath> +--- + +Using Tika as a command line utility + + The Tika application jar (tika-app-*.jar) can be used as a command + line utility for extracting text content and metadata from all sorts of + files. This runnable jar contains all the dependencies it needs, so + you don't need to worry about classpath settings to run it. + + The usage instructions are shown below. + +--- +usage: java -jar tika-app.jar [option...] [file|port...] + +Options: + -?
or --help Print this usage message + -v or --verbose Print debug level messages + -V or --version Print the Apache Tika version number + + -g or --gui Start the Apache Tika GUI + -s or --server Start the Apache Tika server + -f or --fork Use Fork Mode for out-of-process extraction + + --config=<tika-config.xml> + TikaConfig file. Must be specified before -g, -s, -f or the dump-x-config ! + --dump-minimal-config Print minimal TikaConfig + --dump-current-config Print current TikaConfig + --dump-static-config Print static config + --dump-static-full-config Print static explicit config + + -x or --xml Output XHTML content (default) + -h or --html Output HTML content + -t or --text Output plain text content + -T or --text-main Output plain text content (main content only) + -m or --metadata Output only metadata + -j or --json Output metadata in JSON + -y or --xmp Output metadata in XMP + -J or --jsonRecursive Output metadata and content from all + embedded files (choose content type + with -x, -h, -t or -m; default is -x) + -l or --language Output only language + -d or --detect Detect document type + --digest=X Include digest X (md2, md5, sha1, + sha256, sha384, sha512 + -eX or --encoding=X Use output encoding X + -pX or --password=X Use document password X + -z or --extract Extract all attachements into current directory + --extract-dir=<dir> Specify target directory for -z + -r or --pretty-print For JSON, XML and XHTML outputs, adds newlines and + whitespace, for better readability + + --list-parsers + List the available document parsers + --list-parser-details + List the available document parsers and their supported mime types + --list-parser-details-apt + List the available document parsers and their supported mime types in apt format. 
+ --list-detectors + List the available document detectors + --list-met-models + List the available metadata models, and their supported keys + --list-supported-types + List all known media types and related information + + + --compare-file-magic=<dir> + Compares Tika's known media types to the File(1) tool's magic directory + +Description: + Apache Tika will parse the file(s) specified on the + command line and output the extracted text content + or metadata to standard output. + + Instead of a file name you can also specify the URL + of a document to be parsed. + + If no file name or URL is specified (or the special + name "-" is used), then the standard input stream + is parsed. If no arguments were given and no input + data is available, the GUI is started instead. + +- GUI mode + + Use the "--gui" (or "-g") option to start the + Apache Tika GUI. You can drag and drop files from + a normal file explorer to the GUI window to extract + text content and metadata from the files. + +- Batch mode + + Simplest method. + Specify two directories as args with no other args: + java -jar tika-app.jar <inputDirectory> <outputDirectory> + + +Batch Options: + -i or --inputDir Input directory + -o or --outputDir Output directory + -numConsumers Number of processing threads + -bc Batch config file + -maxRestarts Maximum number of times the + watchdog process will restart the child process. + -timeoutThresholdMillis Number of milliseconds allowed to a parse + before the process is killed and restarted + -fileList List of files to process, with + paths relative to the input directory + -includeFilePat Regular expression to determine which + files to process, e.g. "(?i)\.pdf" + -excludeFilePat Regular expression to determine which + files to avoid processing, e.g. "(?i)\.pdf" + -maxFileSizeBytes Skip files longer than this value + + Control the type of output with -x, -h, -t and/or -J. 
+ + To modify child process jvm args, prepend "J" as in: + -JXmx4g or -JDlog4j.configuration=file:log4j.xml. + +--- + + You can also use the jar as a component in a Unix pipeline or + as an external tool in many scripting languages. + +--- +# Check if an Internet resource contains a specific keyword +curl http://.../document.doc \ + | java -jar tika-app.jar --text \ + | grep -q keyword +--- + +Wrappers + + Several wrappers are available for using Tika from other programming languages, + such as {{{https://github.com/aviks/Taro.jl}Julia}} or {{{https://github.com/chrismattmann/tika-python}Python}}. Added: tika/site/src/site/apt/2.6.0/index.apt URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/2.6.0/index.apt?rev=1905121&view=auto ============================================================================== --- tika/site/src/site/apt/2.6.0/index.apt (added) +++ tika/site/src/site/apt/2.6.0/index.apt Mon Nov 7 11:40:42 2022 @@ -0,0 +1,53 @@ + ----------------- + Apache Tika 2.6.0 + ----------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License.
+ + +Apache Tika 2.6.0 + + The most notable changes in Tika 2.6.0 over the previous release are: + + * Add optional Siegfried detector ({{{http://issues.apache.org/jira/browse/TIKA-3901}TIKA-3901}}). + + * Move OverrideDetector's functionality to the CompositeDetector ({{{http://issues.apache.org/jira/browse/TIKA-3904}TIKA-3904}}). + + * The FileCommandDetector has been refactored to have the same behavior as the Siegfried detector; see setUseMime in the javadoc ({{{http://issues.apache.org/jira/browse/TIKA-3902}TIKA-3902}}). + + * Fix bug in OpenSearch emitter that prevented upserts on documents with embedded files ({{{http://issues.apache.org/jira/browse/TIKA-3882}TIKA-3882}}). + + * Extract PDF actions and triggers into the file's metadata ({{{http://issues.apache.org/jira/browse/TIKA-3887}TIKA-3887}}). + + * Add a tika-async-cli module ({{{http://issues.apache.org/jira/browse/TIKA-3885}TIKA-3885}}). + + + The following people have contributed to Tika 2.6.0 by submitting or + commenting on the issues resolved in this release: + + * Dave Meikle + + * Ethan Wilansky + + * Luca Perico + + * Tilman Hausherr + + * Tim Allison + + * Tong Wang + + See {{https://s.apache.org/zrcax}} for more details on these contributions.