Author: mattmann
Date: Thu Nov 10 22:57:46 2016
New Revision: 1769222
URL: http://svn.apache.org/viewvc?rev=1769222&view=rev
Log:
Getting started: Apache Tika 1.14
Added:
tika/site/src/site/apt/1.14/gettingstarted.apt
Added: tika/site/src/site/apt/1.14/gettingstarted.apt
URL:
http://svn.apache.org/viewvc/tika/site/src/site/apt/1.14/gettingstarted.apt?rev=1769222&view=auto
==============================================================================
--- tika/site/src/site/apt/1.14/gettingstarted.apt (added)
+++ tika/site/src/site/apt/1.14/gettingstarted.apt Thu Nov 10 22:57:46 2016
@@ -0,0 +1,241 @@
+ --------------------------------
+ Getting Started with Apache Tika
+ --------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Getting Started with Apache Tika
+
+ This document describes how to build Apache Tika from sources and
+ how to start using Tika in an application.
+
+Getting and building the sources
+
+ To build Tika from sources, you first need to either
+ {{{../download.html}download}} a source release or
+ {{{../source-repository.html}check out}} the latest sources from
+ version control.
+
+ Once you have the sources, you can build them using the
+ {{{http://maven.apache.org/}Maven 2}} build system. Executing the
+ following command in the base directory will build the sources
+ and install the resulting artifacts in your local Maven repository.
+
+---
+mvn install
+---
+
+ See the Maven documentation for more information about the available
+ build options.
+
+ Note that you need Java 7 or higher to build Tika.
+
+Build artifacts
+
+ The Tika build consists of a number of components and produces
+ the following main binaries:
+
+ [tika-core/target/tika-core-*.jar]
+ Tika core library. Contains the core interfaces and classes of Tika,
+ but none of the parser implementations. Depends only on Java 6.
+
+ [tika-parsers/target/tika-parsers-*.jar]
+ Tika parsers. Collection of classes that implement the Tika Parser
+ interface based on various external parser libraries.
+
+ [tika-app/target/tika-app-*.jar]
+ Tika application. Combines the above components and all the external
+ parser libraries into a single runnable jar with a GUI and a command
+ line interface.
+
+ [tika-server/target/tika-server-*.jar]
+ Tika JAX-RS REST application. This is a Jetty web server running Tika
+ REST services as described in
+ {{{http://wiki.apache.org/tika/TikaJAXRS}this page}}.
+
+ [tika-bundle/target/tika-bundle-*.jar]
+ Tika bundle. An OSGi bundle that combines tika-parsers with non-OSGified
+ parser libraries to make them easy to deploy in an OSGi environment.
+
+Using Tika as a Maven dependency
+
+ The core library, <<< tika-core >>>, contains the key interfaces and classes
+ of Tika and can be used by itself if you don't need the full set of parsers
+ from the <<< tika-parsers >>> component. The tika-core dependency looks like
+ this:
+
+---
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-core</artifactId>
+ <version>1.14</version>
+ </dependency>
+---
+
+ If you want to use Tika to parse documents (instead of simply detecting
+ document types, etc.), you'll want to depend on <<< tika-parsers >>> instead:
+
+---
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-parsers</artifactId>
+ <version>1.14</version>
+ </dependency>
+---
+
+ Note that adding this dependency will introduce a number of
+ transitive dependencies to your project, including one on tika-core.
+ You need to make sure that these dependencies won't conflict with your
+ existing project dependencies. You can use the following command in
+ the tika-parsers directory to get a full listing of all the dependencies.
+
+---
+$ mvn dependency:tree | grep :compile
+---
+
+Using Tika in a Gradle-built project
+
+ To add a dependency on Apache Tika to your Gradle-built project,
+ including the full set of parsers, you should depend on the
+ <<< tika-parsers >>> artifact:
+
+---
+dependencies {
+ runtime 'org.apache.tika:tika-parsers:1.14'
+}
+---
+
+Using Tika in an Ant project
+
+ If you are using {{{http://ant.apache.org/ivy/}Apache Ivy}} as your
+ dependency manager with Ant, then to include Tika with the full set
+ of parsers, depend on the <<< tika-parsers >>> artifact like this:
+
+---
+ <dependencies>
+ <dependency org="org.apache.tika" name="tika-parsers" rev="1.14"/>
+ </dependencies>
+---
+
+ Otherwise, probably the easiest way to use Tika is to include the full
+ <<< tika-app >>> jar on your classpath. For just the core functionality, you
+ can add the <<< tika-core >>> jar instead. Be aware that the full set of
+ parsers has a large number of dependencies that must all be included, which
+ is very fiddly to do by hand with Ant. To include Tika in your Ant project,
+ do something like this:
+
+---
+<classpath>
+ ... <!-- your other classpath entries -->
+
+ <!-- either: Tika Core only, no parsers -->
+ <pathelement location="path/to/tika-core-${tika.version}.jar"/>
+ <!-- or: Tika with all parsers -->
+ <pathelement location="path/to/tika-app-${tika.version}.jar"/>
+
+</classpath>
+---
+
+Using Tika as a command line utility
+
+ The Tika application jar (tika-app-*.jar) can be used as a command
+ line utility for extracting text content and metadata from all sorts of
+ files. This runnable jar contains all the dependencies it needs, so
+ you don't need to worry about classpath settings to run it.
+
+ The usage instructions are shown below.
+
+---
+usage: java -jar tika-app.jar [option...] [file|port...]
+
+Options:
+ -? or --help Print this usage message
+ -v or --verbose Print debug level messages
+ -V or --version Print the Apache Tika version number
+
+ -g or --gui Start the Apache Tika GUI
+ -s or --server Start the Apache Tika server
+ -f or --fork Use Fork Mode for out-of-process extraction
+
+ -x or --xml Output XHTML content (default)
+ -h or --html Output HTML content
+ -t or --text Output plain text content
+ -T or --text-main Output plain text content (main content only)
+ -m or --metadata Output only metadata
+ -j or --json Output metadata in JSON
+ -y or --xmp Output metadata in XMP
+ -l or --language Output only language
+ -d or --detect Detect document type
+ -eX or --encoding=X Use output encoding X
+ -pX or --password=X Use document password X
+ -z or --extract Extract all attachments into current directory
+ --extract-dir=<dir> Specify target directory for -z
+ -r or --pretty-print For XML and XHTML outputs, adds newlines and
+ whitespace, for better readability
+
+ --create-profile=X
+ Create NGram profile, where X is a profile name
+ --list-parsers
+ List the available document parsers
+ --list-parser-details
+ List the available document parsers, and their supported mime types
+ --list-detectors
+ List the available document detectors
+ --list-met-models
+ List the available metadata models, and their supported keys
+ --list-supported-types
+ List all known media types and related information
+
+Description:
+ Apache Tika will parse the file(s) specified on the
+ command line and output the extracted text content
+ or metadata to standard output.
+
+ Instead of a file name you can also specify the URL
+ of a document to be parsed.
+
+ If no file name or URL is specified (or the special
+ name "-" is used), then the standard input stream
+ is parsed. If no arguments were given and no input
+ data is available, the GUI is started instead.
+
+- GUI mode
+
+ Use the "--gui" (or "-g") option to start the
+ Apache Tika GUI. You can drag and drop files from
+ a normal file explorer to the GUI window to extract
+ text content and metadata from the files.
+
+- Server mode
+
+ Use the "--server" (or "-s") option to start the
+ Apache Tika server. The server will listen to the
+ ports you specify as one or more arguments.
+---
+
+ You can also use the jar as a component in a Unix pipeline or
+ as an external tool in many scripting languages.
+
+---
+# Check if an Internet resource contains a specific keyword
+curl http://.../document.doc \
+ | java -jar tika-app.jar --text \
+ | grep -q keyword
+---
+
+Wrappers
+
+ Several wrappers are available to use Tika in another programming language,
+ such as {{{https://github.com/aviks/Taro.jl}Julia}} or
+ {{{https://github.com/chrismattmann/tika-python}Python}}.
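
As a rough illustration of using the Tika application jar as an external tool from a scripting language, as described in the command line section above, the following Python sketch shells out to the jar. The helper names build_tika_command and run_tika, the OUTPUT_FLAGS table and the jar path are assumptions made for this example and are not part of Tika; only the command line options are taken from the usage text above. The sketch also assumes that a java executable is available on the PATH.

```python
import subprocess
from typing import List, Optional

# Output options copied from the tika-app usage text; the mapping itself
# is a hypothetical convenience for this sketch, not a Tika API.
OUTPUT_FLAGS = {
    "xml": "--xml",
    "html": "--html",
    "text": "--text",
    "metadata": "--metadata",
    "json": "--json",
    "language": "--language",
    "detect": "--detect",
}


def build_tika_command(jar_path: str, mode: str = "text",
                       source: Optional[str] = None,
                       encoding: Optional[str] = None) -> List[str]:
    """Build the argument list for one tika-app invocation.

    source is a file name or URL; when it is None the special name "-"
    is passed so that Tika parses standard input.
    """
    if mode not in OUTPUT_FLAGS:
        raise ValueError("unknown output mode: " + mode)
    command = ["java", "-jar", jar_path, OUTPUT_FLAGS[mode]]
    if encoding is not None:
        command.append("--encoding=" + encoding)
    command.append(source if source is not None else "-")
    return command


def run_tika(jar_path: str, mode: str = "text",
             source: Optional[str] = None,
             data: Optional[bytes] = None,
             encoding: str = "UTF-8") -> str:
    """Run tika-app and return its standard output as a string.

    data is fed to standard input and is only meaningful when source is
    None. A non-zero exit status raises subprocess.CalledProcessError.
    """
    command = build_tika_command(jar_path, mode, source, encoding)
    result = subprocess.run(command, input=data,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            check=True)
    return result.stdout.decode(encoding)
```

With a jar available locally, a call such as run_tika("path/to/tika-app.jar", "text", "document.doc") would be expected to return the extracted plain text, and passing data=... with no source mirrors the pipeline example above. Keeping the command construction in its own function makes it possible to inspect or log the exact command line without starting a JVM.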