This is an automated email from the ASF dual-hosted git repository. tballison pushed a commit to branch docs/pipes-updates in repository https://gitbox.apache.org/repos/asf/tika.git
commit 9cf2de2c3a170cdbbbdaae2b6480b56d81290c94
Author: tallison <[email protected]>
AuthorDate: Mon May 11 11:43:39 2026 -0400

    add file system docs
---
 docs/modules/ROOT/examples/pipes-fs-pipeline.json  |   2 +-
 docs/modules/ROOT/nav.adoc                         |   2 +
 docs/modules/ROOT/pages/pipes/getting-started.adoc |   4 +-
 docs/modules/ROOT/pages/pipes/plugins/index.adoc   | 133 +++++++++++++++++++++
 4 files changed, 139 insertions(+), 2 deletions(-)

diff --git a/docs/modules/ROOT/examples/pipes-fs-pipeline.json b/docs/modules/ROOT/examples/pipes-fs-pipeline.json
index 5a7538b141..4b71666add 120000
--- a/docs/modules/ROOT/examples/pipes-fs-pipeline.json
+++ b/docs/modules/ROOT/examples/pipes-fs-pipeline.json
@@ -1 +1 @@
-../../../../tika-pipes/tika-pipes-plugins/tika-pipes-file-system/src/test/resources/config-examples/file-system-pipeline.json
\ No newline at end of file
+../../../../tika-pipes/tika-pipes-integration-tests/src/test/resources/configs/tika-config-basic.json
\ No newline at end of file
diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 979555022a..ef16b190dd 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -31,6 +31,8 @@
 ** xref:pipes/unpack-config.adoc[Extracting Embedded Bytes]
 ** xref:pipes/timeouts.adoc[Timeouts]
 ** xref:pipes/cpu-sizing.adoc[Forked-JVM CPU Sizing]
+** xref:pipes/plugins/index.adoc[Plugins]
+*** xref:pipes/plugins/filesystem.adoc[File System]
 * xref:configuration/index.adoc[Configuration]
 ** xref:configuration/parsers/pdf-parser.adoc[PDF Parser]
 ** xref:configuration/parsers/tesseract-ocr-parser.adoc[Tesseract OCR]
diff --git a/docs/modules/ROOT/pages/pipes/getting-started.adoc b/docs/modules/ROOT/pages/pipes/getting-started.adoc
index 6ee6c45148..e52e02f1ac 100644
--- a/docs/modules/ROOT/pages/pipes/getting-started.adoc
+++ b/docs/modules/ROOT/pages/pipes/getting-started.adoc
@@ -64,7 +64,9 @@ pipeline:
 ----
 include::example$pipes-fs-pipeline.json[]
 ----
-icon:github[] https://github.com/apache/tika/blob/main/tika-pipes/tika-pipes-plugins/tika-pipes-file-system/src/test/resources/config-examples/file-system-pipeline.json[View source on GitHub]
+icon:github[] https://github.com/apache/tika/blob/main/tika-pipes/tika-pipes-integration-tests/src/test/resources/configs/tika-config-basic.json[View source on GitHub]
+
+NOTE: The values shown like `FETCHER_BASE_PATH`, `EMITTER_BASE_PATH`, and `PLUGINS_PATHS` are placeholders the integration tests substitute at runtime. Replace them with real paths in your own config.
 
 Run it with:
 
diff --git a/docs/modules/ROOT/pages/pipes/plugins/index.adoc b/docs/modules/ROOT/pages/pipes/plugins/index.adoc
new file mode 100644
index 0000000000..8542fa2034
--- /dev/null
+++ b/docs/modules/ROOT/pages/pipes/plugins/index.adoc
@@ -0,0 +1,133 @@
+//
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+
+= Pipes Plugins
+
+Tika Pipes is extensible through plugins. Each plugin lives in its own Maven module and can implement one or more of the four pipes extension points:
+
+* **Fetcher** — retrieves document bytes from a source.
+* **Emitter** — writes parsed results to a destination.
+* **Iterator** (`PipesIterator`) — enumerates documents to process as `FetchEmitTuple` records.
+* **Reporter** (`PipesReporter`) — records per-document processing status.
+
+Many plugins implement more than one (e.g., the S3 plugin provides fetcher, emitter, and iterator). The pages below document each plugin once, with one section per implemented interface.
+
+== Plugin / Interface Matrix
+
+[cols="2,1,1,1,1"]
+|===
+|Plugin |Fetcher |Emitter |Iterator |Reporter
+
+|xref:pipes/plugins/filesystem.adoc[File System]
+|✓
+|✓
+|✓
+|✓
+
+|xref:pipes/plugins/s3.adoc[Amazon S3]
+|✓
+|✓
+|✓
+|—
+
+|xref:pipes/plugins/gcs.adoc[Google Cloud Storage]
+|✓
+|✓
+|✓
+|—
+
+|xref:pipes/plugins/azblob.adoc[Azure Blob Storage]
+|✓
+|✓
+|✓
+|—
+
+|xref:pipes/plugins/opensearch.adoc[OpenSearch]
+|—
+|✓
+|—
+|✓
+
+|xref:pipes/plugins/elasticsearch.adoc[Elasticsearch]
+|—
+|✓
+|—
+|✓
+
+|xref:pipes/plugins/solr.adoc[Solr]
+|—
+|✓
+|✓
+|—
+
+|xref:pipes/plugins/jdbc.adoc[JDBC]
+|—
+|✓
+|✓
+|✓
+
+|xref:pipes/plugins/kafka.adoc[Kafka]
+|—
+|✓
+|✓
+|—
+
+|xref:pipes/plugins/http.adoc[HTTP]
+|✓
+|—
+|—
+|—
+
+|xref:pipes/plugins/google-drive.adoc[Google Drive]
+|✓
+|—
+|—
+|—
+
+|xref:pipes/plugins/microsoft-graph.adoc[Microsoft Graph]
+|✓
+|—
+|—
+|—
+
+|xref:pipes/plugins/atlassian-jwt.adoc[Atlassian JWT]
+|✓
+|—
+|—
+|—
+
+|xref:pipes/plugins/csv.adoc[CSV]
+|—
+|—
+|✓
+|—
+
+|xref:pipes/plugins/json.adoc[JSON]
+|—
+|—
+|✓
+|—
+|===
+
+== Interface Overviews
+
+For descriptions of the interfaces themselves — their contracts, the shared concepts (`FetchKey`, `FetchEmitTuple`, `baseConfig`, etc.), and how they fit into a pipeline — see:
+
+* xref:pipes/fetchers.adoc[Fetchers]
+* xref:pipes/emitters.adoc[Emitters]
+* xref:pipes/iterators.adoc[Pipes Iterators]
+* xref:pipes/reporters.adoc[Pipes Reporters]
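
Reviewer note: the new index page describes four extension points (fetcher, emitter, iterator, reporter) without showing how they cooperate. The sketch below is a rough, language-neutral illustration of that division of labour, written in Python for brevity. It is not Tika's API (Tika is a Java project): every class, method, and field name here, including the simplified `FetchEmitTuple` with `fetch_key` and `emit_key`, is a hypothetical stand-in, and the "parse" step is a placeholder that just upper-cases text.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class FetchEmitTuple:
    """Hypothetical stand-in: pairs a key to fetch with a key to emit under."""
    fetch_key: str
    emit_key: str


class InMemoryFetcher:
    """Fetcher role: retrieves document bytes from a source."""

    def __init__(self, source: Dict[str, bytes]) -> None:
        self.source = source

    def fetch(self, fetch_key: str) -> bytes:
        return self.source[fetch_key]  # raises KeyError if the document is absent


class InMemoryEmitter:
    """Emitter role: writes parsed results to a destination."""

    def __init__(self) -> None:
        self.destination: Dict[str, str] = {}

    def emit(self, emit_key: str, parsed: str) -> None:
        self.destination[emit_key] = parsed


class InMemoryIterator:
    """Iterator role: enumerates the documents to process as tuples."""

    def __init__(self, keys: List[str]) -> None:
        self.keys = keys

    def __iter__(self) -> Iterator[FetchEmitTuple]:
        for key in self.keys:
            yield FetchEmitTuple(fetch_key=key, emit_key=key + ".json")


class ListReporter:
    """Reporter role: records per-document processing status."""

    def __init__(self) -> None:
        self.statuses: List[Tuple[str, str]] = []

    def report(self, fetch_key: str, status: str) -> None:
        self.statuses.append((fetch_key, status))


def run_pipeline(iterator, fetcher, emitter, reporter) -> None:
    """Drive one pass: iterate, fetch, 'parse', emit, and report each document."""
    for t in iterator:
        try:
            raw = fetcher.fetch(t.fetch_key)
            parsed = raw.decode("utf-8").upper()  # placeholder for real parsing
            emitter.emit(t.emit_key, parsed)
            reporter.report(t.fetch_key, "OK")
        except Exception as exc:
            # A failure on one document is reported and does not stop the run.
            reporter.report(t.fetch_key, "FAILED: " + type(exc).__name__)


source = {"a.txt": b"hello", "b.txt": b"world"}
fetcher = InMemoryFetcher(source)
emitter = InMemoryEmitter()
reporter = ListReporter()
run_pipeline(InMemoryIterator(["a.txt", "missing.txt", "b.txt"]),
             fetcher, emitter, reporter)
```

In this toy run the reporter ends up with one status per enumerated document, including the missing one, while the emitter holds results only for the two documents that could be fetched. Keeping the four roles behind separate objects is what would let a single plugin supply only some of them, as the matrix in the new page shows.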
