This is an automated email from the ASF dual-hosted git repository.

tballison pushed a commit to branch docs/pipes-updates
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 87b5cc2bbc386e7bbbeff815570f5f4efed042f9
Author: tallison <[email protected]>
AuthorDate: Mon May 11 11:43:31 2026 -0400

    add file system docs
---
 .../ROOT/pages/pipes/plugins/filesystem.adoc       | 255 +++++++++++++++++++++
 1 file changed, 255 insertions(+)

diff --git a/docs/modules/ROOT/pages/pipes/plugins/filesystem.adoc b/docs/modules/ROOT/pages/pipes/plugins/filesystem.adoc
new file mode 100644
index 0000000000..85fba5889e
--- /dev/null
+++ b/docs/modules/ROOT/pages/pipes/plugins/filesystem.adoc
@@ -0,0 +1,255 @@
+//
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+
+= File System Plugin
+:toc:
+:toclevels: 3
+
+The File System plugin (`tika-pipes-file-system`) is the most common starting point for Tika Pipes. It provides all four interfaces — fetcher, emitter, iterator, and reporter — backed by the local (or mounted) filesystem.
+
+[cols="2,1,3"]
+|===
+|Interface |Component name |Class
+
+|Fetcher
+|`file-system-fetcher`
+|`FileSystemFetcher`
+
+|Emitter
+|`file-system-emitter`
+|`FileSystemEmitter`
+
+|Iterator
+|`file-system-pipes-iterator`
+|`FileSystemPipesIterator`
+
+|Reporter
+|`file-system-reporter`
+|`FileSystemStatusReporter`
+|===
+
+== Complete Pipeline Example
+
+The example below is the canonical filesystem-to-filesystem integration test config. Tokens like `FETCHER_BASE_PATH`, `EMITTER_BASE_PATH`, and `PLUGINS_PATHS` are placeholders the test harness substitutes; replace them with real paths in your own config.
+
+[source,json,subs=none]
+----
+include::example$pipes-fs-pipeline.json[]
+----
+
+icon:github[] https://github.com/apache/tika/blob/main/tika-pipes/tika-pipes-integration-tests/src/test/resources/configs/tika-config-basic.json[View source on GitHub]
+
+[#file-system-fetcher]
+== File System Fetcher (`file-system-fetcher`)
+
+Reads files from a local or mounted filesystem. Fetch keys are resolved relative to `basePath`.
+
+[source,json]
+----
+{
+  "fetchers": {
+    "fsf": {
+      "file-system-fetcher": {
+        "basePath": "/data/input",
+        "extractFileSystemMetadata": true
+      }
+    }
+  }
+}
+----
+
+The outer key (`fsf`) is the fetcher ID — referenced by `pipesIterator.fetcherId` elsewhere in the config.
+
+=== Configuration
+
+[cols="1,1,3"]
+|===
+|Field |Default |Description
+
+|`basePath`
+|_required_
+|Base directory for fetch operations. Fetch keys are resolved relative to this path.
+
+|`extractFileSystemMetadata`
+|`false`
+|When `true`, attach file size, created, and modified timestamps to the metadata of each fetched document.
+
+|`allowAbsolutePaths`
+|`false`
+|When `true`, fetch keys may be absolute paths and `basePath` may be omitted. Use sparingly — see <<security-notes>>.
+|===
+
+[#file-system-emitter]
+== File System Emitter (`file-system-emitter`)
+
+Writes parsed results as files under `basePath`. The relative output path is derived from the emit key of each `FetchEmitTuple`.
+
+[source,json]
+----
+{
+  "emitters": {
+    "fse": {
+      "file-system-emitter": {
+        "basePath": "/data/output",
+        "fileExtension": "json",
+        "onExists": "EXCEPTION",
+        "prettyPrint": false
+      }
+    }
+  }
+}
+----
+
+=== Configuration
+
+[cols="1,1,3"]
+|===
+|Field |Default |Description
+
+|`basePath`
+|_required_
+|Base output directory. The emit key is resolved relative to this path.
+
+|`fileExtension`
+|`json`
+|Extension appended to each output file. For `CONTENT_ONLY` mode, set this to match the handler type (`txt`, `html`, `md`, `xml`).
+
+|`onExists`
+|`EXCEPTION`
+|Behavior when the output file already exists: `SKIP` (do nothing), `REPLACE` (overwrite), `EXCEPTION` (fail loudly).
+
+|`prettyPrint`
+|`false`
+|Pretty-print JSON output. Has no effect in `CONTENT_ONLY` mode (raw bytes are written).
+|===
+
+[#file-system-iterator]
+== File System Iterator (`file-system-pipes-iterator`)
+
+Recursively walks a directory tree, emitting one `FetchEmitTuple` per file found.
+
+[source,json]
+----
+{
+  "pipes-iterator": {
+    "file-system-pipes-iterator": {
+      "basePath": "/data/input",
+      "countTotal": true,
+      "fetcherId": "fsf",
+      "emitterId": "fse"
+    }
+  }
+}
+----
+
+=== Configuration
+
+[cols="1,1,3"]
+|===
+|Field |Default |Description
+
+|`basePath`
+|_required_
+|Root directory to walk.
+
+|`countTotal`
+|`true`
+|If `true`, walks the tree once to count files before processing begins. Enables progress reporting at the cost of an extra scan over the tree.
+
+|`fetcherId` / `emitterId`
+|_required_
+|IDs of the fetcher and emitter to bind to each emitted tuple. See xref:pipes/iterators.adoc[Pipes Iterators] for the shared iterator contract.
+|===
+
+=== Notes
+
+* Walk order is filesystem-dependent and not guaranteed stable across runs.
+* The relative path of each file (from `basePath`) becomes the fetch key, and by default also the emit key.
+* Symbolic links are followed.
+
+[#file-system-reporter]
+== File System Reporter (`file-system-reporter`)
+
+Maintains a JSON status file that summarizes pipeline progress. The reporter writes the file periodically on a background thread; per-record `report()` calls only update in-memory counters.
+
+[source,json]
+----
+{
+  "pipes-reporters": {
+    "file-system-reporter": {
+      "statusFile": "/var/log/tika/status.json",
+      "reportUpdateMs": 1000
+    }
+  }
+}
+----
+
+`pipes-reporters` accepts multiple reporters keyed by type name — see xref:pipes/reporters.adoc[Pipes Reporters] for how multiple reporters compose.
+
+=== Configuration
+
+[cols="1,1,3"]
+|===
+|Field |Default |Description
+
+|`statusFile`
+|_required_
+|Path of the JSON status file. The file is created on first write and overwritten in place.
+
+|`reportUpdateMs`
+|_no default_
+|Interval in milliseconds between status-file writes. Typical values: `1000` for a low-overhead heartbeat, `100` for near-real-time updates. There is no built-in default — always set this explicitly.
+|===
+
+=== Status file schema
+
+The reporter serializes an `AsyncStatus` object to JSON, containing:
+
+* `asyncStatus` — current pipeline phase (`STARTED`, `COMPLETED`, `CRASHED`).
+* `counts` — map of `RESULT_STATUS` to count (e.g., `PARSE_SUCCESS`, `PARSE_EXCEPTION`, `TIMEOUT`, `OOM`).
+* `totalCountResult` — total documents processed and whether the enumeration is complete.
+* `timestamp` — when the file was last written.
+* `crashMessage` — populated only on fatal pipeline failure.
+
+The file is rewritten in full on each tick, not appended.
+
+[#watching]
+=== Live status for watching applications
+
+The reporter is designed to support external "watchers" — UIs, dashboards, or monitoring scripts that poll the status file to display pipeline progress. To use it that way, set `reportUpdateMs` to match your desired refresh rate:
+
+[source,json]
+----
+"reportUpdateMs": 250
+----
+
+The watcher polls `statusFile` on its own interval and reads the most recent snapshot. Because the file is rewritten in full with the latest status, watchers do not need to handle partial reads.
+
+This pattern is used by `tika-gui-v2` to drive its progress UI: the GUI starts a pipeline subprocess, points the reporter at a temp file, and polls that file every few hundred milliseconds.
+
+Tradeoffs:
+
+* Smaller `reportUpdateMs` values mean more disk writes. On a fast SSD this is negligible, but on a slow disk (or NFS) the writer thread can become a bottleneck.
+* The reporter thread sleeps between writes, so the worst-case staleness of the file is `reportUpdateMs` milliseconds plus serialization time.
+* Per-record `report()` calls are cheap (counter increment only). The cost of "watching" is bounded by the periodic write, not by document throughput.
+
+[#security-notes]
+== Security Notes
+
+* **`basePath` is a sandbox boundary.** The fetcher and emitter reject fetch/emit keys that resolve outside `basePath`. Do not set `allowAbsolutePaths=true` unless the source of fetch keys is fully trusted — an attacker-controlled fetch key could otherwise read arbitrary files.
+* **Symlinks are followed.** A symlink under `basePath` pointing outside `basePath` may still be readable. If you need strict containment, do not allow symlinks in your input tree.
+* **Output directories are created automatically.** The emitter creates intermediate directories as needed. Make sure the process's umask is appropriate for the data being written.
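
The containment rule in the Security Notes can be illustrated with a short sketch. It is written in Python rather than the plugin's Java, and `resolve_under_base` is a hypothetical helper, not part of Tika; it only shows one way a key could be checked against a base directory. The check is lexical (it normalizes `..` segments but does not follow symlinks), so a symlink under the base that points elsewhere would still pass, which mirrors the symlink caveat in the notes.

```python
import os


def resolve_under_base(base_path: str, key: str) -> str:
    """Join key onto base_path and reject results that fall outside the base.

    Hypothetical helper for illustration only; not the Tika implementation.
    """
    base = os.path.normpath(os.path.abspath(base_path))
    # os.path.join discards `base` when `key` is absolute, so absolute keys
    # are caught by the same containment check as "../" traversal.
    candidate = os.path.normpath(os.path.join(base, key))
    # commonpath compares whole path components, so "/data/input2" is not
    # mistaken for a child of "/data/input".
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"key resolves outside base path: {key!r}")
    return candidate
```

Resolving both paths with `os.path.realpath` instead of `os.path.normpath` would give a stricter variant that also rejects symlinks leading out of the base directory.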

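The watcher pattern described under "Live status for watching applications" might look like the following Python sketch. `read_status` and `watch` are hypothetical names, and the JSON keys are assumed to match the field names listed in the status file schema (`asyncStatus`, `counts`); check them against a real status file before relying on them. The reader returns `None` when the file does not exist yet (it is created on first write) or cannot be parsed.

```python
import json
import time
from pathlib import Path
from typing import Optional


def read_status(status_file: str) -> Optional[dict]:
    """Return the latest snapshot, or None if the file is missing or unparsable."""
    try:
        return json.loads(Path(status_file).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def watch(status_file: str, poll_seconds: float = 0.25,
          max_polls: int = 1000) -> Optional[dict]:
    """Poll until the pipeline leaves the STARTED phase or max_polls is hit.

    Key names and phase values are assumptions taken from the schema list.
    """
    last = None
    for _ in range(max_polls):
        snapshot = read_status(status_file)
        if snapshot is not None:
            last = snapshot
            print(snapshot.get("asyncStatus"), snapshot.get("counts"))
            if snapshot.get("asyncStatus") in ("COMPLETED", "CRASHED"):
                break
        time.sleep(poll_seconds)
    return last
```

A watcher would typically pick `poll_seconds` close to `reportUpdateMs / 1000`, since polling faster than the reporter writes only re-reads the same snapshot.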