Adrian Bird created TIKA-4746:
---------------------------------
Summary: tika-4.0.0-alpha1 - General Documentation Comments
Key: TIKA-4746
URL: https://issues.apache.org/jira/browse/TIKA-4746
Project: Tika
Issue Type: Bug
Affects Versions: 4.0.0
Reporter: Adrian Bird
Here are some comments and thoughts from looking at the documentation, both
the updated and the unchanged pages. Some comments may not be valid if
associated code changes have also been made (which I haven't checked).
I've only really looked at the Tika App / Pipes / File System combination,
although I have skimmed most of the others.
On the web documentation pages there is a static 'Contents' table in the top
right that doesn't move when you scroll down. It is missing on the following
pages:
- using-tika/index.adoc
- configuration/index.adoc
- maintainers/index.adoc
Also, these pages don't open when you click the closed triangle image.
General point - sometimes you use 'Solr' and 'Kafka' and sometimes 'Apache
Solr' and 'Apache Kafka'. Should they all be one or the other?
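A quick way to find the mixed usage would be to scan the .adoc sources and count bare versus 'Apache'-prefixed names. The snippet below is only a rough sketch: the function name and the sample sentence are made up for illustration, and it would need to be run over the real page text.

```python
import re
from collections import Counter

PRODUCTS = ("Solr", "Kafka")

def count_name_styles(text, products=PRODUCTS):
    """Count bare ('Solr') versus prefixed ('Apache Solr') mentions."""
    counts = Counter()
    for name in products:
        prefixed = len(re.findall(rf"\bApache\s+{name}\b", text))
        total = len(re.findall(rf"\b{name}\b", text))
        counts[f"Apache {name}"] = prefixed
        # Bare mentions are all mentions minus the prefixed ones.
        counts[name] = total - prefixed
    return counts

# Hypothetical sample text, not taken from the Tika documentation.
sample = "Send output to Apache Solr. Solr must be running. Kafka is optional."
result = count_name_styles(sample)
```

A page where both the bare and the prefixed count are non-zero for the same product is a candidate for clean-up.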
using-tika/cli/index.adoc
- Command Line Options - this lists only a subset of the options. Shouldn't it
list all of them, i.e. the same list that `--help` outputs?
- Tika Pipes processing (the first one) - I think this could be removed as it
is covered in detail later.
- Extract Markdown from a file - I think this section would fit better after
`Extract metadata as JSON`.
- How Pipes mode is activated (2nd bullet) - with the released code I get an
exception if I specify both `--input` and `--output`.
- How Pipes mode is activated (2nd bullet) - some of the options are not in the
Batch Options list below.
- Tika Pipes Options - I would expect this list to match what is output when
doing `--help`.
- Tika Pipes Examples - the formatting of these examples differs from that of
the ones above.
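Several of the points above come down to the documented options not matching the `--help` output. As a rough sketch of how that could be checked mechanically, the snippet below pulls long options out of two pieces of text and reports the differences. The function names and both sample strings are made up for illustration; the samples are not real documentation text or real `--help` output.

```python
import re

def long_options(text):
    """Return the set of long options (tokens such as --input) found in text."""
    return set(re.findall(r"(?<![\w-])--[A-Za-z][\w-]*", text))

def compare_options(doc_text, help_text):
    """Report options present in one text but absent from the other."""
    documented = long_options(doc_text)
    reported = long_options(help_text)
    return {
        "missing_from_docs": sorted(reported - documented),
        "missing_from_help": sorted(documented - reported),
    }

# Hypothetical snippets; only the option names come from this report.
doc_sample = "Use --input and --output, or pass --unpack."
help_sample = "--input <dir>  --output <dir>  --config <file>"
diff = compare_options(doc_sample, help_sample)
```

With the real page text and the real `--help` output as inputs, both lists should ideally come back empty.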
pipes/index.adoc
- question - why is there a section on Emitters in this page, rather than in
the Emitters page?
pipes/getting-started.adoc
- JSON Configuration - 1st Note - EMIT_INTERMEDIATE_RESULTS is also a
placeholder token
- JSON Configuration example - there should be an '=' in '--config
tika-config.json', i.e. '--config=tika-config.json'
pipes/iterators.adoc
- why does it say 'they are not wrapped in a baseConfig block'? This is the
only mention of 'baseConfig' in the documentation.
pipes/configuration.adoc
- Filesystem-to-filesystem pipeline - EMIT_INTERMEDIATE_RESULTS is also a
placeholder token
pipes/parse-modes.adoc
- Content Handler Types - this mentions 'ContentHandlerFactory' and
'parseContext' which seem like Java names and not JSON Config names.
- CLI Usage - should it be "The tika-app pipes processor ..." rather than
'batch'?
pipes/unpack-config.adoc
- Quick Start - 'ParseMode.UNPACK' doesn't reflect what is in the config.
- Configuration Options - this should say that these options are defined within
the 'unpack-config' key.
- Enabling Frictionless Output - is 'UnpackConfig' OK here, or should it be
'unpack-config'?
- CLI Usage - I don't see the '--unpack' option in the `--help` output.
pipes/timeouts.adoc
- CLI Usage - the output from '--help' doesn't seem to show that '--fork' etc.
can be used in Pipes mode.
pipes/troubleshooting.adoc
- Log levels and sensitive data - I didn't see any documentation about logging
in general and setting log levels.
pipes/plugins/filesystem.adoc
- Complete Pipeline Example - EMIT_INTERMEDIATE_RESULTS is also a placeholder
token
- File System Reporter (file-system-reporter) - Configuration - statusFile -
Does this have to be an absolute path rather than a relative one? If so, it
would be worth saying so.
- Status file schema - counts - in my file I'm seeing 'statusCounts' and not
'counts'
- Status file schema - timestamp - in my file I'm seeing 'lastUpdate' rather
than 'timestamp'
- Status file schema - in my file I'm also seeing 'started'
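To make the status file mismatch concrete, here is a small sketch that compares the documented key names with the keys in a status file. It is only an illustration: the function name is made up, the sample JSON uses placeholder values (only the key names come from my file), and the documented set lists just the two names mentioned above, so the real schema may document more.

```python
import json

# Key names as given in the documentation (only the two mentioned in this report).
DOCUMENTED_KEYS = {"counts", "timestamp"}

def compare_keys(status_json, documented=DOCUMENTED_KEYS):
    """Compare top-level keys of a status file against the documented names."""
    observed = set(json.loads(status_json).keys())
    return {
        "undocumented": sorted(observed - documented),
        "not_in_file": sorted(documented - observed),
    }

# Hypothetical status file content; values are placeholders, not real output.
sample = '{"statusCounts": {}, "lastUpdate": "", "started": ""}'
report = compare_keys(sample)
```

Either the documentation or the reporter output would need to change for both lists to be empty.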
configuration/index.adoc
- I don't see a general overview of the configuration structure here (I know
Pipes configuration is covered elsewhere). A user who is new to Tika and
starting with V4 needs more of an overview than is currently here, e.g. it
should cover the top-level keys in a JSON config file.
- There are no links to the VML Parsers, External Parser and Tess4J OCR pages.
configuration/digesters.adoc
- Supported Algorithms - the output from '--help' does not mention the last
three - should it?
migration-to-4x/serialization-4x.adoc
- Friendly Naming Convention - not really specific to this page, but how do
users know what the friendly names are? The '--list-*' options all produce
class names.
advanced/index.adoc
- it looks like all the topic entries are geared towards using the Java API and
aren't available through the CLI and JSON configuration, with the exception of
'Setting Limits'. Is it worth adding text to this effect?
advanced/language-detection.adoc
- Overriding Model Selection - says 'Or via Tika’s JSON configuration mechanism
if you are using SelfConfiguring component loading' - how can it be specified
in a JSON config file?