Adrian Bird created TIKA-4746:
---------------------------------

             Summary: tika-4.0.0-alpha1 - General Documentation Comments
                 Key: TIKA-4746
                 URL: https://issues.apache.org/jira/browse/TIKA-4746
             Project: Tika
          Issue Type: Bug
    Affects Versions: 4.0.0
            Reporter: Adrian Bird


Here are some comments and thoughts from looking at the updated, and
unchanged, documentation. Some comments may not be valid if associated code
changes have also been made (I haven't checked).
I've only really looked at the Tika App / Pipes / File System combination,
although I have skimmed most of the others.

On the web documentation pages there is a static 'Contents' table in the top 
right that doesn't move when you scroll down. It is missing on the following 
pages:
- using-tika/index.adoc
- configuration/index.adoc
- maintainers/index.adoc
Also, these pages don't open when you click the closed triangle image.

General point - sometimes you use 'Solr' and 'Kafka' and sometimes 'Apache 
Solr' and 'Apache Kafka'. Should they all be one or the other?

using-tika/cli/index.adoc
- Command Line Options - this lists a subset of the options. Shouldn't it list
all of them, i.e. cover the same list that `--help` outputs?
- Tika Pipes processing (the first one) - I think this could be removed as it 
is covered in detail later.
- Extract Markdown from a file - I think `Extract Markdown from a file` would 
fit better after `Extract metadata as JSON`.
- How Pipes mode is activated (2nd bullet) - with the released code I get an 
exception if I specify both `--input` and `--output`.
- How Pipes mode is activated (2nd bullet) - some of the options are not in the 
Batch Options list below.
- Tika Pipes Options - I would expect this list to match what is output when 
doing `--help`.
- Tika Pipes Examples - these examples are formatted differently from the
ones above.

pipes/index.adoc
- question - why is there a section on Emitters in this page, rather than in 
the Emitters page?

pipes/getting-started.adoc
- JSON Configuration - 1st Note - EMIT_INTERMEDIATE_RESULTS is also a 
placeholder token
- JSON Configuration example - there should be an '=' in
'--config tika-config.json', i.e. '--config=tika-config.json'.

pipes/iterators.adoc
- why does it say 'they are not wrapped in a baseConfig block'? This is the
only mention of 'baseConfig' in the documentation.

pipes/configuration.adoc
- Filesystem-to-filesystem pipeline - EMIT_INTERMEDIATE_RESULTS is also a 
placeholder token

pipes/parse-modes.adoc
- Content Handler Types - this mentions 'ContentHandlerFactory' and 
'parseContext' which seem like Java names and not JSON Config names.
- CLI Usage - should it be "The tika-app pipes processor ..." rather than
'batch'?

pipes/unpack-config.adoc
- Quick Start - 'ParseMode.UNPACK' doesn't reflect what is in the config.
- Configuration Options - this should say that these options are defined within 
the 'unpack-config' key.
- Enabling Frictionless Output - is 'UnpackConfig' ok here, or should it be
'unpack-config'?
- CLI Usage - I don't see the '--unpack' option in the `--help` output
 
pipes/timeouts.adoc
- CLI Usage - the output from '--help' doesn't seem to show that '--fork' etc. 
can be used in Pipes mode.

pipes/troubleshooting.adoc
- Log levels and sensitive data - I didn't see any documentation about logging 
in general and setting log levels.

pipes/plugins/filesystem.adoc
- Complete Pipeline Example - EMIT_INTERMEDIATE_RESULTS is also a placeholder 
token
- File System Reporter (file-system-reporter) - Configuration - statusFile -
Does this have to be an absolute path rather than a relative path? If so, it
would be worth saying.
- Status file schema - counts - in my file I'm seeing 'statusCounts' and not 
'counts'
- Status file schema - timestamp - in my file I'm seeing 'lastUpdate' rather 
than 'timestamp'
- Status file schema - in my file I'm also seeing 'started'
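
To make the three status-file comments above easy to re-check, here is a minimal Python sketch that compares the top-level keys of a status file against the documented ones. It is only an illustration: the documented key set holds just the two names mentioned above ('counts', 'timestamp'), the real schema may list more, and the helper names and the file path argument are hypothetical, not part of Tika.

```python
import json
import sys
from pathlib import Path

# Assumption: only the documented keys named in the comments above.
DOCUMENTED_KEYS = {"counts", "timestamp"}


def diff_keys(documented, observed):
    """Return (missing, unexpected) top-level keys as sorted lists."""
    documented = set(documented)
    observed = set(observed)
    missing = sorted(documented - observed)      # documented but not in the file
    unexpected = sorted(observed - documented)   # in the file but not documented
    return missing, unexpected


def check_status_file(path):
    """Load a status file (hypothetical path) and compare its top-level keys."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return diff_keys(DOCUMENTED_KEYS, data)


if __name__ == "__main__" and len(sys.argv) > 1:
    missing, unexpected = check_status_file(sys.argv[1])
    print("documented but absent:", missing)
    print("present but undocumented:", unexpected)
```

Run against the status file I have, I would expect 'counts' and 'timestamp' to be reported as absent, and 'statusCounts', 'lastUpdate' and 'started' as undocumented.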

configuration/index.adoc
- I don't see a general overview of the configuration structure here (I know
Pipes configuration is covered elsewhere). A user who is new to Tika and
starting with V4 needs more of an overview than is currently here, e.g. it
should cover the top-level keys in a JSON config file.
- There are no links to the VML Parsers, External Parser and Tess4J OCR pages.

configuration/digesters.adoc
- Supported Algorithms - the output from '--help' does not mention the last 
three - should it?
 
migration-to-4x/serialization-4x.adoc
- Friendly Naming Convention - not really specific to this page, but how do
users know what the friendly names are? The '--list-*' options all produce
class names.

advanced/index.adoc 
- it looks like all the topic entries are geared towards using the Java API and 
aren't available through the CLI and JSON configuration, with the exception of 
'Setting Limits'. Is it worth adding text to this effect?
 
advanced/language-detection.adoc 
- Overriding Model Selection - says 'Or via Tika’s JSON configuration mechanism
if you are using SelfConfiguring component loading' - how can it be specified
in a JSON config file?

 


