[ 
https://issues.apache.org/jira/browse/TIKA-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085183#comment-18085183
 ] 

ASF GitHub Bot commented on TIKA-4746:
--------------------------------------

tballison opened a new pull request, #2852:
URL: https://github.com/apache/tika/pull/2852

   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-XXXX`)
     - is referenced in the title of the pull request
     - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-XXXX] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `./mvnw clean 
test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> tika-4.0.0-alpha1 - General Documentation Comments
> --------------------------------------------------
>
>                 Key: TIKA-4746
>                 URL: https://issues.apache.org/jira/browse/TIKA-4746
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Adrian Bird
>            Priority: Major
>
> Here are some comments/thoughts etc. from looking at the updated, and 
> unchanged, documentation. Some comments may not be valid if associated code 
> changes have also been made (which I haven't checked). 
> I've only really looked at the Tika App / Pipes / File System combination, 
> although I have skimmed most of the others.
> On the web documentation pages there is a static 'Contents' table in the top 
> right that doesn't move when you scroll down. It is missing on the following 
> pages:
> - using-tika/index.adoc
> - configuration/index.adoc
> - maintainers/index.adoc
> Also, these pages don't open when you click the closed triangle image.
> General point - sometimes you use 'Solr' and 'Kafka' and sometimes 'Apache 
> Solr' and 'Apache Kafka'. Should they all be one or the other?
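
One way to gauge how mixed the naming is would be a quick scan of the page text. The sketch below is only an illustration under stated assumptions: the function name `count_product_names` and the sample sentence are hypothetical, and it counts bare versus 'Apache'-prefixed mentions in a string rather than walking the real documentation tree.

```python
import re

def count_product_names(text, products=("Solr", "Kafka")):
    """Count bare and 'Apache'-prefixed mentions of each product name."""
    counts = {}
    for name in products:
        prefixed = len(re.findall(rf"\bApache {name}\b", text))
        # Negative lookbehind skips mentions already preceded by "Apache ".
        bare = len(re.findall(rf"(?<!Apache )\b{name}\b", text))
        counts[name] = {"prefixed": prefixed, "bare": bare}
    return counts

# Hypothetical sample text, not taken from the Tika documentation.
sample = "Emit to Apache Solr or Kafka. Solr cores and Apache Kafka topics."
print(count_product_names(sample))
```

Running something like this over each `.adoc` page would show where the two forms are mixed, which could help decide on one convention.
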
> using-tika/cli/index.adoc
> - Command Line Options - this lists a subset of the options. Shouldn't it 
> list all of them, i.e. cover the same list that is output when doing `--help`?
> - Tika Pipes processing (the first one) - I think this could be removed as it 
> is covered in detail later.
> - Extract Markdown from a file - I think `Extract Markdown from a file` would 
> fit better after `Extract metadata as JSON`.
> - How Pipes mode is activated (2nd bullet) - with the released code I get an 
> exception if I specify both `--input` and `--output`.
> - How Pipes mode is activated (2nd bullet) - some of the options are not in 
> the Batch Options list below.
> - Tika Pipes Options - I would expect this list to match what is output when 
> doing `--help`.
> - Tika Pipes Examples - the formatting of these examples differs from that of 
> the ones above.
> pipes/index.adoc
> - question - why is there a section on Emitters in this page, rather than in 
> the Emitters page?
> pipes/getting-started.adoc
> - JSON Configuration - 1st Note - EMIT_INTERMEDIATE_RESULTS is also a 
> placeholder token.
> - JSON Configuration example - there should be a '=' in '--config 
> tika-config.json'
> pipes/iterators.adoc
> - why does it say 'they are not wrapped in a baseConfig block'? This is the 
> only mention of 'baseConfig' in the documentation.
> pipes/configuration.adoc
> - Filesystem-to-filesystem pipeline - EMIT_INTERMEDIATE_RESULTS is also a 
> placeholder token.
> pipes/parse-modes.adoc
> - Content Handler Types - this mentions 'ContentHandlerFactory' and 
> 'parseContext' which seem like Java names and not JSON Config names.
> - CLI Usage - should it be "The tika-app pipes processor ..." rather than 
> 'batch'?
> pipes/unpack-config.adoc
> - Quick Start - 'ParseMode.UNPACK' doesn't reflect what is in the config.
> - Configuration Options - this should say that these options are defined 
> within the 'unpack-config' key.
> - Enabling Frictionless Output - is 'UnpackConfig' ok here or should it be 
> 'unpack-config'?
> - CLI Usage - I don't see the '--unpack' option in the `--help` output.
>  
> pipes/timeouts.adoc
> - CLI Usage - the output from '--help' doesn't seem to show that '--fork' 
> etc. can be used in Pipes mode.
> pipes/troubleshooting.adoc
> - Log levels and sensitive data - I didn't see any documentation about 
> logging in general and setting log levels.
> pipes/plugins/filesystem.adoc
> - Complete Pipeline Example - EMIT_INTERMEDIATE_RESULTS is also a placeholder 
> token.
> - File System Reporter (file-system-reporter) - Configuration - statusFile - 
> Does this have to be an absolute path and not a relative path? If so, it would 
> be worth saying.
> - Status file schema - counts - in my file I'm seeing 'statusCounts' and not 
> 'counts'
> - Status file schema - timestamp - in my file I'm seeing 'lastUpdate' rather 
> than 'timestamp'
> - Status file schema - in my file I'm also seeing 'started'
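
The mismatch described in the status file bullets above could be checked mechanically. The following is a minimal sketch, assuming the status file is a flat JSON object. The key names come only from this report, the helper name `compare_status_keys` is hypothetical, and the example content uses placeholder values rather than real reporter output.

```python
import json

# Key names taken from the report: the docs are said to list the first set,
# while the reporter's generated status file contained the second.
DOCUMENTED_KEYS = {"counts", "timestamp"}
OBSERVED_KEYS = {"statusCounts", "lastUpdate", "started"}

def compare_status_keys(status_json):
    """Report which documented and which observed keys appear in a status file."""
    present = set(json.loads(status_json))
    return {
        "documented_present": sorted(present & DOCUMENTED_KEYS),
        "observed_present": sorted(present & OBSERVED_KEYS),
        "other": sorted(present - DOCUMENTED_KEYS - OBSERVED_KEYS),
    }

# Hypothetical file content; the values are placeholders, not real output.
example = '{"statusCounts": {}, "lastUpdate": "", "started": ""}'
print(compare_status_keys(example))
```

An empty `documented_present` list for a real status file would confirm that the schema in the documentation and the emitted file have drifted apart.
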
> configuration/index.adoc
> - I don't see a general overview of the configuration structure here (I know 
> Pipes configuration is covered elsewhere). If a user new to Tika comes here 
> and is starting with V4, they need more of an overview than is currently here, 
> e.g. it should cover the top-level keys in a JSON config file.
> - There are no links to the VML Parsers, External Parser and Tess4J OCR pages.
> configuration/digesters.adoc
> - Supported Algorithms - the output from '--help' does not mention the last 
> three - should it?
>  
> migration-to-4x/serialization-4x.adoc
> - Friendly Naming Convention - not really specific to this page, but how do 
> users know what the friendly names are? The '--list-*' options all 
> produce class names. 
> advanced/index.adoc 
> - it looks like all the topic entries are geared towards using the Java API 
> and aren't available through the CLI and JSON configuration, with the 
> exception of 'Setting Limits'. Is it worth adding text to this effect?
>  
> advanced/language-detection.adoc 
> - Overriding Model Selection - says 'Or via Tika’s JSON configuration 
> mechanism if you are using SelfConfiguring component loading' - how can it be 
> specified in a JSON config file?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
