branch: externals/doc-toc
commit b45b78102c285b0b0f2d38b74a16ada2b9c9bb23
Author: Daniel Nicolai <dalanico...@gmail.com>
Commit: Daniel Nicolai <dalanico...@gmail.com>

    Update README, add extract-only documentation
---
 README.org  | 9 ++++++++-
 toc-mode.el | 3 +++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/README.org b/README.org
index cb31fac9ea..6ee13df2b4 100644
--- a/README.org
+++ b/README.org
@@ -52,7 +52,14 @@ data). Also the languages used for tesseract OCR can be customized via the
 
 A buffer with the, somewhat cleaned up, extracted text will open in TOC-cleanup
 mode. Prefix command with the universal argument (=C-u=) to omit clean and get the
-raw text.
+raw text. If the extracted text is of too low quality you either can hack/extend
+the [[help:toc-extract-pages-ocr][toc-extract-pages-ocr]] definition, or alternatively you can try to extract
+the text with the [[https://pypi.org/project/document-contents-extractor/][python document-contents-extractor script]], which is more
+configurable (you are also welcome to hack on and improve that script).
+For this the [[https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html][tesseract]] documentation might be useful.
+
+If you merely want to extract text without further processing then you can
+use the command [[help:toc-extract-only][toc-extract-only]].
 
 ** 2. TOC-Cleanup
 In this mode you can further cleanup the contents to create a list where
diff --git a/toc-mode.el b/toc-mode.el
index 4d6f2f19c1..d2075a210a 100644
--- a/toc-mode.el
+++ b/toc-mode.el
@@ -63,6 +63,9 @@
 ;; `https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html' might be
 ;; useful.
 
+;; If you merely want to extract text without further processing then you can
+;; use the command `toc-extract-only'.
+
 ;; 2. TOC-Cleanup In this mode you can further cleanup the contents to create a
 ;; list where each line has the structure:
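
Usage note on the documentation added in this commit: a minimal init-file sketch for making `toc-extract-only' reachable from a key. The key `C-c t x' is an arbitrary, hypothetical choice. The autoload assumes that the command is defined in toc-mode.el (the file patched above) and that the file is on `load-path'. The diff itself confirms neither.

```elisp
;; Assumption: `toc-extract-only' is defined in toc-mode.el, and that
;; file is on `load-path'.  The autoload defers loading until first use.
(autoload 'toc-extract-only "toc-mode" nil t)

;; Hypothetical binding; the package itself does not define this key.
(global-set-key (kbd "C-c t x") #'toc-extract-only)
```

Without any binding, the command can still be run with `M-x toc-extract-only'.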