Re: [dev] reading an epub book with less: adventures in text processing

2024-03-11 Thread Viktor Grigorov


Rather late to the party and I've already forgotten the initial email. 
Nevertheless, I'll give the program I most use: epub2txt.[0] It's not perfect, 
but compared to calibre's ebook-convert, and everything else I found in C in 
github or codeberg or gitlab, it's the best. A once-over with an editor capable 
of multiple selection and editing is the most I've had to do. Faulty output 
includes, say, only a single letter rather than a whole word being capitalised 
or wrapped in '\e[...m' and '\e[0m'.

Protip: run it with -w 0 to get 'natural' paragraphs.


[0] https://github.com/kevinboone/epub2txt2
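
For the stray '\e[...m' sequences mentioned above, here is a minimal sketch of a filter that removes them before paging. It assumes the leftovers are ordinary SGR colour/style sequences (ESC [ ... m). The name strip_sgr is made up for this example and is not part of epub2txt.

```sh
# strip_sgr: remove ANSI SGR sequences (ESC [ ... m) from stdin.
# Hypothetical helper for illustration; not part of epub2txt.
strip_sgr() {
    esc=$(printf '\033')            # a literal ESC character
    sed "s/${esc}\[[0-9;]*m//g"     # drop ESC [ digits/semicolons m
}

# Possible use (book.epub is a placeholder file name):
#   epub2txt -w 0 book.epub | strip_sgr | less
```

This throws away all styling, so it only makes sense when the goal is clean plain text rather than formatted output.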





Re: [dev] reading an epub book with less: adventures in text processing

2024-03-11 Thread Κρακ Άουτ
On 2024-03-11 17:44 Greg Reagle  wrote:

> Now my next question is, what is the tool that does the *best* job of
> turning a PDF book into a readable text document?  Via html or
> docbook or markdown or whatever--doesn't matter.  My previous
> experience trying things out to achieve this goal is that it's just
> not worth it.  The output always winds up un-readable.

I use pdftotext from poppler-utils. It does quite a good job.

This is my main pdf reader command:
```
pdftotext -layout -nopgbrk "$1" - | less -MS --use-color
```
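
A sketch of the same command wrapped in a shell function, so it can be called with a file name. The function name pdfless and the readability check are additions for illustration, not part of the original command.

```sh
# pdfless: page a PDF as text with less.
# The name and the early check are made up for this example.
pdfless() {
    # Refuse early if the argument is missing or unreadable.
    [ -r "$1" ] || { echo "pdfless: cannot read '$1'" >&2; return 1; }
    # "$1" passes the file name as a single word, spaces included.
    pdftotext -layout -nopgbrk "$1" - | less -MS --use-color
}
```

Called as: pdfless some-book.pdf (the file name is a placeholder).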




Re: [dev] reading an epub book with less: adventures in text processing

2024-03-11 Thread Greg Reagle
On Sat, Mar 9, 2024, at 1:15 PM, Greg Minshall wrote:
> for some personal tastes/usage cases, this, using pandoc's `-t`
> option, might be minor-ly simpler:
> 
> man --local-file --pager 'less -ir' \
> <(pandoc --standalone -t man \
> 2015.31233.Arab-Geographers-Knowledge-Of-Southern-India.epub) | less
> 

Very cool command.  Good idea to use process substitution.  Here is another way 
of doing it:
pandoc --standalone -t man City_of_Truth-Morrow.epub | man /dev/stdin
but I don't know how portable /dev/stdin is.
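
If /dev/stdin turns out not to be available somewhere, one fallback is a temporary file. This is only a sketch: it assumes mktemp is installed and that man accepts --local-file (as man-db does). The name epubman is made up for the example.

```sh
# epubman: convert an epub to man format and view it, without /dev/stdin.
# Hypothetical helper; assumes mktemp and man-db's --local-file.
epubman() {
    tmp=$(mktemp) || return 1
    # Convert first; only start man if pandoc succeeded.
    pandoc --standalone -t man "$1" > "$tmp" && man --local-file "$tmp"
    status=$?
    rm -f "$tmp"                    # clean up in both cases
    return "$status"
}
```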



Re: [dev] reading an epub book with less: adventures in text processing

2024-03-11 Thread Greg Reagle
On Sat, Mar 9, 2024, at 4:06 PM, Georg Lehner wrote:
> Option 1: use w3m
[snip]

All great commands.  Thank you.

> The reason you lose formatting when saving from less(1) or w3m is that 
> these programs on purpose do not save the terminal control characters 
> which are doing the markup. Line breaks and terminal control are created 
> on demand, depending on the type and size of the terminal (window), and 
> will display differently (weirdly) when any of this differs from the 
> terminal in which you saved them to a file.

Yes I have noticed this.  I would like to be able to tell programs to keep the 
formatting, but they decide automatically on their own to remove it.  The 
automatic decision to keep or remove formatting based on terminal type is fine, 
but I find it very annoying that I cannot override this decision with many 
programs.  GNU's ls is an exception (with the --color option).  I would like to 
tell w3m or elinks to dump html and keep the formatting, which they cannot do 
(directly).  There are workarounds, but they take extra steps.
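
The behaviour wished for here is the one GNU ls offers: detect a terminal by default, but let the user override the decision. A minimal sketch of that pattern in shell follows. The function name and its always/never/auto argument are invented for the example and do not correspond to any w3m or elinks option.

```sh
# emphasize TEXT [always|never|auto]
# Print TEXT in bold if formatting is wanted, plain otherwise.
# In "auto" mode (the default) formatting is used only on a terminal.
emphasize() {
    mode=${2:-auto}
    if [ "$mode" = always ] || { [ "$mode" = auto ] && [ -t 1 ]; }; then
        printf '\033[1m%s\033[0m\n' "$1"    # SGR bold on, text, SGR reset
    else
        printf '%s\n' "$1"
    fi
}
```

With this, "emphasize word always > file" keeps the escape codes in the file, while "emphasize word > file" drops them because stdout is not a terminal.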

> The -s (--standalone) option for Pandoc is not required for man 
> page output.

Well it definitely is for me, meaning the version of Pandoc that I use: 
2.17.1.1-2~deb12u1 amd64



Re: [dev] reading an epub book with less: adventures in text processing

2024-03-11 Thread Greg Reagle
On Sat, Mar 9, 2024, at 11:33 AM, Hiltjo Posthuma wrote:
> Maybe mupdf/mutools or the Ghostscript tools or qpdf?

Yes, thank you for this excellent advice.  I tried "mutool convert", but I am 
more satisfied with pandoc's output, for both text and html output (from epub).



Re: [dev] reading an epub book with less: adventures in text processing

2024-03-09 Thread Georg Lehner

Hi Greg,

On 2024-03-09 15:34, Greg Reagle wrote:

> I have an epub ebook.  It is a novel, but when I get this process working, I 
> want to repeat it for any epub ebook.
> 
> I want to read it, with formatting (such as underline or italics), with less. 
>  I am happy to use any software that exists in the process, but I MUST use 
> less in the end to read it.  The terminal emulators that I use are usually 
> st, xterm, and termux.  All of them are capable of colored text and 
> underlining and so forth, and I want to take advantage of this.
> 
> Pandoc does a very good job converting epub to html, and it looks good with 
> w3m, however when I use w3m in a pipe, the output is truly *plain* text, 
> meaning there are no escape codes for formatting.  Same story with elinks.  
> Is it possible to get either of these programs, or some other program, to 
> dump html to text *with* escape codes?
> 
> Since I could not get HTML to work, I went with man format.  Amazing.  Pandoc 
> automatically chooses man format for output based on the '.1' extension in 
> the following:
> pandoc --standalone -o City_of_Truth-Morrow.1 City_of_Truth-Morrow.epub
> Remember to use standalone option or it won't work.  Then
> man --local-file --pager 'less -ir' City_of_Truth-Morrow.1
> It looks great!  (for text only on a terminal)  It has bold and underlined 
> text.  From there I can use less 's' command to save the formatted text to a 
> file.
> 
> There might be a better or more direct way of achieving this goal, but this is 
> what I figured out for now.  And the rationale is this:  I already know and 
> love less.  There is no good reason for me to learn the user interface of a 
> different program like an epub reader or an html reader to read a book that 
> does not have graphics, diagrams, pictures, and/or custom formatting.


Just modify your workflow slightly and you are good:

Option 1: use w3m

pandoc -s -t html City_of_Truth-Morrow.epub | w3m -T text/html

Option 2: use man/less

pandoc -t man City_of_Truth-Morrow.epub | man -l -

Option 3, save as html for future use:

pandoc -s  -o City_of_Truth-Morrow.html City_of_Truth-Morrow.epub

Saves your epub to html. Whenever you want to view it, use your favorite 
browser, i.e. w3m, with all its features.


Option 4: save as man:

pandoc -s -t man -o City_of_Truth-Morrow.man City_of_Truth-Morrow.epub

Whenever you view it, use: man -l City_of_Truth-Morrow.man
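
Options 2 and 4 can be combined into "convert once, then reuse". A sketch, assuming the book's name ends in .epub and that man supports -l as used above. The names man_name and epubread are made up for this example.

```sh
# man_name: derive the saved man-page name from the epub name.
man_name() {
    printf '%s\n' "${1%.epub}.man"
}

# epubread: convert on first use, then view the saved page.
epubread() {
    page=$(man_name "$1")
    if [ ! -e "$page" ]; then
        pandoc -s -t man -o "$page" "$1" || return 1
    fi
    man -l "$page"
}
```

Called as: epubread City_of_Truth-Morrow.epub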

- - -

Some notes:

The reason you lose formatting when saving from less(1) or w3m is that 
these programs on purpose do not save the terminal control characters 
which are doing the markup. Line breaks and terminal control are created 
on demand, depending on the type and size of the terminal (window), and 
will display differently (weirdly) when any of this differs from the 
terminal in which you saved them to a file.
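
To make the markup characters concrete: formatted man output traditionally encodes bold as a character, a backspace, and the same character again, and underline as an underscore, a backspace, and the character. Depending on the groff and man versions, ANSI escape sequences may be used instead. A sketch of a filter that reduces the backspace form to plain text, roughly what col -b does; strip_overstrike is a made-up name.

```sh
# strip_overstrike: drop every "character + backspace" pair from stdin,
# leaving only the final character of each overstrike.
# Hypothetical helper for illustration.
strip_overstrike() {
    bs=$(printf '\b')       # a literal backspace character
    sed "s/.${bs}//g"
}
```

This is also why a saved file can look odd in an editor: the backspaces are stored as ordinary bytes, and only programs like less turn them back into bold and underline.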


The -s (--standalone) option for Pandoc is not required for man 
page output. For html (and other formats) pandoc outputs only the 
content; the -s option wraps this into a complete document.


Best Regards,


  Georg




Re: [dev] reading an epub book with less: adventures in text processing

2024-03-09 Thread Greg Minshall
Greg,

thanks for this!

for some personal tastes/usage cases, this, using pandoc's `-t`
option, might be minor-ly simpler:

man --local-file --pager 'less -ir' \
<(pandoc --standalone -t man \
2015.31233.Arab-Geographers-Knowledge-Of-Southern-India.epub) | less


and, this deserves to be somewhere like fortune: "I already know and
love less.".  :)  maybe "fortune-mod-fles-pleh"?  :)

cheers, Greg




Re: [dev] reading an epub book with less: adventures in text processing

2024-03-09 Thread Hiltjo Posthuma
On Sat, Mar 09, 2024 at 09:34:12AM -0500, Greg Reagle wrote:
> I have an epub ebook.  It is a novel, but when I get this process working, I 
> want to repeat it for any epub ebook.
> 
> I want to read it, with formatting (such as underline or italics), with less. 
>  I am happy to use any software that exists in the process, but I MUST use 
> less in the end to read it.  The terminal emulators that I use are usually 
> st, xterm, and termux.  All of them are capable of colored text and 
> underlining and so forth, and I want to take advantage of this.
> 
> Pandoc does a very good job converting epub to html, and it looks good with 
> w3m, however when I use w3m in a pipe, the output is truly *plain* text, 
> meaning there are no escape codes for formatting.  Same story with elinks.  
> Is it possible to get either of these programs, or some other program, to 
> dump html to text *with* escape codes?
> 
> Since I could not get HTML to work, I went with man format.  Amazing.  Pandoc 
> automatically chooses man format for output based on the '.1' extension in 
> the following:
> pandoc --standalone -o City_of_Truth-Morrow.1 City_of_Truth-Morrow.epub
> Remember to use standalone option or it won't work.  Then
> man --local-file --pager 'less -ir' City_of_Truth-Morrow.1
> It looks great!  (for text only on a terminal)  It has bold and underlined 
> text.  From there I can use less 's' command to save the formatted text to a 
> file.
> 
> There might be a better or more direct way of achieving this goal, but this is 
> what I figured out for now.  And the rationale is this:  I already know and 
> love less.  There is no good reason for me to learn the user interface of a 
> different program like an epub reader or an html reader to read a book that 
> does not have graphics, diagrams, pictures, and/or custom formatting.
> 

Hi,

Maybe mupdf/mutools or the Ghostscript tools or qpdf?

-- 
Kind regards,
Hiltjo



[dev] reading an epub book with less: adventures in text processing

2024-03-09 Thread Greg Reagle
I have an epub ebook.  It is a novel, but when I get this process working, I 
want to repeat it for any epub ebook.

I want to read it, with formatting (such as underline or italics), with less.  
I am happy to use any software that exists in the process, but I MUST use less 
in the end to read it.  The terminal emulators that I use are usually st, 
xterm, and termux.  All of them are capable of colored text and underlining and 
so forth, and I want to take advantage of this.

Pandoc does a very good job converting epub to html, and it looks good with 
w3m, however when I use w3m in a pipe, the output is truly *plain* text, 
meaning there are no escape codes for formatting.  Same story with elinks.  Is 
it possible to get either of these programs, or some other program, to dump 
html to text *with* escape codes?

Since I could not get HTML to work, I went with man format.  Amazing.  Pandoc 
automatically chooses man format for output based on the '.1' extension in the 
following:
pandoc --standalone -o City_of_Truth-Morrow.1 City_of_Truth-Morrow.epub
Remember to use standalone option or it won't work.  Then
man --local-file --pager 'less -ir' City_of_Truth-Morrow.1
It looks great!  (for text only on a terminal)  It has bold and underlined 
text.  From there I can use less 's' command to save the formatted text to a 
file.

There might be a better or more direct way of achieving this goal, but this is 
what I figured out for now.  And the rationale is this:  I already know and 
love less.  There is no good reason for me to learn the user interface of a 
different program like an epub reader or an html reader to read a book that 
does not have graphics, diagrams, pictures, and/or custom formatting.