Re: [dev] reading an epub book with less: adventures in text processing
Rather late to the party, and I've already forgotten the initial email. Nevertheless, here is the program I use most: epub2txt.[0]

It's not perfect, but compared to calibre's ebook-convert and everything else I found in C on GitHub, Codeberg, or GitLab, it's the best. A once-over with an editor capable of multiple selection and editing is the most I've had to do. Faulty output includes, say, only a single letter rather than a whole word being capitalised or wrapped in '\e[...m' and '\e[0m'.

Protip: run it with -w 0 to get 'natural' paragraphs.

[0] https://github.com/kevinboone/epub2txt2
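If the stray '\e[...m' sequences are not worth fixing by hand, one option is to drop them all and keep plain text. A minimal sketch, assuming that losing the remaining (correct) highlighting is acceptable; the function name strip_sgr is made up for this example:

```sh
# strip_sgr: read text on stdin and remove ANSI SGR sequences,
# i.e. ESC [ <digits and semicolons> m, such as \e[1m and \e[0m.
# Hypothetical helper, not part of epub2txt.
strip_sgr() {
    esc=$(printf '\033')              # a literal ESC character
    sed "s/${esc}\[[0-9;]*m//g"
}
```

Usage would look something like `epub2txt -w 0 book.epub | strip_sgr > book.txt`, where book.epub is a placeholder file name.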
Re: [dev] reading an epub book with less: adventures in text processing
On 2024-03-11 17:44 Greg Reagle wrote:
> Now my next question is, what is the tool that does the *best* job of
> turning a PDF book into a readable text document? Via html or
> docbook or markdown or whatever--doesn't matter. My previous
> experience trying things out to achieve this goal is that it's just
> not worth it. The output always winds up un-readable.

I use pdftotext from poppler-utils. It does quite a good job. This is my main pdf reader command:

```
pdftotext -layout -nopgbrk ${1@Q} - | less -MS --use-color
```
Re: [dev] reading an epub book with less: adventures in text processing
On Sat, Mar 9, 2024, at 1:15 PM, Greg Minshall wrote:
> for some personal tastes/usage cases, this, using pandoc's `-t`
> option, might be minor-ly simpler:
>
> man --local-file --pager 'less -ir' \
>     <(pandoc --standalone -t man \
>     2015.31233.Arab-Geographers-Knowledge-Of-Southern-India.epub) | less
>

Very cool command. Good idea to use process substitution. Here is another way of doing it:

    pandoc --standalone -t man City_of_Truth-Morrow.epub | man /dev/stdin

but I don't know how portable /dev/stdin is.
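On the portability question: as far as I know /dev/stdin exists on Linux, the BSDs and macOS, but POSIX does not promise it. A sketch of a fallback that spools the input to a temporary file when /dev/stdin is missing; the name with_stdin_file is invented here, and it assumes mktemp is available:

```sh
# with_stdin_file CMD [ARGS...]
# Run CMD with one extra file argument that holds whatever arrives on stdin.
# Uses /dev/stdin where it exists, otherwise a temporary file.
# Hypothetical helper, written for POSIX sh.
with_stdin_file() {
    if [ -r /dev/stdin ]; then
        "$@" /dev/stdin
    else
        tmp=$(mktemp) || return 1
        cat >"$tmp"                   # spool stdin to the temp file
        "$@" "$tmp"
        status=$?
        rm -f "$tmp"
        return "$status"
    fi
}
```

With that, the pipeline becomes `pandoc --standalone -t man City_of_Truth-Morrow.epub | with_stdin_file man`. This relies on man treating an argument that contains a slash as a file name, which is my understanding of man-db's behaviour; other man implementations may differ.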
Re: [dev] reading an epub book with less: adventures in text processing
On Sat, Mar 9, 2024, at 4:06 PM, Georg Lehner wrote:
> Option 1: use w3m
[snip]

All great commands. Thank you.

> The reason you lose formatting when saving from less(1) or w3m is, that
> these programs on purpose do not save the terminal control characters
> which are doing the markup. Line breaks and terminal control are created
> on demand, depending on the type and size of the terminal (window) and
> will display different (weird) when any of this is different from the
> terminal you (would have) saved them to a file.

Yes, I have noticed this. I would like to be able to tell programs to keep the formatting, but they decide on their own to remove it. Deciding automatically, based on terminal type, whether to keep or remove formatting is fine, but I find it very annoying that many programs give me no way to override that decision. GNU's ls is an exception (with the --color option). I would like to tell w3m or elinks to dump html and keep the formatting, which they cannot do (directly). There are workarounds, but they add extra steps.

> The -s option (--standalone) option for Pandoc is not required for man
> page output.

Well, it definitely is for me, meaning for the version of Pandoc that I use: 2.17.1.1-2~deb12u1 amd64
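To make the "automatic decision plus override" idea concrete, here is a small sketch of how a program can combine the usual "is stdout a terminal?" test with an explicit switch, modeled loosely on the auto/always/never values of GNU ls --color. The function emph and its mode argument are hypothetical and are not an option of w3m or elinks:

```sh
# emph MODE TEXT...
# Print TEXT underlined (SGR 4) or plain, depending on MODE:
#   always  emit escape codes unconditionally
#   never   never emit escape codes
#   auto    emit them only when stdout is a terminal
emph() {
    mode=${1:-auto}
    [ "$#" -gt 0 ] && shift
    case $mode in
        always) use=1 ;;
        never)  use=0 ;;
        *)      if [ -t 1 ]; then use=1; else use=0; fi ;;
    esac
    if [ "$use" -eq 1 ]; then
        printf '\033[4m%s\033[0m\n' "$*"
    else
        printf '%s\n' "$*"
    fi
}
```

Something like `emph always 'City of Truth' | less -R` keeps the underline through the pipe, while `emph auto 'City of Truth' | less -R` shows plain text, because the pipe is not a terminal.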
Re: [dev] reading an epub book with less: adventures in text processing
On Sat, Mar 9, 2024, at 11:33 AM, Hiltjo Posthuma wrote:
> Maybe mupdf/mutools or the Ghostscript tools or qpdf?

Yes, thank you for this excellent advice. I tried "mutool convert", but I am more satisfied with pandoc's output, for both text and html output (from epub).
Re: [dev] reading an epub book with less: adventures in text processing
Hi Greg,

On 2024-03-09 15:34, Greg Reagle wrote:
> I have an epub ebook. It is a novel, but when I get this process working, I
> want to repeat it for any epub ebook.
>
> I want to read it, with formatting (such as underline or italics), with less.
> I am happy to use any software that exists in the process, but I MUST use
> less in the end to read it. The terminal emulators that I use are usually
> st, xterm, and termux. All of them are capable of colored text and
> underlining and so forth, and I want to take advantage of this.
>
> Pandoc does a very good job converting epub to html, and it looks good with
> w3m, however when I use w3m in a pipe, the output is truly *plain* text,
> meaning there are no escape codes for formatting. Same story with elinks.
> Is it possible to get either of these programs, or some other program, to
> dump html to text *with* escape codes?
>
> Since I could not get HTML to work, I went with man format. Amazing. Pandoc
> automatically chooses man format for output based on the '.1' extension in
> the following:
>     pandoc --standalone -o City_of_Truth-Morrow.1 City_of_Truth-Morrow.epub
> Remember to use the standalone option or it won't work. Then
>     man --local-file --pager 'less -ir' City_of_Truth-Morrow.1
> It looks great! (for text only on a terminal) It has bold and underlined
> text. From there I can use the less 's' command to save the formatted text
> to a file.
>
> There might be a better or more direct way of achieving this goal, but this is
> what I figured out for now. And the rationale is this: I already know and
> love less. There is no good reason for me to learn the user interface of a
> different program like an epub reader or an html reader to read a book that
> does not have graphics, diagrams, pictures, and/or custom formatting.
Just modify your workflow slightly and you are good:

Option 1: use w3m

    pandoc -s -t html City_of_Truth-Morrow.epub | w3m -T text/html

Option 2: use man/less

    pandoc -t man City_of_Truth-Morrow.epub | man -l -

Option 3: save as html for future use

    pandoc -s -o City_of_Truth-Morrow.html City_of_Truth-Morrow.epub

This saves your epub as html. Whenever you want to view it, use your favorite browser, e.g. w3m, with all its features.

Option 4: save as man

    pandoc -s -t man -o City_of_Truth-Morrow.man City_of_Truth-Morrow.epub

Whenever you want to view it, use:

    man -l City_of_Truth-Morrow.man

- - -

Some notes:

The reason you lose formatting when saving from less(1) or w3m is that these programs deliberately do not save the terminal control characters that do the markup. Line breaks and terminal control sequences are created on demand, depending on the type and size of the terminal (window), and will display differently (weirdly) when either differs from the terminal in which you saved them to a file.

The -s (--standalone) option for Pandoc is not required for man page output. For html (and other formats) pandoc outputs only the content; the -s option wraps this into a complete document.

Best Regards,

Georg
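Following up on the note about saved output depending on the terminal: with man-db, the MAN_KEEP_FORMATTING and MANWIDTH environment variables are documented as a way to keep the formatting characters when output goes to a file and to pin the line width. A sketch under those assumptions (man-db and pandoc installed; the function name epub_save_formatted is invented for this example, and --standalone is included because it was reported as necessary earlier in this thread):

```sh
# epub_save_formatted IN.epub OUT.txt
# Convert IN.epub to man format with pandoc, render it with man at a fixed
# width, and save the formatted result (control characters included).
epub_save_formatted() {
    in=$1
    out=$2
    if [ ! -r "$in" ]; then
        printf 'epub_save_formatted: cannot read %s\n' "$in" >&2
        return 1
    fi
    # MAN_KEEP_FORMATTING: keep bold/underline even though stdout is a file.
    # MANWIDTH: fix the width so line breaks do not depend on the terminal.
    pandoc --standalone -t man "$in" |
        MAN_KEEP_FORMATTING=1 MANWIDTH="${MANWIDTH:-80}" \
            man --local-file - >"$out"
}
```

Afterwards `less -R City_of_Truth-Morrow.txt` should show the saved file with bold and underline, at the width that was chosen when saving rather than the width of the current window.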
Re: [dev] reading an epub book with less: adventures in text processing
Greg,

thanks for this!

for some personal tastes/usage cases, this, using pandoc's `-t` option, might be minor-ly simpler:

    man --local-file --pager 'less -ir' \
        <(pandoc --standalone -t man \
        2015.31233.Arab-Geographers-Knowledge-Of-Southern-India.epub) | less

and, this deserves to be somewhere like fortune: "I already know and love less.". :) maybe "fortune-mod-fles-pleh"? :)

cheers, Greg
Re: [dev] reading an epub book with less: adventures in text processing
On Sat, Mar 09, 2024 at 09:34:12AM -0500, Greg Reagle wrote:
> I have an epub ebook. It is a novel, but when I get this process working, I
> want to repeat it for any epub ebook.
>
> I want to read it, with formatting (such as underline or italics), with less.
> I am happy to use any software that exists in the process, but I MUST use
> less in the end to read it. The terminal emulators that I use are usually
> st, xterm, and termux. All of them are capable of colored text and
> underlining and so forth, and I want to take advantage of this.
>
> Pandoc does a very good job converting epub to html, and it looks good with
> w3m, however when I use w3m in a pipe, the output is truly *plain* text,
> meaning there are no escape codes for formatting. Same story with elinks.
> Is it possible to get either of these programs, or some other program, to
> dump html to text *with* escape codes?
>
> Since I could not get HTML to work, I went with man format. Amazing. Pandoc
> automatically chooses man format for output based on the '.1' extension in
> the following:
>     pandoc --standalone -o City_of_Truth-Morrow.1 City_of_Truth-Morrow.epub
> Remember to use the standalone option or it won't work. Then
>     man --local-file --pager 'less -ir' City_of_Truth-Morrow.1
> It looks great! (for text only on a terminal) It has bold and underlined
> text. From there I can use the less 's' command to save the formatted text
> to a file.
>
> There might be a better or more direct way of achieving this goal, but this is
> what I figured out for now. And the rationale is this: I already know and
> love less. There is no good reason for me to learn the user interface of a
> different program like an epub reader or an html reader to read a book that
> does not have graphics, diagrams, pictures, and/or custom formatting.
>

Hi,

Maybe mupdf/mutools or the Ghostscript tools or qpdf?

--
Kind regards,
Hiltjo
[dev] reading an epub book with less: adventures in text processing
I have an epub ebook. It is a novel, but when I get this process working, I want to repeat it for any epub ebook.

I want to read it, with formatting (such as underline or italics), with less. I am happy to use any software that exists in the process, but I MUST use less in the end to read it. The terminal emulators that I use are usually st, xterm, and termux. All of them are capable of colored text and underlining and so forth, and I want to take advantage of this.

Pandoc does a very good job converting epub to html, and it looks good with w3m, however when I use w3m in a pipe, the output is truly *plain* text, meaning there are no escape codes for formatting. Same story with elinks. Is it possible to get either of these programs, or some other program, to dump html to text *with* escape codes?

Since I could not get HTML to work, I went with man format. Amazing. Pandoc automatically chooses man format for output based on the '.1' extension in the following:

    pandoc --standalone -o City_of_Truth-Morrow.1 City_of_Truth-Morrow.epub

Remember to use the standalone option or it won't work. Then

    man --local-file --pager 'less -ir' City_of_Truth-Morrow.1

It looks great! (for text only on a terminal) It has bold and underlined text. From there I can use the less 's' command to save the formatted text to a file.

There might be a better or more direct way of achieving this goal, but this is what I figured out for now. And the rationale is this: I already know and love less. There is no good reason for me to learn the user interface of a different program like an epub reader or an html reader to read a book that does not have graphics, diagrams, pictures, and/or custom formatting.