Re: [dev] reading an epub book with less: adventures in text processing

2024-03-11 Thread Viktor Grigorov


Rather late to the party, and I've already forgotten the initial email. 
Nevertheless, I'll name the program I use most: epub2txt.[0] It's not perfect, 
but compared to calibre's ebook-convert and everything else I found written in 
C on GitHub, Codeberg or GitLab, it's the best. A once-over with an editor 
capable of multiple selection and editing is the most I've had to do. Faulty 
output includes, say, only a single letter rather than a whole word being 
capitalised or wrapped in '\e[...m' and '\e[0m'.

Protip: run it with -w 0 to get 'natural' paragraphs.
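
If the stray '\e[...m' sequences are unwanted in a saved file, one option is 
to filter them out afterwards. The following is only a sketch: strip_sgr is a 
made-up helper name, not something epub2txt ships, and it removes only SGR 
(colour/bold) sequences, not other terminal escapes.

```shell
# strip_sgr: hypothetical helper, not part of epub2txt.
# Reads stdin and deletes ANSI SGR sequences of the form ESC [ ... m.
strip_sgr() {
    esc=$(printf '\033')               # a literal ESC character
    sed "s/${esc}\[[0-9;]*m//g"        # drop ESC[<digits and semicolons>m
}
```

Possible use, with book.epub as a placeholder file name:
epub2txt -w 0 book.epub | strip_sgr > book.txt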


[0] https://github.com/kevinboone/epub2txt2





Re: [dev] reading an epub book with less: adventures in text processing

2024-03-11 Thread Κρακ Άουτ
On 2024-03-11 17:44 Greg Reagle  wrote:

> Now my next question is, what is the tool that does the *best* job of
> turning a PDF book into a readable text document?  Via html or
> docbook or markdown or whatever--doesn't matter.  My previous
> experience trying things out to achieve this goal is that it's just
> not worth it.  The output always winds up un-readable.

I use pdftotext from poppler-utils. It does quite a good job.

This is my main pdf reader command:
```
pdftotext -layout -nopgbrk "$1" - | less -MS --use-color
```




Re: [dev] reading an epub book with less: adventures in text processing

2024-03-11 Thread Greg Reagle
On Sat, Mar 9, 2024, at 1:15 PM, Greg Minshall wrote:
> for some personal tastes/usage cases, this, using pandoc's `-t`
> option, might be minor-ly simpler:
> 
> man --local-file --pager 'less -ir' \
> <(pandoc --standalone -t man \
> 2015.31233.Arab-Geographers-Knowledge-Of-Southern-India.epub) | less
> 

Very cool command.  Good idea to use process substitution.  Here is another way 
of doing it:
pandoc --standalone -t man City_of_Truth-Morrow.epub | man /dev/stdin
but I don't know how portable /dev/stdin is.
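
On portability: as far as I know, /dev/stdin exists on Linux, the BSDs and 
macOS, but POSIX does not require it. A hedged sketch of a fallback follows. 
man_stdin is a made-up name, mktemp is itself not in POSIX, and the sketch 
assumes that man treats an argument containing a slash as a file name (man-db 
does).

```shell
# man_stdin: hypothetical helper; render roff from stdin with man.
man_stdin() {
    if [ -e /dev/stdin ]; then
        man /dev/stdin
    else
        # Fallback: spool stdin to a temporary file and give man the path.
        tmp=$(mktemp) || return 1
        cat > "$tmp"
        man "$tmp"
        status=$?
        rm -f "$tmp"
        return "$status"
    fi
}
```

Possible use:
pandoc --standalone -t man City_of_Truth-Morrow.epub | man_stdin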



[dev] Re: reading an epub book with less: adventures in text processing

2024-03-11 Thread Greg Reagle
I think I finally figured it out!  With help, of course, from my wise and 
helpful community.  Thanks!  And reading the man page for elinks. :>

for direct viewing in less:
pandoc -s -t html City_of_Truth-Morrow.epub | \
elinks -dump-color-mode 2 -force-html | less -ir

to make a file to keep, for repeated viewing in less:
pandoc -s -t html City_of_Truth-Morrow.epub | \
elinks -dump-color-mode 2 -force-html > City_of_Truth-Morrow-formatted.txt
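
Both variants can be folded into one shell function. This is just a sketch; 
epub_dump is a made-up name, and the pandoc and elinks options are exactly the 
ones used above, nothing more.

```shell
# epub_dump: hypothetical helper wrapping the two pipelines above.
#   epub_dump BOOK.epub        view the formatted text in less
#   epub_dump BOOK.epub OUT    write the formatted text to the file OUT
epub_dump() {
    if [ "$#" -lt 1 ]; then
        echo "usage: epub_dump BOOK.epub [OUT]" >&2
        return 2
    fi
    if [ "$#" -ge 2 ]; then
        pandoc -s -t html "$1" | elinks -dump-color-mode 2 -force-html > "$2"
    else
        pandoc -s -t html "$1" | elinks -dump-color-mode 2 -force-html | less -ir
    fi
}
```

Possible use:
epub_dump City_of_Truth-Morrow.epub City_of_Truth-Morrow-formatted.txt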

Now my next question is, what is the tool that does the *best* job of turning a 
PDF book into a readable text document?  Via html or docbook or markdown or 
whatever--doesn't matter.  My previous experience trying things out to achieve 
this goal is that it's just not worth it.  The output always winds up 
un-readable.



Re: [dev] Re: reading an epub book with less: adventures in text processing

2024-03-11 Thread Greg Reagle
On Sat, Mar 9, 2024, at 12:53 PM, LM wrote:
> You could try modifying sdlbook or bard.  It would be nice if either of these 
> offered keymapping functionality like some programming editors do.

Thank you for telling me about these two programs.  I had not heard of them.

https://github.com/rofl0r/SDLBook
https://github.com/festvox/bard



Re: [dev] reading an epub book with less: adventures in text processing

2024-03-11 Thread Greg Reagle
On Sat, Mar 9, 2024, at 4:06 PM, Georg Lehner wrote:
> Option 1: use w3m
[snip]

All great commands.  Thank you.

> The reason you lose formatting when saving from less(1) or w3m is that 
> these programs deliberately do not save the terminal control characters 
> that do the markup. Line breaks and terminal control sequences are created 
> on demand, depending on the type and size of the terminal (window), and 
> will display differently (weirdly) if any of these differ from those of the 
> terminal in which you saved (or would have saved) them to a file.

Yes I have noticed this.  I would like to be able to tell programs to keep the 
formatting, but they decide automatically on their own to remove it.  The 
automatic decision to keep or remove formatting based on terminal type is fine, 
but I find it very annoying that I cannot override this decision with many 
programs.  GNU's ls is an exception (with the --color option).  I would like to 
tell w3m or elinks to dump html and keep the formatting, which they cannot do 
(directly).  There are workarounds, but they involve extra steps.
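
To illustrate the kind of override I mean, here is a sketch of a tool that 
makes the automatic decision but lets the caller overrule it. The name emph 
and the COLOR variable are invented for this example; they only mimic what 
ls --color=always/never/auto does.

```shell
# emph: hypothetical helper; print its arguments, in bold when allowed.
# COLOR=always forces the escapes, COLOR=never suppresses them, and
# anything else falls back to the usual "is stdout a terminal?" test.
emph() {
    case "${COLOR:-auto}" in
        always) use=1 ;;
        never)  use=0 ;;
        *)      if [ -t 1 ]; then use=1; else use=0; fi ;;
    esac
    if [ "$use" -eq 1 ]; then
        printf '\033[1m%s\033[0m\n' "$*"
    else
        printf '%s\n' "$*"
    fi
}
```

With that, COLOR=always emph "Chapter 1" | less -R keeps the bold through the 
pipe, in the same way that ls --color=always | less -R keeps the colours.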

> The -s (--standalone) option for Pandoc is not required for man 
> page output.

Well, it definitely is for me, that is, with the version of Pandoc that I use: 
2.17.1.1-2~deb12u1 amd64



Re: [dev] reading an epub book with less: adventures in text processing

2024-03-11 Thread Greg Reagle
On Sat, Mar 9, 2024, at 11:33 AM, Hiltjo Posthuma wrote:
> Maybe mupdf/mutools or the Ghostscript tools or qpdf?

Yes, thank you for this excellent advice.  I tried "mutool convert", but I am 
more satisfied with pandoc's output, for both text and html output (from epub).



Re: [dev] [sbase] Defining scope of sbase and ubase

2024-03-11 Thread Roberto E. Vargas Caballero
Hi,

After reading the opinions of the people in this thread,
I think the best option is to merge the sbase and ubase
repositories and to have a mechanism in the build system
to select the set of tools to be included in the build.
The main drawback of this is that the build system will
be more complex than the one that we have now.

I am going to import the history of ubase into sbase and
push it to a new branch. Meanwhile, I am going to keep the
pending patches on the hackers mailing list frozen until
the migration is done. Afterwards I will ask the authors to
resend them adjusted to the new structure of the project.
Please raise your concerns if you think that something
else should be done.

More mails are expected in specific threads while we
progress with the migration.

Regards,