Hello all,

I am the author of the code that came under such severe criticism by Thomas, so 
I feel compelled to answer.

When I started this effort, no really usable CHM to PDF conversion utility 
was available on Linux. There were a few, yes, of which chm2pdf had the best 
"concept" in my opinion, but they were all very limited in what they could 
accomplish. The code you criticise was created in an effort to push things to 
a more usable state. As always, one cannot satisfy everyone...

The standard scenario I had in mind when I started was...uhmm...a user like 
myself: you could take the script, change the CHM2PDF_TEMP_WORK_DIR and 
CHM2PDF_TEMP_ORIG_DIR variables at the top to something useful and acceptable, 
then run it and (most of the time) be happy! You can read about the whole story 
here:

http://www.karakas-online.de/forum/viewtopic.php?t=10275

It was NOT my intention to spend 3 months pondering over security issues 
arising from the use of the tmp dir or whatever else. My understanding was 
that if tmp is not suitable, the user should change it to some directory in 
his home.
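
To illustrate what I mean (the variable names are the ones from the script, 
the paths are of course just examples, not the script's defaults):

  # Point the temporary directories somewhere under your own home
  # directory instead of /tmp. Example values only.
  import os

  CHM2PDF_TEMP_WORK_DIR = os.path.expanduser('~/chm2pdf/work')
  CHM2PDF_TEMP_ORIG_DIR = os.path.expanduser('~/chm2pdf/orig')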

It was also NOT a priority to make things as efficient as possible. We have to 
see if things work this way acceptably *first*. If they do, THEN we can 
optimize. I will not start thinking about how to optimize a search-and-replace 
procedure that takes a few minutes (at most) to run on modern computers before 
I have feedback from the "stakeholders" (the "community" in this case) that 
what I search and replace are the right things in the first place!

A good example is that of anchors, which Thomas thinks should be kept because 
of all the API documentation that uses them, while I, on the other hand, found 
it better to ignore them, as keeping them would too often lead to unusable 
results. Obviously we have here two sets of CHM files that behave differently 
with regard to anchors: a great many API docs that (I am inclined to believe) 
use anchors correctly, and a great many other CHMs that don't.

I would *never* throw away anchors without a reason. So if I did it, it was 
because I had to. Now if the community thinks we should keep them, I am open 
to suggestions on what to do when we keep them and we get one of those CHMs 
that don't use them correctly, or that cause trouble. To put it more vividly: 
I was *fed up* with seeing CHMs that would NOT make it through the conversion, 
due to the anchor thing!
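
To make clear what "ignoring anchors" means in practice, here is a sketch 
(not the actual code from the script, just an illustration in Python): a link 
like page.html#section3 is reduced to plain page.html before the conversion.

  import re

  def strip_anchor(href):
      # 'page.html#section3' -> 'page.html'
      return href.split('#', 1)[0]

  def strip_anchors_in_html(html):
      # Very simplified: rewrite every href="...#..." attribute so that
      # the part after '#' is dropped.
      return re.sub(r'href="([^"#]*)#[^"]*"',
                    lambda m: 'href="%s"' % m.group(1),
                    html)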

So we have to reach an agreement on "what is right" first, before we continue 
with "how to implement the right thing efficiently".

However, since I have a personal interest in data structures and algorithms 
myself (see my papers in 
http://www.karakas-online.de/myWork/computer_graphics.html), let me outline the 
general problem of this "search-and-replace" part of chm2pdf: 

We are given a set of filenames. Each file contains a certain number of links 
to other files in the set. We don't know in advance how many, or what the mean 
value is. *Every* filename could be mentioned, even more than once (maybe two, 
three, or even 10 times), inside a given file. We want to "correct" all those 
links to equivalents that we know in advance, i.e. we know that we have to 
change link A to link A', link B to link B' and so on, for known A, A', B, B' 
etc.
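
In code terms (a sketch only, with invented names), the known corrections 
form a mapping from old link to new link:

  # Hypothetical example: we know in advance that link A has to become A',
  # link B has to become B', and so on.
  corrections = {
      'A.html': 'A-corrected.html',
      'B.html': 'B-corrected.html',
      'C.html': 'C-corrected.html',
  }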

Suppose you have n filenames (i.e. n distinct files) and each file contains, 
on average, m links to other files. We have to make nm corrections - more or 
less, in some statistical sense. Now, if the distinct files represented 
distinct pages, we would have to make a number of corrections *of the order* 
of pm, where p is the number of pages and again m is the mean number of 
references (links) to other files from any given file in the collection. 
(This is not always the case: usually the files represent distinct 
subsections, in the DocBook sense, or some other *structural* entity - not a 
presentational one, as pages are.)
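
For concreteness: with, say, n = 500 files and an average of m = 20 links per 
file, that is about 500 * 20 = 10000 corrections that have to be made in any 
case.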

How do you solve this efficiently? The *least* number of actions you have to 
take is of the order of pm, or, more accurately, is *in the mean, exactly* 
(not just *of the order of*) nm. chm2pdf goes through each file (that's the n 
in nm) and, in order to find the m changes per file (m always understood in 
the mean sense here), it goes through the list of all *possible* changes, 
searches for each of them and does a replace if it finds a match. Now this is 
indeed of the order of n*n, because all *possible* changes are in an array of 
length n.
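
In pseudo-Python, the current strategy looks roughly like this (a sketch only, 
the real script differs in the details):

  def correct_links_current(filenames, corrections):
      # 'corrections' maps old link -> new link; it has about n entries.
      for name in filenames:                    # n files
          with open(name) as f:
              text = f.read()
          for old, new in corrections.items():  # n possible changes
              text = text.replace(old, new)     # search, replace on match
          with open(name, 'w') as f:
              f.write(text)

That is roughly n * n searches over the file contents, as described above.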

So we are talking about a reduction from n*n to nm, where m could also be 
large (even of the order of n itself, let's not forget this!). Actually, it 
boils down to finding the right m occurrences of the filenames - in each file! 
We could, for example, first search for all links inside a file, then, for 
every link we find, check if it is in the list of filenames and do the 
"replace" part. But you still have to match every link you find against all 
*possible* filenames, i.e. you still have to do n comparisons (inside the 
loop, for every one of the n filenames). That's again n*n (n squared) 
operations!
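
For concreteness, here is a sketch of that "links first" variant, assuming 
the links are plain href attributes and the known filenames sit in a plain 
list of (old, new) pairs (the names and the regular expression are invented 
for illustration):

  import re

  HREF_RE = re.compile(r'href="([^"]*)"')

  def correct_links_linkfirst(filenames, corrections):
      # 'corrections' is a list of (old_link, new_link) pairs, n entries.
      for name in filenames:                    # n files
          with open(name) as f:
              text = f.read()
          for link in HREF_RE.findall(text):    # about m links per file
              for old, new in corrections:      # linear scan: n comparisons
                  if link == old:
                      text = text.replace(old, new)
                      break
          with open(name, 'w') as f:
              f.write(text)

Whether that inner scan can be made cheaper - a dictionary lookup, a sorted 
list with binary search - is precisely the kind of thing I would like to 
discuss, see point 4) below.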

I am open to suggestions here, but I hope I have made my point that reducing 
the algorithmic complexity may not be as easy as Thomas suggests.

For all the above reasons, I suggest:

1) I will have a look at the patch for the tmp problem. *Please point me to a 
link to the diff!*
2) We reach an agreement on the solution of the tmp problem, i.e. either I 
apply the Debian patch, or we come to some other solution that is also 
acceptable for Debian.
3) We agree what is the right thing to do with anchors - and I (or we) 
implement it.
4) We discuss the complexity issue and either come to the conclusion that the 
current implementation is totally acceptable, or someone proposes a real 
improvement (but please don't try to reduce the running time from, say, 2 
minutes to 1.5 minutes with an algorithm that is 3 times as complex as the 
current one, OK? - code has to be maintainable and understandable as well!). 
If we get such a proposal and we decide it is worth implementing, we do it; 
otherwise we will understand that Thomas' criticism was exaggerated.
5) THEN, Debian can go ahead and include it.

In the meantime, you can still include it, but as a "tool for every user to try 
at his own risk, in his own home dir" - if you have such a collection of tools. 
Bear in mind that chm2pdf is very fast and very good actually - even at this 
stage - so it would be a pity to deprive your users of its services.

I hope that's an acceptable "roadmap" for all. :-)

-- 
Regards

Chris Karakas
http://www.karakas-online.de


