Re: [wikireader] Project Gutenberg (again)

2010-06-09 Thread Tom Bachmann
 Sure. We'll add it to our todo list. Please keep us posted as to your
 progress. This is super exciting work you're doing!

Well, there is not all that much more to say. I have been fixing minor 
glitches over the last few days. I converted all of gutenberg-de 
yesterday and it is working fine in the simulator. I'm currently 
converting all of the German and English ebooks of Project Gutenberg 
(about 25000 ebooks, which will yield about 3.5GB of .dat files). This will 
probably take all day or longer on my dual-core laptop.

When I return to Germany on Sunday (I study in the UK) I will finally 
order a wikireader to test this on real hardware.

Tom

___
Openmoko community mailing list
community@lists.openmoko.org
http://lists.openmoko.org/mailman/listinfo/community


Re: [wikireader] Project Gutenberg (again)

2010-06-08 Thread Tom Bachmann
Sean,

thanks for your quick reply.

 1) Is there a deep reason why boldface fonts are not implemented? I
 figure they are not really relevant for wikis, but would be nice for
 some of the books. Unless there is something that complicates the matter
 I'm not seeing, I think I will add them (should be straightforward to
 mimic the behaviour of italic fonts?).

 They are implemented. We just didn't include them to save space. (Font
 sets are super huge when you include all the unicode characters!)

 If you look at the function handle_data within
 http://github.com/wikireader/wikireader/blob/master/host-tools/offline-renderer/ArticleRenderer.py
 you'll see what I mean.


Hm. I thought I had convinced myself that the real problem was that only 
two bits are used to encode the font id, and they are already used up 
(default, italic, title, subtitle, and supplements [large files with 
all characters, I suppose] for default, title, subtitle). So adding 
boldface fonts to the wiki-app *does* seem to involve some non-trivial 
work. (I guess the advantage of splitting the fonts like this is that 
the small subset can be kept in memory all the time? The size of the 
font files themselves is on the order of megabytes, so it shouldn't matter, 
should it?)

 Sure we can do this. No problem! The font is getting more and more
 complex since we actually hand make many of the characters now.


That would be really awesome.




Re: [wikireader] Project Gutenberg (again)

2010-06-08 Thread Tom Bachmann
 This has nothing to do with the data structures. We cache the fonts
 into the SDRAM to speed up the entire system. Without this, WikiReader
 is too painfully slow (reading from the SD card caps out at around
 125kb/s.) Currently we use 32MB of SDRAM. This means we can hold a few
 font styles but we need to move to smaller size SDRAM for future
 productions for cost reasons. So we have to be super careful with how
 we handle fonts. It's quite a complex problem for us. Especially as we
 add more and more language support.
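
To put the quoted read speed in perspective, here is a back-of-the-envelope sketch in Python 3. It assumes that "125kb/s" means 125 kilobytes per second (if it means kilobits, everything is eight times slower), and the 2 MB font size is a hypothetical figure, not a measured one.

```python
# Assumption: "125kb/s" is read as 125 kilobytes per second.
SD_READ_BYTES_PER_S = 125 * 1000

def read_seconds(size_bytes, rate=SD_READ_BYTES_PER_S):
    """Time in seconds to read size_bytes from the SD card at the given rate."""
    return size_bytes / rate

# Hypothetical font file of 2 MB (illustrative only).
font_size = 2 * 1000 * 1000
print(read_seconds(font_size))  # 16.0
```

Under these assumptions, reloading a single font file from the card would take many seconds, which is consistent with the decision to cache fonts in SDRAM.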


I see. That's the kind of deep problem I'd rather leave to you experts. 
I'll just wait and see if you cook something up. Till then I can live 
without boldface.

Regards,
Tom



[wikireader] Project Gutenberg (again)

2010-06-07 Thread Tom Bachmann
Dear all,

after taking a rather longish break, I'm back working on my Project 
Gutenberg integration code. This message consists of three parts: the 
first quickly describes what it is all about, the second contains a 
number of technical questions, and the third talks about bugs in 
wiki-app.

All code is available at gitorious: 
git://gitorious.org/wikireader-ness/wikireader-ness2.git


What this is all about:

My idea is that, like Wikipedia, Project Gutenberg provides a large 
collection of free data that would be nice to carry in your pocket. So I 
have been working on extending the offline-renderer to also process ebooks 
in EPUB format.
To quickly see what this is about, try

make DESTDIR=image WORKDIR=work WIKI_FILE_PREFIX=wiki WIKI_LANGUAGE=en \
   WIKI_DIR_SUFFIX=guten EBOOK_FILES=ebook-samples VERBOSE=yes cleandirs \
   createdirs birc

There is some more functionality, which I can elaborate on if anyone is 
interested, but this is basically it.
You have to harvest the ebooks yourself, but I can provide scripts for 
Project Gutenberg, and also for Project Gutenberg-DE.


Technical Questions:

1) Is there a deep reason why boldface fonts are not implemented? I 
figure they are not really relevant for wikis, but would be nice for 
some of the books. Unless there is something that complicates the matter 
I'm not seeing, I think I will add them (should be straightforward to 
mimic the behaviour of italic fonts?).

2) Could you please add the characters U+2039 and U+203A ('SINGLE 
LEFT-POINTING ANGLE QUOTATION MARK' and right-pointing version) to the 
font? They are used quite often in some books and the box just looks 
ugly. Again I would do this myself but there seem to be a number of 
intermediate stages in font generation that I don't really understand.
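
For anyone who wants to check which characters in a book would show up as boxes, here is a minimal Python 3 sketch. The `missing_glyphs` helper and the `ascii_only` set are hypothetical stand-ins: the real coverage would have to be read from the wikireader font files, which this does not attempt.

```python
import unicodedata

def missing_glyphs(text, supported):
    """Return characters of text that are not in supported, in order of first use."""
    seen = []
    for ch in text:
        if ch not in supported and not ch.isspace() and ch not in seen:
            seen.append(ch)
    return seen

# Hypothetical coverage: pretend the font only contains printable ASCII.
ascii_only = {chr(c) for c in range(0x20, 0x7F)}

sample = "He said \u2039yes\u203a twice: \u2039yes\u203a."
for ch in missing_glyphs(sample, ascii_only):
    # Print the code point and its official Unicode name.
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>")))
```

Run over a whole ebook, a list like this would show which characters are worth requesting for the font.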

3) Is it possible that the English language image on 
dev.thewikireader.org is corrupted? When I try to extract it with 7z x 
enpedia.7z I get a cryptic Error: E_FAIL message. (I'm running the 
standard 7z of Debian testing, version 9.04 beta.)


Bugs in wiki-app:

I believe that in the course of writing and testing my extensions, I 
have fixed some minor bugs in the core wiki-app code. My changes are 
very small and isolated, so the maintainers of the main repository may 
wish to look at these files only.


Thanks,
Tom



Re: [wikireader] Rudimentary support for several wikis

2010-01-26 Thread Tom Bachmann


Thomas HOCEDEZ wrote:
 Have you seen this on the git  
 (http://wiki.github.com/wikireader/wikireader/structure-of-sd-card)
 

Nope. I take it most of my work was useless …



Re: [wikireader] Rudimentary support for several wikis

2010-01-26 Thread Tom Bachmann
Actually that fits my needs very well. My support for several wikis was 
a rudimentary hack at best, to enable my real goal: using the wikireader 
as an ebook reader. *That* code was almost trivial to port, and can be 
found at git://gitorious.org/wikireader-ness/wikireader-ness2.git. I'll 
keep pushing there (mainly for backup), so if anyone is interested in 
having the entire Project Gutenberg library in their pocket …



Re: [wikireader] Rudimentary support for several wikis

2010-01-21 Thread Tom Bachmann
Alright. The latest commit has a ChangeCollection script. Use it like this:

ChangeCollection.py --from=none --to=1 --prefix=/path/to/image/pedia \
   --dat-offset=${next free dat}

where ${next free dat} is the first unused number in the .dat namespace 
of the English wiki. This will take a long while (it has to decompress 
and recompress all articles!), but it is probably faster than 
re-rendering everything (on my laptop it takes about 40 seconds to patch 
1000 articles).

Next, copy the pedia.idx, pedia.pfx, pedia.fnd, pedia.hsh and pedia?.dat 
files of the English wiki to your image, renaming them to pedia0.idx, 
pedia0.pfx, pedia0.fnd, pedia0.hsh (the pedia?.dat files can keep their 
names). If you now boot my kernel, you should be able to switch between 
both wikis, as described in my first post.
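
If you are unsure what to pass as ${next free dat}, the following Python 3 sketch shows one way to find it. The `next_free_dat` helper is hypothetical and not part of the repository; it simply assumes the .dat files are named pedia0.dat, pedia1.dat and so on, as in the layout from my first post.

```python
import os
import re
import tempfile

def next_free_dat(image_dir, prefix="pedia"):
    """Return the smallest N for which <prefix>N.dat does not exist in image_dir."""
    pattern = re.compile(r"^%s(\d+)\.dat$" % re.escape(prefix))
    used = set()
    for name in os.listdir(image_dir):
        match = pattern.match(name)
        if match:
            used.add(int(match.group(1)))
    n = 0
    while n in used:
        n += 1
    return n

# Demo on a throwaway directory that mimics an image with three .dat files.
with tempfile.TemporaryDirectory() as image:
    for name in ("pedia0.dat", "pedia1.dat", "pedia2.dat", "pedia.idx"):
        open(os.path.join(image, name), "w").close()
    print(next_free_dat(image))  # prints 3
```

Note that if there is a gap in the numbering, this returns the gap rather than the number after the highest existing file.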

Please tell me if everything works as expected.

Thomas HOCEDEZ wrote:
 It would be awesome !
 
 I finished the French wiki last night; the upload is in progress. It will be 
 available on some mirrors before tonight.
 
 I'll post urls as soon as it is available.
 
 Thomas
 



Re: [wikireader] Rudimentary support for several wikis

2010-01-20 Thread Tom Bachmann
In light of the awfully long render times for complete wikis, I 
figure I should create a 'change collection number' script.

Thomas HOCEDEZ wrote:
 It would be awesome !
 
 I finished the French wiki last night; the upload is in progress. It will be 
 available on some mirrors before tonight.
 
 I'll post urls as soon as it is available.
 
 Thomas
 



[wikireader] Rudimentary support for several wikis

2010-01-19 Thread Tom Bachmann
I have now registered to the list, since my unregistered posts didn't
seem to come through and c...@thewikireader doesn't seem to respond. You
may receive this message more than once.

-------- Original Message --------
Subject: [wikireader] Rudimentary support for several wikis
Date: Sun, 17 Jan 2010 00:56:53 +
From: Tom Bachmann tb...@cam.ac.uk
To: community@lists.openmoko.org

Hello,

first of all, please CC me since I'm not registered to the list.

Over the last few days I have been hacking together rudimentary support
for displaying several collections of data (e.g. wikis of different
languages) on the wikireader. This code is not yet ready to be
incorporated into the main repository (I think), and furthermore I don't
actually know if it complies with your ideas of simplicity.

HOWEVER, I would be very grateful to anyone who can test the code. I
don't yet have a real wikireader (i.e. I have been developing this on
the simulator; I will get one after sorting out my budget...) and I'm
worried that there might be problems related to e.g. the scarcity of
memory on the reader (how much RAM does it have installed?).

Here is what I did: basically, articles are now identified by their
index and by their collection id (the highest four bits of the 32-bit
identifier). The .pfx, .fnd, .hsh and .idx files are replicated per
collection. The .dat files are just numbered consecutively (and
identified in the usual way). So if you have e.g. two collections, say
English and French Wikipedia, then your image layout may look like this:

pedia0.idx pedia0.hsh pedia0.pfx pedia0.fnd
pedia1.idx pedia1.hsh pedia1.pfx pedia1.fnd
pedia0.dat pedia1.dat pedia2.dat pedia3.dat pedia4.dat

You cannot tell which articles are in which .dat files (in principle
articles from several wikis could be mixed in one file), but in practice
we might have pedia0-2.dat corresponding to collection 0 (English
wiki) and pedia{3,4}.dat corresponding to collection 1 (French wiki).
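
To make the identifier scheme concrete, here is a minimal Python 3 sketch of packing a collection id into the highest four bits of a 32-bit identifier. The function and constant names are made up for illustration; the wiki-app itself is not written this way and its actual code may differ.

```python
COLL_BITS = 4                        # highest four bits hold the collection id
INDEX_BITS = 32 - COLL_BITS          # remaining 28 bits hold the article index
INDEX_MASK = (1 << INDEX_BITS) - 1

def pack_article_id(collection, index):
    """Combine a collection id and an article index into one 32-bit identifier."""
    if not 0 <= collection < (1 << COLL_BITS):
        raise ValueError("collection id does not fit in four bits")
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("article index does not fit in 28 bits")
    return (collection << INDEX_BITS) | index

def unpack_article_id(article_id):
    """Split a 32-bit identifier back into (collection id, article index)."""
    return article_id >> INDEX_BITS, article_id & INDEX_MASK

french_article = pack_article_id(1, 42)
print(hex(french_article))                # 0x1000002a
print(unpack_article_id(french_article))  # (1, 42)
```

With four bits for the collection, this layout allows up to 16 collections with up to 2^28 (about 268 million) article indices each.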

The searching functionality etc. is implemented in the wiki-app, but the
user interface is rather non-existent. As a hack for testing I'm statically
configuring the system to use two collections (identified 0 and 1) and I
added an invisible button to the upper right corner of the search menu
to switch between the collections (in the simulator you will see a
message). There seem to be some bugs in that button but it's really for
testing only.

In addition to implementing all that in the wiki-app, I modified the
render, index and combine programs. All take a new --coll-number
argument to identify the collection being worked on, and
ArticleRenderer.py has a new --dat-number argument to specify the .dat
file (--number only identifies the block for the .idx file).

The good news is, you can just re-use your primary collection (the one
identified by 0). The bad news is, all extra collections have to be
re-built. For a quick test, try

make  DESTDIR=image WORKDIR=work \
   XML_FILES=xml-file-samples/japanese_architects.xml \
   COLL_NUMBER=1 DAT_NUMBER=${first unused index in .dat} iprch


make  DESTDIR=image WORKDIR=work install

and then copy everything to your wikireader (or try sim4).

Again, it would be *greatly* appreciated if someone could build a large
second collection and try two real-life datasets on the wikireader.

All the code is on Gitorious (just because I am already registered there
but not yet on GitHub). To get it, do

git clone git://gitorious.org/wikireader-ness/wikireader-ness.git

Let me know what you think!

Thanks,
Tom
