> Currently vizreader.com contains roughly 350 000 articles with a full
> word index (not partial).
> The word index is spread out on "virtual remotes", i.e. they are not
> really on remote machines, it's more a way to split up the physical
> database files on disk (I've written on how that is done on
> picolisp.com). I have no way of knowing how many words are mapped to
> their articles like this but most of the database is occupied by these
> indexes and it currently occupies some 30GB all in all.
> A search for the word "Google" just took 22 seconds.
If I understand correctly, you have all the articles locally on one
machine. I wonder how long a simple grep over the article blobs would
take? 22 seconds seems very long for any serious use. Have you
considered some state-of-the-art full text search engine, e.g. Lucene?
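
To make the grep comparison concrete, here is a minimal sketch of a brute-force scan that could be timed against the index lookup. It assumes the article texts can be loaded as plain strings keyed by id; the names scan_articles and articles and the sample data are hypothetical, not anything from vizreader.

```python
import re
import time

def scan_articles(articles, word):
    """Return ids of articles containing `word` as a whole word (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [art_id for art_id, text in articles.items() if pattern.search(text)]

# Hypothetical sample data standing in for the article blobs.
articles = {
    1: "Google released a new reader.",
    2: "Sheep graze in the field.",
    3: "I googled it; google knows.",
}

start = time.perf_counter()
hits = scan_articles(articles, "Google")
elapsed = time.perf_counter() - start
print(hits, f"{elapsed:.6f}s")
```

A linear scan like this is only a baseline for comparison. Its cost grows with the total size of the text, so it would not replace an index.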
Just curious, how did you create the word index? I implemented a simple
search functionality and word index for LogandCMS which you can try at
http://demo.cms.logand.com/search.html?s=sheep and I even keep the count
of every word in each page for ranking purposes but I haven't had a
chance to run into scaling problems like that.
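
For illustration, the idea of a word index with per-page counts can be sketched roughly as follows. This is not the LogandCMS code, just a minimal in-memory version in Python; tokenize, build_index and search are hypothetical names, and the tokenizer is an assumption (lowercased runs of ASCII letters and digits).

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase words (assumed: ASCII letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to {page_id: number of occurrences on that page}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][page_id] = count
    return index

def search(index, word):
    """Return page ids containing `word`, highest count first, ties by id."""
    postings = index.get(word.lower(), {})
    return sorted(postings, key=lambda page_id: (-postings[page_id], page_id))

# Hypothetical sample pages.
pages = {
    "a": "Sheep and more sheep",
    "b": "One sheep, one goat",
    "c": "No animals here",
}
index = build_index(pages)
print(search(index, "sheep"))
```

Ranking by raw count is the simplest choice. It favours long pages, so a real ranking would probably normalise by page length.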
> No other part of the application is lagging significantly except for
> when listing new articles in my news category due to the fact that
> there are so many articles in that category. However the fetching
> method is highly inefficient as I first fetch all feeds in a category
> and then all their articles and then take (tail) on them to get the 50
> newest for instance. Walking and then only loading the wanted articles
> to memory would of course be the best way and something I will look
> Why don't you try out the application yourself now that you know how
> big the database is and so on, if you use Google Reader you can just
> export your subscriptions as an OPML and import it into VizReader.
I tried it and it looks interesting. The feature I would actually want
from such a system is a way of specifying and extracting the interesting
content from the harvested feeds and the links their articles point to,
e.g. using an XPath expression. The result could then either be
published as a per-user feed or sent as email(s) so I could use my usual
mail client to read the news.
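
A rough sketch of what I mean by selecting content with an XPath expression, using Python's standard xml.etree.ElementTree. It assumes the input is well-formed XML (real web pages usually are not) and relies only on the limited XPath subset that ElementTree supports. The extract function and the sample feed are made up for the example.

```python
import xml.etree.ElementTree as ET

def extract(xml_text, xpath):
    """Return the stripped text of every element matched by `xpath`."""
    root = ET.fromstring(xml_text)
    return [(el.text or "").strip() for el in root.findall(xpath)]

# Hypothetical feed; a real one would be fetched over HTTP.
FEED = """<rss><channel>
<item><title>First post</title><category>news</category></item>
<item><title>Second post</title><category>sport</category></item>
</channel></rss>"""

# Per-user expression: titles of items whose category is "news".
titles = extract(FEED, "./channel/item[category='news']/title")
print(titles)
```

The expression would be stored per user and per feed, and the matched content could then go into a generated feed or an email body.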