Hi Tomas.

1.) This is what each "remote" looks like in E/R terms:

(class +WordCount +Entity)
(rel article   (+Ref +Number))
(rel word      (+Aux +Ref +Number) (article))
(rel count     (+Number))
(rel picoStamp (+Ref +Number))

(dbs
  (4 +WordCount)
  (3 (+WordCount word article picoStamp)))
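
To make that a bit more concrete, here is a minimal sketch (not the actual
VizReader code) of how a lookup on one such "remote" could look, assuming
the E/R above is loaded, the database has been opened with 'pool', and the
word has already been mapped to a number W by a dictionary kept elsewhere:

(de articlesFor (W)
   # 'collect' on the +Aux index returns all +WordCount objects for W,
   # here reduced to (article . count) pairs for later ranking
   (mapcar
      '((Wc) (cons (; Wc article) (; Wc count)))
      (collect 'word '+WordCount W) ) )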

The bottleneck lies elsewhere than in the actual lookup. Here are some
results I just got, probably while using the application all by myself:

"picolisp" => 1.97 s
"google" => 7.22 s
"obama" => 1.64 s (cached from prior search in RAM maybe?)
"afghanistan" => 7.2 s

Note the difference between "google" and "picolisp": the search is performed
in exactly the same way, the only difference being that the system needs to
do post-processing after the results have been fetched, and that is more
work for the "google" search since it returns the maximum of 50 results
whereas "picolisp" only returns 8. So the bottleneck is not the search
itself but rather badly optimized code that goes to work on the results
afterwards.
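
If anyone wants to verify that, the two phases are easy to time separately,
roughly like this (just a sketch; 'articlesFor' is the hypothetical lookup
above and 'postProcess' stands for whatever the code does with the hits
afterwards):

(de timed (Fun . @)
   # Print the wall clock time of a single call, in seconds
   (let U (usec)
      (prog1 (pass Fun)
         (prinl (format (*/ (- (usec) U) 1000) 3) " sec") ) ) )

# (timed articlesFor W)      # the index lookup alone
# (timed postProcess Hits)   # the later work on the results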

> a way of extracting and specifying the interesting content from the
> harvested feeds and links their articles point to
>

Well, the links you should be able to see in a per-feed/category link map (I
noticed it was broken; hopefully it will work from now on). As for
specifying content through an XPath expression, what is it that you hope to
gain by that? Please give me a specific example.

My main reason for creating the reader is that Google Reader's GUI is
horrible IMO, and I'm happy with that part of VizReader. That, and I thought
it would be an easy thing to start out with in PicoLisp, but there is more
to a feed reader than meets the eye... If I had thought about making the
application distributed right from the start I would have been even happier.

In the beginning I also had an algorithm that compared articles for
automatic recommendations of similar content; that worked for a short time.
If I were to run it now, it would take roughly one year to compare all
articles with each other. At one point I only let it compare a random
subset, but that (predictably) resulted in random quality too :-)
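
Just to put a rough number on that (back-of-the-envelope only, based on the
~350 000 articles mentioned below):

(let (N 350000  Pairs (/ (* N (- N 1)) 2))
   (prinl "Article pairs: " Pairs)                                # 61249825000
   (prinl "Pairs/sec needed for one year: " (/ Pairs (* 365 24 60 60))) )   # ~1942

i.e. the comparison itself would have to chew through roughly two thousand
article pairs per second non-stop to finish within a year.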

Also, a lot of the finesse of the application is lost if you're not a
Twitter user. The majority of the time I spend in it is simply checking my
flow from time to time, and most of that flow consists of Twitter posts,
since few "normal" feeds have implemented the pubsub protocol yet.

Cheers,
Henrik Sarvell


On Tue, Jul 20, 2010 at 7:45 PM, Tomas Hlavaty <t...@logand.com> wrote:
> Hi Henrik,
>
>> Currently vizreader.com contains roughly 350 000 articles with a full
>> word index (not partial).
>>
>> The word index is spread out on "virtual remotes" ie they are not
>> really on remote machines, it's more a way to split up the physical
>> database files on disk (I've written on how that is done on
>> picolisp.com). I have no way of knowing how many words are mapped to
>> their articles like this but most of the database is occupied by these
>> indexes and it currently occupies some 30GB all in all.
>>
>> A search for the word "Google" just took 22 seconds.
>
> if I understand it well, you have all the articles locally on one
> machine.  I wonder how long a simple grep over the article blobs would
> take?  22 seconds seems very long for any serious use.  Have you
> considered some state-of-the-art full text search engine, e.g. Lucene?
>
> Just curious, how did you create the word index?  I implemented a simple
> search functionality and word index for LogandCMS which you can try as
> http://demo.cms.logand.com/search.html?s=sheep and I even keep the count
> of every word in each page for ranking purposes but I haven't had a
> chance to run into scaling problems like that.
>
>> No other part of the application is lagging significantly except for
>> when listing new articles in my news category due to the fact that
>> there are so many articles in that category. However the fetching
>> method is highly inefficient as I first fetch all feeds in a category
>> and then all their articles and then take (tail) on them to get the 50
>> newest for instance. Walking and then only loading the wanted articles
>> to memory would of course be the best way and something I will look
>> into.
>>
>> Why don't you try out the application yourself now that you know how
>> big the database is and so on, if you use Google Reader you can just
>> export your subscriptions as an OPML and import it into VizReader.
>
> I tried it and it looks interesting.  What feature I would actually want
> from such a system is a way of extracting and specifying the interesting
> content from the harvested feeds and links their articles point to,
> e.g. using an xpath expression.  Then, either publishing it as per user
> feed or sending that as email(s) so I could use my usual mail client to
> read the news.
>
> Cheers,
>
> Tomas
