Re: Scaling issue

Alexander Burger Sun, 11 Apr 2010 04:34:57 -0700

On Sun, Apr 11, 2010 at 12:25:42PM +0200, Henrik Sarvell wrote:
> What's additionally needed is:
> 
> 1.) Calculating total count somehow without retrieving all articles.


If it is simply the count of all articles in the DB, you can get it
directly from a '+Key' or '+Ref' index. I don't quite remember the E/R
model, but I found this in an old mail:

   (class +Article +Entity)
   (rel aid       (+Key +Number))
   (rel title     (+Idx +String))
   (rel htmlUrl   (+Key +String))

With that, (count (tree 'aid '+Article)) or (count (tree 'htmlUrl
'+Article)) will give the number of all articles having the property
'aid' or 'htmlUrl'. This does not work via 'title', however, as an
'+Idx' index creates more than one tree node per object.

If you need separate counts (e.g. for groups of articles or according
to certain features), it might be necessary to build more indexes, or
simply to maintain the counts during import.
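
As a rough sketch of the second option, a counter could be kept on each
feed and bumped for every imported article. The '+Feed' class, its 'nm'
and 'cnt' relations and the function 'countArticle' are made up for
illustration, they are not part of the model above:

   (class +Feed +Entity)
   (rel nm  (+Key +String))         # Hypothetical feed name
   (rel cnt (+Number))              # Hypothetical running article count

   # Call once per imported article, inside the import's transaction
   (de countArticle (Feed)
      (put> Feed 'cnt (inc (or (get Feed 'cnt) 0))) )

The count for a feed would then simply be (get Feed 'cnt), without
touching any index.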


> 2.) Somehow sorting by date so I get say the 25 first articles.

This is also best done with a dedicated index, e.g.

   (rel dat (+Ref +Date))

in '+Article'. Then you could specify a reversed range (T . NIL) in a
Pilog query:

   (? (db dat +Article (T . NIL) @Article) (show @Article))

This will start with the newest article, and step backwards. It might
be even easier to specify a range of dates, say from today back to one
week ago. Then you could use 'collect'

   (collect 'dat '+Article (date) (- (date) 7))

or, as no article can be newer than today anyway, with 'T' as the
upper limit,

   (collect 'dat '+Article T (- (date) 7))
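
To get exactly the first 25 articles instead of a date range, the Pilog
query above could also be stepped manually with 'goal' and 'prove'. A
sketch, assuming the 'dat' index from above (the function name 'newest'
is made up):

   (de newest (N)
      (let (Q (goal '((db dat +Article (T . NIL) @Article)))  R NIL)
         (make
            (do N
               (NIL (setq R (prove Q)))      # Stop when no more results
               (link (cdr (assoc '@Article R))) ) ) ) )

(newest 25) should then return up to 25 articles, newest first, without
stepping through the rest of the index.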


> When searching for articles belonging to a certain feed containing a word in
> the content I now let the distributed indexes return all articles and then I
> simply use filter to get at the articles. And to do that I of course need to
> fetch all the articles in a certain feed, which works fine for most feeds
> but not Twitter as it now probably contains more than 10 000 articles.

I think that usually it should not be necessary to fetch all articles,
if you build a combined query with the 'select/3' predicate.
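
A sketch of such a query, under the assumption that everything is in a
single local DB, that '+Article' has a relation like

   (rel feed (+Ref +Link) NIL (+Feed))       # Hypothetical

and that the word is searched in 'title' (the '+Feed' class and the
function name 'feedArticles' are made up, too):

   (de feedArticles (Feed Word)
      (solve
         (quote
            @Feed Feed
            @Word Word
            (select (@Article)
               # Generator clauses: the indexes to search
               ((feed +Article @Feed) (title +Article @Word))
               # Filter clauses: each result must pass both
               (same @Feed @Article feed)
               (part @Word @Article title) ) )
         @Article ) )

'select/3' then decides by itself which index to step through, and the
filter clauses discard non-matching articles one by one instead of
collecting the whole feed first.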


> The only solution I can see to this is to simply store the feed -> article
> mapping remotely too, ie each word index server contains this info too for
> ...
> Then I could simply filter by feed remotely.

Not sure. But I feel that I would use distributed processing here only
if there is no other way, i.e. if the parallel index search with
'select/3' does not suffice.

Cheers,
- Alex