Thanks a lot for this very helpful answer.
I implemented a solution very similar to yours and it runs, but I have
two big problems.
The first one is throughput.
With a periodic timer of one minute, I can only parse 20 feeds (the
number of threads) per minute, which comes to 1,200 per hour (since I
want to parse each feed at least once an hour). The problem is that I
really need to handle at least 10 times that number of feeds... and
probably closer to 100k! What if I increase the number of threads?
Will I be able to parse more feeds?
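A back-of-envelope sketch of the throughput question (the numbers here are illustrative assumptions, not measurements): with a pool of N threads and an average parse time of t seconds per feed, the pool itself can sustain roughly N * 3600 / t feeds per hour, so the 1,200/hour figure corresponds to a 60-second average per feed at 20 threads.

```ruby
# Rough capacity estimate for a thread pool (illustrative, not measured):
# pool_size threads, each feed taking avg_parse_seconds on average.
def feeds_per_hour(pool_size, avg_parse_seconds)
  (pool_size * 3600.0 / avg_parse_seconds).floor
end

feeds_per_hour(20, 60)   # matches the 1,200/hour observed above
feeds_per_hour(100, 5)   # a larger pool with faster parses
```

If parses average well under a minute, the bottleneck may be the one-batch-per-tick scheduling rather than the pool size itself.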
The second one is actually a lot worse. My system had been running for
a little more than a day without monitoring, and this morning
everything was "down". I ran "ps aux" and here is what I got:
USER PID   %CPU %MEM VSZ     RSS     TTY STAT START TIME  COMMAND
root 21697 0.0  0.8  32524   15620   ?   D    Apr27 0:13  ruby /mnt/app/current/script/backgroundrb start -e production
root 21698 0.0  0.2  32504   4736    ?   D    Apr27 0:08  ruby log_worker
root 21699 1.1  90.5 2170872 1576364 ?   D    Apr27 25:58 ruby parser_worker
As you can see, my parser_worker is consuming a little over 1.5 GB of
RAM: wayyyy too much ;) It seems the variables are not being released
in my worker? Any idea what's wrong?
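For what it's worth, a guess rather than a diagnosis: a very common cause of unbounded growth in a long-running Ruby worker is a long-lived collection (an instance variable or closure-captured array) that keeps a reference to every parsed result, so the GC can never reclaim them. A hypothetical illustration (class names are mine, not from backgroundrb):

```ruby
# Hypothetical illustration of a common leak pattern in long-running
# workers: results appended to a long-lived collection are retained
# forever, so memory grows on every parse.
class LeakyWorker
  def initialize
    @seen = []             # lives as long as the worker process
  end

  def parse(feed_body)
    @seen << feed_body     # leak: every parsed body stays referenced
    feed_body.length
  end
end

# The fix is simply to let parsed data go out of scope after use.
class FixedWorker
  def parse(feed_body)
    feed_body.length       # nothing retained between calls
  end
end
```

Worth checking whether your worker accumulates parsed feed data (or ActiveRecord objects) in an instance variable across timer ticks.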
Thanks a lot once again for your help!
Best,
On 4/25/08, Stevie Clifton <[EMAIL PROTECTED]> wrote:
> Hey Julien,
>
> It sounds like you are planning on using one "long running" feed
> parsing loop with a do...while. This is exactly the sort of thing you
> want to avoid in new bdrb, especially if you know you want to do
> something at discrete time periods--it totally goes against the
> twisted paradigm. After thinking about it for a bit, I would
> recommend setting just one periodic_timer for every minute, and then
> determining in your parse_feeds method which feeds need to be parsed.
> If I were you, I wouldn't use last_updated to determine when to parse
> your feeds -- it adds unnecessary complexity to your system. You can
> of course save that value for reference, but it's not necessary for
> your requirements.
>
> In your db you could have a field for every feed called "interval" that
> would determine the minute intervals to parse the feeds. Then every
> minute when parse_feed gets called, you could parse every feed with an
> interval of "1", and then determine based on the current minute in the
> hour whether or not to try to parse the 15, 30, or 60 minute feeds.
> And you'll of course want to use thread_pool.defer. So, using Paul's
> code as a starting point, something like this:
>
> def parse_feeds
>   feeds = Feed.find_feeds_to_process
>   feeds.each do |feed|
>     thread_pool.defer do
>       feed.parse
>     end
>   end
> end
>
>
> class Feed
>   def self.find_feeds_to_process
>     feeds = []
>     [1, 15, 30, 60].each do |interval|
>       feeds.concat(Feed.find_all_by_interval(interval)) if Time.now.min % interval == 0
>     end
>     feeds
>   end
>
>   def parse
>     # parsing code
>   end
> end
>
> On my way home yesterday I thought of another sexy addition you could
> add to this. In the above code, you know that you'll be parsing
> _every_ feed in your db on the hour, which isn't a very efficient
> setup. If possible, you want to set it up so that you have even
> parsing distribution throughout the hour so you're not getting
> hammered. You could add a pretty simple heuristic that would give you
> a relatively even distribution across the hour by using the hash of
> the feed url. Along with the url and the interval, save an "offset"
> value like this example:
>
> feed = Feed.new
> feed.url = 'my_feed_url'
> feed.interval = 15
> feed.offset = feed.url.hash % 60
> feed.save
>
> Then in find_feeds_to_process, you can do this (untested):
>
> # the select returns any feed whose interval offset matches the
> # current minute's offset for the same interval
> def self.find_feeds_to_process
>   Feed.find(:all).select do |feed|
>     [15, 30, 60].detect { |interval| feed.offset % interval == Time.now.min % interval }
>   end
> end
>
> Doing a Feed.find(:all) is probably not the best idea if you have a
> ton of records, so you might want to do multiple db finds to get the
> same results.
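The offset test in the select above can be pulled out into a standalone predicate (the name `due_now?` is mine), which also makes it easy to push down into one per-interval database query instead of loading every record with `Feed.find(:all)`:

```ruby
# A feed is due when its stored offset lines up with the current
# minute, modulo its parsing interval.
def due_now?(offset, interval, minute)
  offset % interval == minute % interval
end

# Sketch of the per-interval query version (untested, old-style
# ActiveRecord finder syntax):
#   [15, 30, 60].map do |interval|
#     Feed.find(:all, :conditions => ["interval = ? AND offset % ? = ?",
#               interval, interval, Time.now.min % interval])
#   end.flatten
```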
>
> stevie
>
>
> On Wed, Apr 23, 2008 at 5:46 PM, Julien Genestoux <[EMAIL PROTECTED]> wrote:
> > Thanks guys... that's a ton of info! I am definitely going to use the
> > thread_pool... as soon as I can find the documentation ;D
> >
> > 1- For each feed, I define a "frequency" (every minute, every 30
> > minutes, every hour...) that is updated every time I parse the
> > feed: if the parser returns "new" elements, I increase the
> > frequency (from once per hour to once per 30 min.); if not, I
> > decrease it...
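The adaptive-frequency rule described in point 1 could be sketched like this (names and the interval ladder are my assumptions, reusing the 1/15/30/60-minute intervals from earlier in the thread):

```ruby
# Step a feed's interval down one notch when the last parse found new
# entries, and up one notch when it did not.
INTERVALS = [1, 15, 30, 60]   # minutes, fastest first

def next_interval(current, found_new_entries)
  i = INTERVALS.index(current) || INTERVALS.length - 1
  i = found_new_entries ? [i - 1, 0].max : [i + 1, INTERVALS.length - 1].min
  INTERVALS[i]
end
```

Clamping at both ends keeps a quiet feed from decaying past hourly, and a busy feed from being polled more often than the timer tick.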
> >
> > 2- I also have a "last_update" field which records the last time the
> > feed was parsed.
> >
> > 3- With 1 & 2, I know how "late" I am in parsing a feed... so when I
> > choose my next feed to parse, I always choose the one that is the
> > most "late".
> >
> > I am not sure Stevie's approach of having multiple tasks for the
> > worker applies here. Actually, I am not even scheduling my worker; I
> > just launch it once, and parse_feeds runs forever (while true
> > do... end).
> >
> > Also, if I understand Paul's code well, his approach makes my worker
> > more efficient in general, but doesn't take into account the
> > "lateness" of my feeds.
> >
> >
> > My idea would be to add/remove workers according to how late I am in
> > parsing feeds.
> > If the latest feed is late by more than 10 minutes, I would add one
> > worker... and if the latest feed is late by less than 5 minutes, I
> > would remove one.
> >
> > Does this approach make sense to you?
> >
> > Thanks a lot for your help guys...
> >
> >
> >
> >
> > On 4/23/08, Paul Kmiec <[EMAIL PROTECTED]> wrote:
> > > You can use the built-in thread pool to process more than one feed
> > > within the same worker. So within the worker, you'd do,
> > >
> > > def parse_feeds
> > >   loop do
> > >     feed = Feed.find_feed_to_process
> > >     thread_pool.defer do
> > >       feed.parse
> > >     end
> > >   end
> > > end
> > >
> > > I think the default pool size is 20. You can control the size of the
> > > thread pool using a class-level method; as I recall it is
> > >
> > > pool_size x
> > >
> > > Paul
> > >
> > >
> > > On Wed, Apr 23, 2008 at 7:30 AM, Julien Genestoux
> > > <[EMAIL PROTECTED]> wrote:
> > > > Thanks Adam,
> > > >
> > > > That sounded weird to me as well, having one worker for each feed...
> > > > However, if I only have one worker, that also means I am parsing
> > > > only one feed at any moment. An option, maybe, is to have a few
> > > > workers (depending on the number of feeds) that parse feeds concurrently?
> > > >
> > > > If I only have one worker, what do you think should be the
> > > > winning strategy for choosing the "right" feed to parse? Obviously
> > > > some feeds need to be parsed once every few minutes, while others
> > > > might not need to be parsed more than once an hour...
> > > >
> > > > Any idea/tip on this?
> > > >
> > > > On 4/23/08, Adam Williams <[EMAIL PROTECTED]> wrote:
> > > > > On Apr 23, 2008, at 1:07 AM, Julien Genestoux wrote:
> > > > >
> > > > > > I still have a few questions: should I have one worker for each feed
> > > > > > that is called periodically (add_periodic_timer), or rather one single
> > > > > > worker that calls every feed one by one?
> > > > > >
> > > > > > What is the best solution, performance-wise?
> > > > >
> > > > >
> > > > > Good question... I don't suppose I know exactly. I would start by
> > > > > processing all the feeds in one worker invocation - that is what I
> > > > > have done for sending an unknown amount of email. It just seems wrong
> > > > > to me to invoke a worker for one email at a time.
> > > > >
> > > > > The right answer likely lies in understanding the whole MasterWorker,
> > > > > Packet::Reactor/handler_instance.ask_work bits of the puzzle...
> > > > >
> > > > >
> > > > > adam
> > > > >
> > > > > _______________________________________________
> > > > > Backgroundrb-devel mailing list
> > > > > [email protected]
> > > > > http://rubyforge.org/mailman/listinfo/backgroundrb-devel
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Julien Genestoux
> > > > [EMAIL PROTECTED]
> > > > http://www.ouvre-boite.com
> > > > +1 (415) 254 7340
> > > > +33 (0)8 70 44 76 29
> > > >
> > >
> > >
> >
> >
> >
>
--
Julien Genestoux
[EMAIL PROTECTED]
http://www.ouvre-boite.com
+1 (415) 254 7340
+33 (0)8 70 44 76 29
_______________________________________________
Backgroundrb-devel mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/backgroundrb-devel