Dear Robert,

Thanks for your timely help.  Following your instructions to activate the
bloomfilter setting seems to have eliminated the timeout problem, so that's
great!

"One caveat of using the bloomfilters is that you cannot (a) use wildcard
queries on the index (wildcards still work on the catalog though), and (b)
you cannot partition your updates between RLI servers."

Just for clarification about the wildcard queries: "index" refers to the RLI
and "catalog" refers to the LRC?
Also, I'm not familiar with what you mean by partitioning updates, but it's
probably not something I have to worry about right now...

Thanks again,
Adam


On Fri, Aug 15, 2008 at 2:27 PM, Robert Schuler <[EMAIL PROTECTED]> wrote:

>  Hope you don't mind me cc:ing this to the list (standard procedure here).
>
> It looks like your timeouts are occurring on the Local Replica Catalog (LRC)
> ->updating-> Replica Location Index (RLI). You've bumped your timeout up to
> 120 seconds, which is a good thing to try. Since the timeouts (at least the
> ones I can see in your log) are happening on a server-to-server update, and
> since your catalog contains 70,000+ entries, this may be a good time to
> switch over to the compressed "bloomfilter" updates. Currently, with your
> setup using uncompressed updates, the LRC sends all 70,000+ logical names
> (full strings) to the index (itself, in your case). I'm not sure whether
> this will resolve the issues entirely, but I'm guessing that things are
> getting backlogged because the update times out and the thread is then in
> limbo until the cleanup process kills it (which is also visible in the log).
>
> Here's how to switch over to bloomfilters (a rough sketch of the whole
> sequence follows the steps).
>
> 1) start your server (in isolation I guess or whatever works)
> 2) use the admin tool to tell it to stop sending itself updates:
> globus-rls-admin -d rls://hostname rls://hostname
> 3) stop the server
> 4) in your globus-rls-server.conf change the bloomfilter setting from
> 'false' to 'true'.
> 5) restart your server
> 6) use the admin tool to tell it to start sending itself the compressed
> updates: globus-rls-admin -A rls://hostname rls://hostname
>
> Note: it's possible to do this without ever stopping/restarting an RLS
> server, but for simplicity I'll just use these instructions.
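>
> Roughly, the whole sequence might look like the sketch below. Treat it as a
> sketch only: how you start and stop the server depends on your install (some
> installs use an init-style script under $GLOBUS_LOCATION/sbin, others run
> the daemon directly), and the exact name of the bloomfilter option in
> globus-rls-server.conf can vary, so check your own config file rather than
> taking the option name below literally.
>
> # 1) start the server the way you normally do, then
> # 2) stop the self-updates:
> globus-rls-admin -d rls://hostname rls://hostname
> # 3) stop the server
> # 4) edit globus-rls-server.conf and flip the bloomfilter setting, e.g.:
> #      rli_bloomfilter true    (option name from memory; verify against your file)
> # 5) restart the server
> # 6) turn the self-updates back on, now compressed:
> globus-rls-admin -A rls://hostname rls://hostname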
>
> Once you've done the above, the RLS/LRC will create bloomfilters and will
> send them to the RLS/RLI. This should be much faster than sending the full
> lfn list update.
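>
> If you want to confirm the switchover took effect, the -S flag on the admin
> tool shows the update method: after step 6, the LRC stats from
> globus-rls-admin -S rls://hostname should include an "update method:
> bloomfilter" line, and the RLI stats should show it being updated via
> bloomfilters instead of lfnlists.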
>
> One caveat of using the bloomfilters is that you cannot (a) use wildcard
> queries on the index (wildcards still work on the catalog though), and (b)
> you cannot partition your updates between RLI servers.
>
> Of course, another approach would be to raise your timeout even further and
> keep using the lfn list updates. Also, note that a bloomfilter update
> *could* take more than 120 seconds as the bloomfilter grows in size, but
> 120 seconds ought to be sufficient for your current catalog size.
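>
> For reference, the timeout I'm referring to is the server-side one in
> globus-rls-server.conf; raising it would look something like the line below
> (the value is just an example, and I'm going from memory on the option name,
> so check your own file):
>
> timeout 180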
>
> So, start with this and see whether it eliminates most of those timeouts.
> With luck, it will also resolve the other issue, but I'm not certain.
>
> As another side note, I've tested using a SQLite database with up to 5M
> entries. It worked smoothly up to 1M-2M entries and degraded gracefully
> beyond that. The main issue I see with SQLite is that it doesn't handle lots
> of concurrent users, so if you have 20+ clients simultaneously hitting the
> RLS, the db will have problems.
>
> rob
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] on behalf of Adam Bazinet
> Sent: Fri 8/15/2008 10:00 AM
> To: Robert Schuler
> Subject: Re: [gt-user] RLS woes: database is locked
>
> Dear Robert,
>
> I hope all is well with you.  We enjoyed a period of prosperity with RLS
> where there were no issues, but now I'm afraid that once again I can't keep
> the server up and running for any extended period of time.  FYI, here is
> the
> current state:
>
> [EMAIL PROTECTED]:/export/work/globus-4.1.0>
> ../globus-4.0.6/bin/globus-rls-admin -S rlsn://asparagine
> Version:    4.6
> Uptime:     00:02:20
> LRC stats
>   update method: lfnlist
>   update method: bloomfilter
>   updates lfnlist:     rlsn://asparagine.umiacs.umd.edu:39281 last 12/31/69 19:00:00
>   lfnlist update interval: 86400
>   bloomfilter update interval: 900
>   numlfn: 71210
>   numpfn: 142159
>   nummap: 142159
> RLI stats
>   updated by: rlsn://asparagine.umiacs.umd.edu:39281 last 08/15/08 12:26:05
>   updated via lfnlists
>   numlfn: 71139
>   numlrc: 1
>   numsender: 1
>   nummap: 71139
>
> It has lots of entries that I can't afford to lose right now, so I can't
> very well scrap the sqlite database files and start over.  So when I say I
> can't get it to stay up, I mean one of two things happens:
>
> 1) by far the more common: I just get timeouts on any sort of RLS query
> using globus-rls-cli, or
> 2) the RLS server just crashes.
>
> Now, what I'll do is kill it off, bring it back up in isolation for a good
> 5-10 minutes (it seems happier that way), and then turn our Grid back on,
> which immediately generates lots of (sometimes simultaneous) queries.  It
> may hold up for a while, but before long the timeouts usually start again.
> Scanning through this old thread, I decided to try the -dL3 option you
> suggested, and I'm attaching three log files that I generated during three
> separate attempts at keeping the server up:
>
> 1) first attempt did not use the LD_ASSUME_KERNEL=2.4.1, server crashed
> 2) second attempt did not use the LD_ASSUME_KERNEL=2.4.1, timeout occurred
> 3) third attempt DID use LD_ASSUME..., timeouts occurred
>
> Usually looking at the end of the log is sufficient to see some of the
> problems, but I'm including all of it in case you want to look at our
> settings or whatnot.  I may not have had GLOBUS_ERROR_VERBOSE on before; I
> just turned it on now.  Will that cause more information to be printed to
> this log, or to /var/log/messages?
>
> I really don't know what to do.  The next step, if I have to take it, would
> be to attempt to get it working with Postgres and somehow dump/transfer the
> existing data.  My hunch is that most of these problems are SQLite-specific.
> I don't have a good explanation for why things broke after all this time,
> except that we may have had more simultaneous queries lately.  Thanks for
> any ideas or help you can provide.
>
> Adam
>
>
