Re: [darcs-users] [patch310] Resolve issue 1599: automatically expire unused caches

Eric Kow Thu, 22 Jul 2010 08:51:08 -0700

On Thu, Jul 22, 2010 at 06:12:24 +0000, Adolfo Builes wrote:
> Wed Jul 21 22:29:45 COT 2010  builes.ado...@googlemail.com
>   * Resolve issue 1599: automatically expire unused caches


Resolve issue 1599: automatically expire unused caches
------------------------------------------------------
> builes.ado...@googlemail.com**20100722032945
>  Ignore-this: 38455634b5648ed7555380336ea74464
> ] hunk ./src/Darcs/Global.hs 30
>                        whenDebugMode, withDebugMode, setDebugMode,
>                        debugMessage, debugFail, putTiming,
>                        addCRCWarning, getCRCWarnings, resetCRCWarnings,
> -                      darcsdir
> +                      addBadCache, getBadCacheList, isBadCache, darcsdir,
> +                      isTrustedHttp, addTrustedHttp

> hunk ./src/Darcs/Global.hs 73
>          Just actions <- swapMVar atexitActions Nothing
>          -- from now on atexit will not register new actions
>          mapM_ runAction actions
> +        badCaches <- getBadCacheList
> +        when ( not $ null badCaches ) $ reportBadCaches badCaches

OK, so after running all of the atExitActions, we print out our
accumulated warnings about caches (should we worry about what
happens if one of the atexit actions fails)?

> -
> +    reportBadCaches caches = do
> +                               if length caches > 1
> +                                 then
> +                                  hPutStrLn stderr "\nI could not reach the 
> following repositories (listed in _darcs/prefs/sources):"
> +                                 else
> +                                  hPutStrLn stderr "\nI could not reach the 
> following repository (listed in _darcs/prefs/sources):"

For what it's worth, this could be more simply expressed using the
English module.  It lets you write something like

 "I could not reach the following"
 ++ englishNum (length caches) (Noun "repository") "(list in in ...)"

It looks like you'd have to modify Noun to account for this.
I'll submit a patch.

It may also be a good idea to give the user some advice about what
to do here.

> +                               mapM_ (hPutStrLn stderr) caches
> +                               hPutStrLn stderr ""

Alternatively, hPutStrLn (unlines caches ++ "\n")

> +type BadCacheList = [String]
> +{- NOINLINE _badCacheList -}
> +_badCacheList :: IORef BadCacheList
> +_badCacheList = unsafePerformIO $ newIORef []
> +
> +addBadCache :: String -> IO ()
> +addBadCache cache = modifyIORef _badCacheList (cache:)
> +
> +getBadCacheList :: IO [String]
> +getBadCacheList = readIORef _badCacheList
> +
> +{- NOINLINE isBadCache -}
> +isBadCache :: String -> Bool
> +isBadCache cache = unsafePerformIO $ do badCaches <- getBadCacheList
> +                                        return (cache `elem` badCaches)

Is it really necessary for this to be String -> Bool?
Why not String -> IO Bool?

Alternatively, why not IO (String -> Bool)?

I'm also a bit sceptical about the need for a type synonym.
Maybe just passing around [String] is clearer here.

> +type TrustedHttps = [String]
> +{- NOINLINE _trustedHttpList -}
> +_trustedHttpList :: IORef TrustedHttps
> +_trustedHttpList = unsafePerformIO $ newIORef []

So you maintain a list of URLs which you managed to reach the
first time you contacted them this session.

'Trust' is probably not the right notion to use here (since I'd
be thinking more in terms of security or something like that).
How about "reachable" instead?

> +addTrustedHttp :: String -> IO ()
> +addTrustedHttp http = modifyIORef _trustedHttpList (http:)
> +
> +getTrustedHttps :: IO TrustedHttps
> +getTrustedHttps = readIORef _trustedHttpList
> +
> +{- NOINLINE  isTrustedHttp -}
> +isTrustedHttp :: String -> Bool
> +isTrustedHttp http = unsafePerformIO $ do trustedHttps <- getTrustedHttps
> +                                          return (http `elem` trustedHttps)

Same complaint about possibly unnecessary unsafePerformIO.  I mean, we
certainly use it to achieve tracking of some global information in
Darcs, but it's something we should be be somewhat reluctant to do lest
Darcs surprise us in the future.

> +-- | Returns the sources of a given cache.
> +cacheSource :: CacheLoc -> String
> +cacheSource (Cache _ _ s) = s

Potentially useful helper function.  Seems like we could alternatively
modify the definition of Cache to use the curly brace syntax.

>  fetchFileUsingCache :: Cache -> HashedDir -> String -> IO (String, 
> B.ByteString)
> -fetchFileUsingCache = fetchFileUsingCachePrivate Anywhere
> +fetchFileUsingCache c h s = do result <- fetchFileUsingCachePrivate Anywhere 
> c h s
> +                               return result

This change does not seem necessary.  What was wrong with the
eta-reduced version of this code?

>            sfuc [] _ = return ()
> -          sfuc (c:cs) out | not $ writable c =
> +          sfuc (c:cs) out | (not $ badCache c) && (not $ writable c) =

> -                 then speculateFileOrUrl (fn c) out
> +                 then speculateFileOrUrl (fn c) out `catchNonSignal` (\e -> 
> do
> +                                       addToBadCaches (show e) c)

> -                 else copyFileOrUrl [] (fn c) out Cachable
> +                 else copyFileOrUrl [] (fn c) out Cachable `catchNonSignal` 
> (\e -> do
> +                                       addToBadCaches (show e) c)

OK, so if something goes wrong when speculating on or copying a file,
(but it's not just the user hitting ^-c), we note the cache down as
being problematic.
>  
> +-- | Checks if a given cache needs to be added to the list of bad caches.
> +-- It receives an error caught during execution and the cache.
> +-- For a local cache, if the given source doesn't exist anymore, it is added.
> +-- For HTTP sources if the error is timeout, it is added, if not we check if 
> the source
> +-- still exist, if doesn't exist we added to the list of bad caches.
> +-- For SSH we check if the server is reachable.
> +-- The entries which get added to the cache are no longer tried for the rest 
> of the command.
> +addToBadCaches :: String -> CacheLoc -> IO ()

Thanks for haddocking this.

It seems like addToBadCaches is may be a bit of misnomer first this
function primarily tests if a cache is reachable, and second because it
adds the cache entry to either the white or the blacklist.  But I'm not
sure; maybe I misunderstand the code.  After all, you do only call it
when something goes wrong...

Style point:
The code below is very heavily indented, which makes it harder to read
and review without a wide screen or window.  It's a tricky trade-off
between too much and not enough, so I don't really have any answers.
Perhaps this style could help here

   addToBadCaches e cache
    | condition 1 = 
        ...
    | condition 2 =
        ...

Note the retracted guards and also starting each case on a new line

> +addToBadCaches e cache | isFile (cacheSource cache) = case "No such file or 
> directory" `isInfixOf`  (map toLower e) of
> +                                                        True  -> do
> +                                                                  exist <- 
> doesDirectoryExist $ cacheSource $ cache
> +                                                                  when (not 
> exist) $ addBadCache . cacheSource $ cache
> +                                                        False -> return ()

1. I'm not sure what you gain from pattern-matching on the
   Bool when you could just use an if-then-else.

2. Do we not have any exception types to work with here?  Matching
   on the string seems really fragile.

3. Case in point, you're matching "Foo" against (map toLower "foo")
   which means that this code path is never visited.

> +  | isUrl  (cacheSource cache) = case "timeout" `isInfixOf`  (map toLower e) 
> of
> +                                   True  -> addBadCache . cacheSource $ cache
> +                                   False -> case isTrustedHttp (cacheSource 
> cache) of
> +                                            True  -> return  ()
> +                                            False -> do
> +                                                      let url = if last 
> (cacheSource cache) == '/'
> +                                                                 then 
> cacheSource cache
> +                                                                 else 
> (cacheSource cache) ++ "/"
> +                                                      rsp <- simpleHTTP 
> (getRequest url )
> +                                                      do case rsp of
> +                                                          Left _ ->  return 
> ()
> +                                                          Right response -> 
> if rspCode response /= (2,0,0)
> +                                                                            
> then addBadCache . cacheSource $ cache
> +                                                                            
> else addTrustedHttp (cacheSource cache) >> return ()

OK earlier we see that this function is called if we receive some
exception when fetching over HTTP.

If the exception is not a timeout, we test if the URL is reachable and
whitelist it if so (the whitelist avoids future testing).  But why do we
do this?  And doesn't that confuse Darcs into thinking it successfully
fetched a file when it did not?  Have you tested the actual case where
you get an exception but not a timeout?

Also, are there not web server configurations where
http://example.com/foo gives you a 404, but
http://example.com/foo/_darcs/foo gives you a result?  What happens in
those cases?  In other words, does the test of trying to fetch
http://example.com/foo/ actually make sense?

Also, why is it important to normalise the URL to have a trailing slash?
Is it just to avoid duplicate entries in the whitelist?

Code tidiness:

1. It could be useful to look into reducing the amount of indentation
   you use.  It's a tricky trade-off between too much and not enough,
   so I don't really have any answers.  Perhaps this style could help here

   addToBadCaches e cache
    | condition 1 = 
        ...
    | condition 2 =
        ...

   It's all about making the code as easy to read as possible.

2. It seems like you should just name a source = cacheSource cache,
   which I think will make the code a lot simpler looking

3. Why does addTrustedHttp need to be followed by return () if it already
   returns ()?

> +                       | isSsh  (cacheSource cache) =  checkSshError 
> (cacheSource cache) (getHttpUrl (cacheSource cache))

One principle I saw off some blog post somewhere is that when you have
a single unit of code, the code in that unit should tend to reside on
the same level of abstraction.

So a function like

  if foo
     then do low-level-A1
             low-level-A2
             low-level-A3
     else high-level-B

is a bit odd, when you could instead write something like

  if foo
     then do low-level-A1
             low-level-A2
             low-level-A3
     else do low-level-B1
             low-level-B2
             
OR alternatively

  if foo
     then do high-level-A
     else do high-level-B

But not a mix.  I think why this is useful for readability is that it
makes it very clear that A and B are on the same sort of level of
operation.  Anyway, don't take this sort of advice /too/ seriously.
It's one of these fuzzy things that I suspect you develop a feel for
over time.

In this particular case, you've chosen to break checkSshError into
a separate function, but not the HTTP stuff.  Why?

> +                      | otherwise                  = fail $ "unknown 
> transport protocol for: " ++ (cacheSource cache)

> +-- | Helper function to check reachability of a ssh source.
> +checkSshError :: String -> String -> IO ()
> +checkSshError source url = do
> +                      simpleHTTP (getRequest url )
> +                        `catchNonSignal` (\e -> do
> +                                             case "timeout" `isInfixOf`  
> (map toLower (show e)) of
> +                                               True  -> (addBadCache source) 
> >> return (Left (ErrorMisc "timeout"))
> +                                               False -> return () >> return 
> (Left (ErrorMisc "Unkown error")))
> +                      return ()

> +-- | Given a SSH source it returns the server address appended with HTTP://
> +-- i.e: getHttpUrl "u...@darcs.net:repo" returns "http://darcs.net";
> +getHttpUrl :: String -> String
> +getHttpUrl source = "http://"; ++ ((\(_:xs) -> takeWhile (/=':') xs) . 
> dropWhile (/='@') $ source)


Why does checkSshError take both the source and url argument?  Is there
a case where url cannot be systematically derived from the source?

What happens to the Either type that you return in your handler?
Doesn't it get discarded?

Why are we going about things in this roundabout way of checking the SSH
path by converting the source name to an HTTP URI and fetching it over
HTTP?  What about cases where people have a working SSH server but not
necessarily an HTTP server on a given machine?

>      where ffuc (c:cs)
> -           | not (writable c) && (Anywhere == fromWhere || isFile (fn c)) =
> +           | not (writable c) && (Anywhere == fromWhere || isFile (fn c)) && 
> (not $ badCache c) =

This seems ok although incidentally,
  not (badCache c)
is probably simpler than
  (not $ badCache c)

> -              `catchall` ffuc cs
> +              `catchNonSignal` (\e -> do
> +                                       addToBadCaches (show e) c
> +                                       ffuc cs )

> -              `catchall` do (fname,x) <- ffuc cs
> -                            do createCache c subdir
> -                               createLink fname (fn c)
> -                               return (fn c, x)
> -                             `catchall`
> -                             do gzWriteFilePS (fn c) x `catchall` return ()
> -                                return (fname,x)
> +              `catchNonSignal` (\ e -> do addToBadCaches (show e) c ; 
> (fname,x) <- ffuc cs
> +                                          do createCache c subdir
> +                                             createLink fname (fn c)
> +                                             return (fn c, x)
> +                                           `catchall`
> +                                           do gzWriteFilePS (fn c) x 
> `catchall` return ()
> +                                              return (fname,x))

This block of code actually has very few changes in it, but the fact
would have been more apparent if you tried to avoid the re-indentation
(not always possible).

In this case, you could have

-              `catchall` do (fname,x) <- ffuc cs
-                            do createCache c subdir
-                               createLink fname (fn c)
...
+              `catchNonSignal` (\ e ->
+                         do addToBadCaches (show e) c
+                            (fname,x) <- ffuc cs
+                            do createCache c subdir

I also took the liberty of simplifying away the semicolon.

> +-- | Checks if a given cache is in the list of bad caches.
> +badCache :: CacheLoc -> Bool
> +badCache = isBadCache . cacheSource

Do you really use this?
>  import Data.Maybe ( isJust, catMaybes )
>  import Control.Monad ( msum )

> -    dff <- fetchFilePS (repo ++ "/" ++ df) Cachable `catchall` return B.empty
> +    (timeOut,dff) <- (fetchFilePS (repo ++ "/" ++ df) Cachable >>= \ r -> 
> return (False,r))
> +                    `catchNonSignal` (\ e -> return (isTimeOut e,B.empty))

This may be more readable using do notation

     (timeOut,dff) <- do r <- fetchFilePS (repo ++ "/" ++ df) Cachable
                         return (False,r)
                      `catchNonSignal` (\ e -> return (isTimeOut e,B.empty))

The same sort of worry as above.  What happens if your exception is not
a timeout?  I'm concerned that you succeed here, where what you really
want to do maybe is re-throw the exception.

>      -- below is a workaround for servers that don't return a 404 on 
> nonexistent files
> hunk ./src/Darcs/Repository/Format.hs 57
> -    rf <- if B.null dff || isJust (BC.elemIndex '<' dff)
[snip]
> -          else return $ Right $ parseRepoFormat dff
> +    rf <- case timeOut of
> +          False ->
> +               if B.null dff || isJust (BC.elemIndex '<' dff)
[snip indentation change]
> +               else return $ Right $ parseRepoFormat dff
> +          True  -> return $ Left $ "Failed establishing a connection with 
> "++ repo ++ " Timeout was reached"

>                                    (\e -> return (Left (prettyException e)))
> +          isTimeOut e = "timeout" `isInfixOf` map toLower (show e)

It seems like the notion of whether something is a timeout or not (you
check this by looking at the exception text) could be refactored.
I guess it's not a big deal if it is though.

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
For a faster response, please try +44 (0)1273 64 2905.

signature.asc
Description: Digital signature

_______________________________________________
darcs-users mailing list
darcs-users@darcs.net
http://lists.osuosl.org/mailman/listinfo/darcs-users

Re: [darcs-users] [patch310] Resolve issue 1599: automatically expire unused caches

Reply via email to