Hi Alexey,

One thing that would certainly be useful is a more elaborate description
of a bundle when you send it (I have this in my ~/.darcs/defaults: "send
edit-description", since I otherwise keep forgetting to invoke send with
--edit). I will try to review the patch, although I am not sure what its
intent was.

Use cache while getting packed repository
-----------------------------------------

> hunk ./src/Darcs/Commands/Optimize.lhs 27
>  import Control.Applicative ( (<$>) )
>  import Control.Monad ( when, unless )
>  import Data.Maybe ( isJust )
> +import Data.List ( sort )
>  import System.Directory ( getDirectoryContents, doesDirectoryExist,
> hunk ./src/Darcs/Commands/Optimize.lhs 29
> -                          doesFileExist, renameFile )
> +                          doesFileExist, renameFile, getModificationTime )
>  import System.IO.Unsafe ( unsafeInterleaveIO )
>  import qualified Data.ByteString.Char8 as BS
>  import qualified Data.ByteString.Lazy as BL
> hunk ./src/Darcs/Commands/Optimize.lhs 36
>  
>  import Storage.Hashed.Darcs( decodeDarcsSize )
>  
> -import Darcs.Hopefully ( info )
> +import Darcs.Hopefully ( info, extractHash )
>  import Darcs.Commands ( DarcsCommand(..), nodefaults )
>  import Darcs.Arguments ( DarcsFlag( UpgradeFormat, UseHashedInventory,
>                                      Compress, UnCompress,
> hunk ./src/Darcs/Commands/Optimize.lhs 73
>  
>  -- imports for optimize --upgrade; to be tidied
>  import System.Directory ( createDirectoryIfMissing, removeFile )
> -import System.FilePath.Posix ( takeExtension, (</>), (<.>) )
> +import System.FilePath.Posix ( takeExtension, (</>), (<.>), takeFileName )
>  
>  import Progress ( beginTedious, endTedious, tediousSize )
>  import Darcs.Flags ( compression )
Just imports.

> hunk ./src/Darcs/Commands/Optimize.lhs 148
>  optimizeCmd origopts _ = do
>      when (UpgradeFormat `elem` origopts) optimizeUpgradeFormat
>      withRepoLock opts $- \repository -> do
> -    when (OptimizeHTTP `elem` origopts) doOptimizeHTTP
> +    when (OptimizeHTTP `elem` origopts) $ doOptimizeHTTP repository
>      if (OptimizePristine `elem` opts)
>         then doOptimizePristine repository
>         else do cleanRepository repository
Just an extra parameter.

> hunk ./src/Darcs/Commands/Optimize.lhs 372
>        gzs <- filter ((== ".gz") . takeExtension) `fmap` getDirectoryContents "."
>        mapM_ removeFile gzs
>  
> -doOptimizeHTTP :: IO ()
> -doOptimizeHTTP = do
> +doOptimizeHTTP :: RepoPatch p => Repository p C(r u t) -> IO ()
> +doOptimizeHTTP repo = do
>    rf <- either fail return =<< identifyRepoFormat "."
>    unless (formatHas HashedInventory rf) . fail $
>      "Unsupported repository format:\n" ++
> hunk ./src/Darcs/Commands/Optimize.lhs 379
>      "  only hashed repositories can be optimized for HTTP"
>    createDirectoryIfMissing False packsDir
> -  ps <- dirContents' "patches" $ \x -> all (x /=) ["unrevert", "pending",
> -    "pending.tentative"]
> +  ps <- mapRL hashedPatchFileName . newset2RL <$> readRepo repo
We use the inventories to get a list of patches instead of listing the
directory contents -- this should indeed be more robust. OK. We could
also do something similar to get a list of the actually relevant
inventories: these tend to accumulate a fair amount of garbage, so it
may be even more useful in that case. Check out readInventories and
readInventoryPrivate in Darcs.Repository.HashedRepo. (The latter reads
a single inventory file and gives you a list of PatchInfos and a Maybe
String (= the hash) of the next inventory in the chain; Nothing means
there are no more inventories. I should probably turn that into a
haddock. Hm.)
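
Something along these lines could then collect just the live
inventories by following that chain (an untested sketch; I am assuming
readInventoryPrivate :: FilePath -> IO ([PatchInfo], Maybe String),
which may not match the real signature exactly, and darcsdir and (</>)
as used in the patch above):

  -- Follow the chain of inventories from the given head hash and
  -- collect the hashes of those actually reachable (i.e. not garbage).
  liveInventories :: Maybe String -> IO [String]
  liveInventories Nothing     = return []
  liveInventories (Just hash) = do
    (_, next) <- readInventoryPrivate (darcsdir </> "inventories" </> hash)
    rest <- liveInventories next
    return (hash : rest)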

>    BL.writeFile (patchesTar <.> "part") . compress . write =<<
>      mapM fileEntry' ps
>    renameFile (patchesTar <.> "part") patchesTar
> hunk ./src/Darcs/Commands/Optimize.lhs 383
> -  let i = darcsdir </> "hashed_inventory"
> -  is <- dirContents "inventories"
> -  pr <- dirContents "pristine.hashed"
> -  BL.writeFile (basicTar <.> "part") . compress . write =<<
> -    mapM fileEntry' (i : (is ++ pr))
> +  is <- sortByMTime =<< dirContents "inventories"
> +  writeFile (darcsdir </> "tmp-inventories") . unlines $ map takeFileName is
> +  pr <- sortByMTime =<< dirContents "pristine.hashed"
> +  writeFile (darcsdir </> "tmp-pristine") . unlines $ map takeFileName pr
> +  BL.writeFile (basicTar <.> "part") . compress . write =<< mapM fileEntry' (
> +    [ darcsdir </> "tmp-inventories"
> +    , darcsdir </> "tmp-pristine"
> +    , darcsdir </> "hashed_inventory"
> +    ] ++ reverse pr ++ reverse is)
This sorts the basic.tar.gz in newest-first order (pristines first,
then inventories). It did make me wonder, though, whether we actually
want the inventories in basic.tar.gz at all... They are not fetched by
an ordinary lazy get, are they?

It seems that tmp-pristine and tmp-inventories carry a list of the
files that are packed in the tarball. I am not sure this is necessary,
since the tar archive itself already lists the names of its members?
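
For what it's worth, the names can be read back straight from the
archive; a minimal sketch using the same tar package the patch already
uses (tarPaths is my name for it):

  import qualified Codec.Archive.Tar as Tar
  import qualified Data.ByteString.Lazy as BL
  import Codec.Compression.GZip ( decompress )
  import Control.Applicative ( (<$>) )

  -- List the paths of all members of a gzipped tarball; read errors
  -- just cut the list short, which is fine for a sketch.
  tarPaths :: FilePath -> IO [FilePath]
  tarPaths f = Tar.foldEntries ((:) . Tar.entryPath) [] (const [])
                 . Tar.read . decompress <$> BL.readFile f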

>    renameFile (basicTar <.> "part") basicTar
> hunk ./src/Darcs/Commands/Optimize.lhs 393
> +  removeFile $ darcsdir </> "tmp-inventories"
> +  removeFile $ darcsdir </> "tmp-pristine"
Get rid of the temporaries (could these be avoided by constructing these
two lists in memory and feeding them to Tar?).
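
They probably could: Codec.Archive.Tar.Entry exports fileEntry and
toTarPath, so something like the following should work (a sketch;
error handling elided):

  import qualified Codec.Archive.Tar.Entry as Tar
  import qualified Data.ByteString.Lazy.Char8 as BL

  -- Build one of the two index entries in memory, instead of writing
  -- a temporary file to disk first and packing that.
  indexEntry :: FilePath -> [FilePath] -> Tar.Entry
  indexEntry name files =
    case Tar.toTarPath False name of
      Left err   -> error err  -- a real version would propagate this
      Right path -> Tar.fileEntry path (BL.pack $ unlines files)

E.g. indexEntry (darcsdir </> "tmp-inventories") (map takeFileName is)
in place of the writeFile. That said, see above -- the lists may not be
needed at all.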

>   where
>    packsDir = darcsdir </> "packs"
>    basicTar = packsDir </> "basic.tar.gz"
> hunk ./src/Darcs/Commands/Optimize.lhs 406
>    dirContents d = dirContents' d $ const True
>    dirContents' d f = map ((darcsdir </> d) </>) . filter (\x ->
>      head x /= '.' && f x) <$> getDirectoryContents (darcsdir </> d)
> +  hashedPatchFileName x = case extractHash x of
> +    Left _ -> fail "unexpected unhashed patch"
> +    Right h -> darcsdir </> "patches" </> h
> +  sortByMTime xs = map snd . sort <$> mapM (\x -> (\t -> (t, x)) <$>
> +    getModificationTime x) xs
Helpers.

> hunk ./src/Darcs/Repository.hs 48
>      ) where
>  
>  import System.Exit ( ExitCode(..), exitWith )
> -import Data.List ( isSuffixOf )
>  import Data.Maybe( catMaybes )
>  
>  import Darcs.Repository.State( readRecorded, readUnrecorded, readWorking, unrecordedChanges
> hunk ./src/Darcs/Repository.hs 77
>      )
>  import Darcs.Repository.Merge( tentativelyMergePatches, considerMergeToWorking )
>  import Darcs.Repository.Cache ( unionRemoteCaches, fetchFileUsingCache,
> -                                speculateFileUsingCache, HashedDir(..), Cache(..), CacheLoc(..), WritableOrNot(..))
> +                                speculateFileUsingCache, HashedDir(..), Cache(..),
> +                                CacheLoc(..), WritableOrNot(..), hashedDir )
>  import Darcs.Patch.Set ( PatchSet(..), SealedPatchSet, newset2RL, newset2FL, progressPatchSet )
>  #ifdef GADT_WITNESSES
>  import Darcs.Patch.Set ( Origin )
> hunk ./src/Darcs/Repository.hs 86
>  import URL ( maxPipelineLength )
>  
>  import Control.Applicative ( (<$>) )
> +import Control.Concurrent ( forkIO )
>  import Control.Monad ( unless, when )
>  import System.Directory ( createDirectory, renameDirectory,
> hunk ./src/Darcs/Repository.hs 89
> -                          createDirectoryIfMissing, renameFile )
> +                          createDirectoryIfMissing, renameFile, doesFileExist )
>  import System.IO.Error ( isAlreadyExistsError )
>  
>  import qualified Darcs.Repository.DarcsRepo as DarcsRepo
> hunk ./src/Darcs/Repository.hs 134
>  import qualified Codec.Archive.Tar as Tar
>  import Codec.Compression.GZip ( compress, decompress )
>  import qualified Data.ByteString.Char8 as BS
> -import qualified Data.ByteString.Lazy as BL
> +import qualified Data.ByteString.Lazy.Char8 as BL
>  
>  #include "impossible.h"
>  
Imports.

> hunk ./src/Darcs/Repository.hs 287
>    createDirectoryIfMissing False $ toDir </> darcsdir </> "pristine.hashed"
>    createDirectoryIfMissing False $ toDir </> darcsdir </> "patches"
>    copySources toRepo fromDir
> +  Repo _ _ _ (DarcsRepository _ toCache3) <-
> +    identifyRepositoryFor fromRepo "."
I am not sure why this is here, or what its effect is. How does
toCache3 differ from toCache2?

>    -- unpack inventory & pristine cache
> hunk ./src/Darcs/Repository.hs 290
> -  writeCompressed . Tar.read $ decompress b
> +  writeBasic toCache3 . Tar.read $ decompress b
>    createPristineDirectoryTree toRepo "."
>    -- pull new patches
>    us <- readRepo toRepo
> hunk ./src/Darcs/Repository.hs 306
>    -- get old patches
>    unless (any (`elem` opts) [Partial, Lazy, Ephemeral]) $ do
>      putInfo "Copying patches, to get lazy repository hit ctrl-C..."
> +    _ <- forkIO . fetchFiles toCache3 HashedPatchesDir .
> +      mapFL hashedPatchFileName $ newset2FL us
>      writeCompressed . Tar.read . decompress =<< fetchFileLazyPS (fromPacksDir ++
>        "patches.tar.gz") Uncachable
>   where
> hunk ./src/Darcs/Repository.hs 311
> +  writeBasic c (Tar.Next is (Tar.Next pr (Tar.Next hi xs))) = do
> +    case map Tar.entryContent [is, pr, hi] of
> +      [Tar.NormalFile is' _, Tar.NormalFile pr' _, Tar.NormalFile hi' _] -> do
> +        _ <- forkIO $ do
> +          fetchFiles c HashedInventoriesDir . lines $ BL.unpack is'
> +          fetchFiles c HashedPristineDir . lines $ BL.unpack pr'
> +        BL.writeFile (darcsdir </> "hashed_inventory") hi'
> +        writeCompressed xs
> +      _ -> fail "Unexpected non-file tar entry"
> +  writeBasic _ _ =  fail "Error in basic tar file"
As far as I can tell, this extracts the tmp-{pristine,inventories}
lists from the tarball and starts copying the listed files from the
cache, concurrently with unpacking the tarball itself; whichever comes
first wins. I am wondering, however, whether the cache we are using
here only contains local sources (i.e. caches and possibly repos...)?
In that case it would make some sense, although I don't see any
exception handling in fetchFiles (below). The overall idea is
interesting, but I don't see how it saves any bandwidth (or time): we
still need to download the complete basic.tar.gz, right?

Oh. I see now that you stop the unpacking and the download once you
have a *first* hit... I think that's wrong, though. It is, if nothing
else, prone to races: either of the threads could win. If the tarball
loses, it is not used at all, which could happen even if you don't have
anything cached. Nevertheless, it is still downloaded, at least until
the process is terminated, slowing down the retrieval of any pristine
content and patches that may be missing from the cache. Moreover, it
could also happen that the tarball is fully downloaded even though you
have almost everything in the cache.

So instead, it would be good if we could stop downloading once we have
all the files. A solution like that also wouldn't need the
tmp-{pristine,inventories} lists, nor the concurrency, since it would
have to parse the directory entries and inventories anyway and could
therefore see whether anything is still missing. The tarball would come
in mtime order as it already does (but hashed_inventory needs to come
first, in any case). Then you could just maintain a list of hashes that
are still missing as you process the tarball; once that list is empty,
you stop. The list starts out with the pristine root, which is stored
in hashed_inventory, and every time an item is removed from the list,
anything it mentions (this is where you need to parse things) that we
don't have yet (you look in the cache for those) is added. There are
utilities in Storage.Hashed.Darcs that should help with that.
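
In pseudo-Haskell, roughly like this (the helpers -- unpackEntry,
inCache and references, the last being where the parsing happens,
presumably with help from Storage.Hashed.Darcs -- are all
hypothetical):

  import qualified Codec.Archive.Tar as Tar
  import qualified Data.Set as Set
  import Control.Monad ( filterM )
  import System.FilePath.Posix ( takeFileName )

  -- Hypothetical helpers:
  --   unpackEntry :: Tar.Entry -> IO ()       -- write out one entry
  --   inCache     :: String -> IO Bool        -- hash already available?
  --   references  :: Tar.Entry -> [String]    -- hashes an entry mentions

  -- Unpack tarball entries until nothing is missing any more. The
  -- initial set contains just the pristine root from hashed_inventory.
  unpackUntil :: Set.Set String -> Tar.Entries -> IO ()
  unpackUntil missing _ | Set.null missing = return ()  -- done: stop the download
  unpackUntil missing (Tar.Next e es) = do
    let hash = takeFileName (Tar.entryPath e)
    unpackEntry e  -- write out, or hardlink from the cache (see below)
    -- whatever this entry mentions and we don't have yet becomes missing
    new <- filterM (fmap not . inCache) (references e)
    unpackUntil (Set.delete hash missing `Set.union` Set.fromList new) es
  unpackUntil _ Tar.Done = return ()
  unpackUntil _ (Tar.Fail e) = fail e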

Also, anything you unpack from the tarball should be linked into the
cache if it's not there yet. If it is, the tarball's copy should
probably be ignored and you should hardlink from the cache instead
(saves space).
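
Roughly this (a POSIX-only sketch; computing the cache path for an
entry is elided):

  import System.Posix.Files ( createLink )
  import System.Directory ( doesFileExist )
  import qualified Data.ByteString.Lazy as BL

  -- Prefer hardlinking an existing cache copy over the tarball's copy;
  -- otherwise write the tarball's bytes out and link them into the cache.
  linkOrWrite :: FilePath -> FilePath -> BL.ByteString -> IO ()
  linkOrWrite cached target bytes = do
    hit <- doesFileExist cached
    if hit
      then createLink cached target
      else do BL.writeFile target bytes
              createLink target cached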

>    writeCompressed Tar.Done = return ()
>    writeCompressed (Tar.Next x xs) = case Tar.entryContent x of
>      Tar.NormalFile x' _ -> do
> hunk ./src/Darcs/Repository.hs 325
>        let p = Tar.entryPath x
> -      withTemp $ \p' -> do
> -        BL.writeFile p' $ if "hashed_inventory" `isSuffixOf` p
> -          then x'
> -          else compress x'
> -        renameFile p' p
> -      writeCompressed xs
> +      ex <- doesFileExist p
> +      unless ex $ do
> +        withTemp $ \p' -> do
> +          BL.writeFile p' $ compress x'
> +          renameFile p' p
> +        writeCompressed xs
Avoids clobbering and removes the special handling of hashed_inventory
(that one is now handled in writeBasic).

>      _ -> fail "Unexpected non-file tar entry"
>    writeCompressed (Tar.Fail e) = fail e
>    putInfo = when (not $ Quiet `elem` opts) . putStrLn
> hunk ./src/Darcs/Repository.hs 334
> -
> +  fetchFiles _ _ [] = return ()
> +  fetchFiles c d (f:fs) = do
> +    x <- doesFileExist $ hashedDir d </> f
> +    unless x $ do
> +      fetchFileUsingCache c d f
> +      fetchFiles c d fs
> +  hashedPatchFileName x = case extractHash x of
> +    Left _ -> fail "unexpected unhashed patch"
> +    Right h -> h
Helpers.

Yours,
   Petr.

PS: I haven't had a chance to look at the performance of the patch yet.
I won't apply it until I hear from you regarding the above comments --
I will try to benchmark it and see whether it works at least in the
usual cases. Something will nevertheless have to be done about the
download of the tarball in cases where it is never used.