Hi Alexey,

one thing that would certainly be useful is a more elaborate description of a bundle when you send it (I have this in my ~/.darcs/defaults: "send edit-description", since I otherwise keep forgetting to invoke send with --edit). I will try to review the patch, although I am not sure what its intent was.
Use cache while getting packed repository
-----------------------------------------

> hunk ./src/Darcs/Commands/Optimize.lhs 27
>  import Control.Applicative ( (<$>) )
>  import Control.Monad ( when, unless )
>  import Data.Maybe ( isJust )
> +import Data.List ( sort )
>  import System.Directory ( getDirectoryContents, doesDirectoryExist,
> hunk ./src/Darcs/Commands/Optimize.lhs 29
> -    doesFileExist, renameFile )
> +    doesFileExist, renameFile, getModificationTime )
>  import System.IO.Unsafe ( unsafeInterleaveIO )
>  import qualified Data.ByteString.Char8 as BS
>  import qualified Data.ByteString.Lazy as BL
> hunk ./src/Darcs/Commands/Optimize.lhs 36
>
>  import Storage.Hashed.Darcs( decodeDarcsSize )
>
> -import Darcs.Hopefully ( info )
> +import Darcs.Hopefully ( info, extractHash )
>  import Darcs.Commands ( DarcsCommand(..), nodefaults )
>  import Darcs.Arguments ( DarcsFlag( UpgradeFormat, UseHashedInventory,
>                                      Compress, UnCompress,
> hunk ./src/Darcs/Commands/Optimize.lhs 73
>
>  -- imports for optimize --upgrade; to be tidied
>  import System.Directory ( createDirectoryIfMissing, removeFile )
> -import System.FilePath.Posix ( takeExtension, (</>), (<.>) )
> +import System.FilePath.Posix ( takeExtension, (</>), (<.>), takeFileName )
>
>  import Progress ( beginTedious, endTedious, tediousSize )
>  import Darcs.Flags ( compression )

(just imports)

> hunk ./src/Darcs/Commands/Optimize.lhs 148
>  optimizeCmd origopts _ = do
>    when (UpgradeFormat `elem` origopts) optimizeUpgradeFormat
>    withRepoLock opts $- \repository -> do
> -  when (OptimizeHTTP `elem` origopts) doOptimizeHTTP
> +  when (OptimizeHTTP `elem` origopts) $ doOptimizeHTTP repository
>    if (OptimizePristine `elem` opts)
>       then doOptimizePristine repository
>       else do cleanRepository repository

Just an extra parameter.

> hunk ./src/Darcs/Commands/Optimize.lhs 372
>    gzs <- filter ((== ".gz") . takeExtension) `fmap` getDirectoryContents "."
>    mapM_ removeFile gzs
>
> -doOptimizeHTTP :: IO ()
> -doOptimizeHTTP = do
> +doOptimizeHTTP :: RepoPatch p => Repository p C(r u t) -> IO ()
> +doOptimizeHTTP repo = do
>    rf <- either fail return =<< identifyRepoFormat "."
>    unless (formatHas HashedInventory rf) . fail $
>      "Unsupported repository format:\n" ++
> hunk ./src/Darcs/Commands/Optimize.lhs 379
>      "  only hashed repositories can be optimized for HTTP"
>    createDirectoryIfMissing False packsDir
> -  ps <- dirContents' "patches" $ \x -> all (x /=) ["unrevert", "pending",
> -    "pending.tentative"]
> +  ps <- mapRL hashedPatchFileName . newset2RL <$> readRepo repo

We use the inventories to get the list of patches instead of listing directory contents -- this should indeed be more robust. OK.

We could also do something like this to get a list of the actually relevant inventories: these tend to accumulate a fair amount of garbage, so it may be even more useful in that case. Check out readInventories and readInventoryPrivate in Darcs.Repository.HashedRepo. (The latter reads a single inventory file and gives you a list of PatchInfos and a Maybe String (= the hash) of the next inventory in the list; Nothing means there are no more inventories. I should probably turn that into a haddock. Hm.)
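Something along these lines, say -- a minimal sketch assuming the readInventoryPrivate signature described above; liveInventories itself (and the exact imports) are my assumptions, not the actual darcs API:

  import System.FilePath.Posix ( (</>) )
  import Darcs.Global ( darcsdir )
  import Darcs.Repository.HashedRepo ( readInventoryPrivate )

  -- Collect the hashes of the inventories actually reachable from
  -- hashed_inventory, following the chain of "next inventory" hashes.
  -- Assumes readInventoryPrivate :: FilePath -> IO ([PatchInfo], Maybe String)
  -- as described above.
  liveInventories :: IO [String]
  liveInventories =
      go . snd =<< readInventoryPrivate (darcsdir </> "hashed_inventory")
    where
      go Nothing = return []
      go (Just hash) = do
        (_, next) <- readInventoryPrivate (darcsdir </> "inventories" </> hash)
        (hash :) `fmap` go next

Everything not in that list could then be skipped when packing (and is a candidate for garbage collection anyway).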
>    BL.writeFile (patchesTar <.> "part") . compress . write =<<
>      mapM fileEntry' ps
>    renameFile (patchesTar <.> "part") patchesTar
> hunk ./src/Darcs/Commands/Optimize.lhs 383
> -  let i = darcsdir </> "hashed_inventory"
> -  is <- dirContents "inventories"
> -  pr <- dirContents "pristine.hashed"
> -  BL.writeFile (basicTar <.> "part") . compress . write =<<
> -    mapM fileEntry' (i : (is ++ pr))
> +  is <- sortByMTime =<< dirContents "inventories"
> +  writeFile (darcsdir </> "tmp-inventories") . unlines $ map takeFileName is
> +  pr <- sortByMTime =<< dirContents "pristine.hashed"
> +  writeFile (darcsdir </> "tmp-pristine") . unlines $ map takeFileName pr
> +  BL.writeFile (basicTar <.> "part") . compress . write =<< mapM fileEntry' (
> +    [ darcsdir </> "tmp-inventories"
> +    , darcsdir </> "tmp-pristine"
> +    , darcsdir </> "hashed_inventory"
> +    ] ++ reverse pr ++ reverse is)

This sorts the basic.tar.gz in newest-first order (pristines first, then inventories). It did make me wonder, though, whether we actually want the inventories in basic.tar.gz at all... They are not fetched by an ordinary lazy get, are they?

It seems that tmp-pristine and tmp-inventories carry a list of the files that are packed in the tarball. I am not sure this is necessary, since the tar itself is already a list with the filenames in it?

>    renameFile (basicTar <.> "part") basicTar
> hunk ./src/Darcs/Commands/Optimize.lhs 393
> +  removeFile $ darcsdir </> "tmp-inventories"
> +  removeFile $ darcsdir </> "tmp-pristine"

Get rid of the temporaries. (Could these be avoided by constructing the two lists in memory and feeding them to Tar? See the sketch after the import hunks below.)

>    where
>      packsDir = darcsdir </> "packs"
>      basicTar = packsDir </> "basic.tar.gz"
> hunk ./src/Darcs/Commands/Optimize.lhs 406
>      dirContents d = dirContents' d $ const True
>      dirContents' d f = map ((darcsdir </> d) </>) . filter (\x ->
>        head x /= '.' && f x) <$> getDirectoryContents (darcsdir </> d)
> +    hashedPatchFileName x = case extractHash x of
> +      Left _ -> fail "unexpected unhashed patch"
> +      Right h -> darcsdir </> "patches" </> h
> +    sortByMTime xs = map snd . sort <$> mapM (\x -> (\t -> (t, x)) <$>
> +      getModificationTime x) xs

Helpers.

> hunk ./src/Darcs/Repository.hs 48
>      ) where
>
>  import System.Exit ( ExitCode(..), exitWith )
> -import Data.List ( isSuffixOf )
>  import Data.Maybe( catMaybes )
>
>  import Darcs.Repository.State( readRecorded, readUnrecorded, readWorking,
>                                 unrecordedChanges )
> hunk ./src/Darcs/Repository.hs 77
>      )
>  import Darcs.Repository.Merge( tentativelyMergePatches,
>                                 considerMergeToWorking )
>  import Darcs.Repository.Cache ( unionRemoteCaches, fetchFileUsingCache,
> -    speculateFileUsingCache, HashedDir(..), Cache(..), CacheLoc(..),
> -    WritableOrNot(..))
> +    speculateFileUsingCache, HashedDir(..), Cache(..),
> +    CacheLoc(..), WritableOrNot(..), hashedDir )
>  import Darcs.Patch.Set ( PatchSet(..), SealedPatchSet, newset2RL, newset2FL,
>                           progressPatchSet )
> #ifdef GADT_WITNESSES
>  import Darcs.Patch.Set ( Origin )
> hunk ./src/Darcs/Repository.hs 86
>  import URL ( maxPipelineLength )
>
>  import Control.Applicative ( (<$>) )
> +import Control.Concurrent ( forkIO )
>  import Control.Monad ( unless, when )
>  import System.Directory ( createDirectory, renameDirectory,
> hunk ./src/Darcs/Repository.hs 89
> -    createDirectoryIfMissing, renameFile )
> +    createDirectoryIfMissing, renameFile, doesFileExist )
>  import System.IO.Error ( isAlreadyExistsError )
>
>  import qualified Darcs.Repository.DarcsRepo as DarcsRepo
> hunk ./src/Darcs/Repository.hs 134
>  import qualified Codec.Archive.Tar as Tar
>  import Codec.Compression.GZip ( compress, decompress )
>  import qualified Data.ByteString.Char8 as BS
> -import qualified Data.ByteString.Lazy as BL
> +import qualified Data.ByteString.Lazy.Char8 as BL
>
> #include "impossible.h"

Imports.
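As for the in-memory variant mentioned above, a sketch of what I have in mind, using the tar package's entry API. fileListEntry is a made-up name, and the Left case deserves better handling; the rest is the standard Codec.Archive.Tar interface:

  import qualified Codec.Archive.Tar as Tar
  import qualified Codec.Archive.Tar.Entry as Tar
  import qualified Data.ByteString.Lazy.Char8 as BL

  -- Hypothetical helper: turn a list of file names into an in-memory
  -- tar entry, so tmp-inventories and tmp-pristine never touch the disk.
  fileListEntry :: FilePath -> [FilePath] -> Tar.Entry
  fileListEntry name files =
      case Tar.toTarPath False name of
        Left err -> error err -- the name does not fit the tar format
        Right p  -> Tar.fileEntry p (BL.pack $ unlines files)

The two list entries would then just be consed onto the result of mapM fileEntry' before write, and the removeFile calls in hunk 393 would disappear.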
> hunk ./src/Darcs/Repository.hs 287
>    createDirectoryIfMissing False $ toDir </> darcsdir </> "pristine.hashed"
>    createDirectoryIfMissing False $ toDir </> darcsdir </> "patches"
>    copySources toRepo fromDir
> +  Repo _ _ _ (DarcsRepository _ toCache3) <-
> +    identifyRepositoryFor fromRepo "."

I am not sure why this is here or what its effect is. How does toCache3 differ from toCache2?

>    -- unpack inventory & pristine cache
> hunk ./src/Darcs/Repository.hs 290
> -  writeCompressed . Tar.read $ decompress b
> +  writeBasic toCache3 . Tar.read $ decompress b
>    createPristineDirectoryTree toRepo "."
>    -- pull new patches
>    us <- readRepo toRepo
> hunk ./src/Darcs/Repository.hs 306
>    -- get old patches
>    unless (any (`elem` opts) [Partial, Lazy, Ephemeral]) $ do
>      putInfo "Copying patches, to get lazy repository hit ctrl-C..."
> +    _ <- forkIO . fetchFiles toCache3 HashedPatchesDir .
> +      mapFL hashedPatchFileName $ newset2FL us
>      writeCompressed . Tar.read . decompress =<< fetchFileLazyPS
>        (fromPacksDir ++ "patches.tar.gz") Uncachable
>    where
> hunk ./src/Darcs/Repository.hs 311
> +  writeBasic c (Tar.Next is (Tar.Next pr (Tar.Next hi xs))) = do
> +    case map Tar.entryContent [is, pr, hi] of
> +      [Tar.NormalFile is' _, Tar.NormalFile pr' _, Tar.NormalFile hi' _] -> do
> +        _ <- forkIO $ do
> +          fetchFiles c HashedInventoriesDir . lines $ BL.unpack is'
> +          fetchFiles c HashedPristineDir . lines $ BL.unpack pr'
> +        BL.writeFile (darcsdir </> "hashed_inventory") hi'
> +        writeCompressed xs
> +      _ -> fail "Unexpected non-file tar entry"
> +  writeBasic _ _ = fail "Error in basic tar file"

As far as I can tell, this extracts the tmp-{pristine,inventories} lists from the tarball and starts to copy the listed files from the cache, concurrently with unpacking the tarball; whichever gets to a file first wins. I am wondering, however, whether the cache we are using here contains only local sources (i.e. caches and possibly repositories). In that case it would make some sense, although I don't see any exception handling in fetchFiles (below).

The overall idea is interesting, but I don't see how it saves any bandwidth (or time): we still need to download the complete basic.tar.gz, right? Oh. I see now that you stop the unpacking and the download once you have a *first* hit... I think that's wrong, though. It is, if nothing else, prone to races: either of the threads could win. If the tarball loses, it is not used at all -- which could happen even if you don't have anything cached. Nevertheless, it is still downloaded, at least until the process is terminated, slowing down the retrieval of any pristine content and patches that may be missing from the cache. Moreover, the opposite can also happen: even though you have almost everything in the cache, the tarball is fully downloaded.

So instead, it would be good if we could stop downloading once we have all the files. A solution like that wouldn't need the tmp-{pristine,inventories} lists nor the concurrency, since it would need to parse the directory entries and inventories anyway, and could therefore see whether anything is still missing. The tarball would come in mtime order as it already does (but hashed_inventory needs to come first, in any case). Then you could just maintain a list of hashes that are still missing as you process the tarball; once that list is empty, you stop. The list starts with the pristine root, which is stored in hashed_inventory, and every time an item is removed from the list, anything it mentions (this is where you need to parse things) that we don't have yet (you look in the cache for those) is added to it. There are utilities in Storage.Hashed.Darcs that should help with that. Also, anything you unpack from the tarball should be linked into the cache if it's not there yet; if it is, the tarball copy should probably be ignored and you should hardlink from the cache instead (saves space).
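In (pseudo-)code, the loop I have in mind would be something like the following. writeOrHardlink, referencesOf and cacheHas are hypothetical stand-ins -- the parsing would come from Storage.Hashed.Darcs and the lookup from the cache API -- so this is a sketch of the idea, not working code:

  import qualified Codec.Archive.Tar as Tar
  import qualified Data.Set as Set
  import Control.Monad ( filterM )
  import System.FilePath.Posix ( takeFileName )

  writeOrHardlink :: Tar.Entry -> IO ()    -- unpack, hardlinking into/from the cache
  writeOrHardlink = undefined              -- assumption, not a real darcs function
  referencesOf :: Tar.Entry -> IO [String] -- hashes the object mentions
  referencesOf = undefined                 -- assumption, not a real darcs function
  cacheHas :: String -> IO Bool            -- is this hash already in the cache?
  cacheHas = undefined                     -- assumption, not a real darcs function

  -- 'missing' starts out as the singleton set holding the pristine root
  -- (taken from hashed_inventory, which must come first in the tarball).
  unpackUntilComplete :: Set.Set String -> Tar.Entries -> IO ()
  unpackUntilComplete missing entries
    | Set.null missing = return () -- everything accounted for: stop reading
    | otherwise = case entries of
        Tar.Done -> return ()
        Tar.Fail e -> fail e
        Tar.Next entry rest -> do
          let hash = takeFileName $ Tar.entryPath entry
          if hash `Set.member` missing
            then do
              writeOrHardlink entry
              refs <- referencesOf entry >>= filterM (fmap not . cacheHas)
              unpackUntilComplete
                (Set.fromList refs `Set.union` Set.delete hash missing) rest
            else unpackUntilComplete missing rest

Stopping early would presumably still have to translate into actually closing the connection, since the Entries list only pulls the download along lazily.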
>  writeCompressed Tar.Done = return ()
>  writeCompressed (Tar.Next x xs) = case Tar.entryContent x of
>    Tar.NormalFile x' _ -> do
> hunk ./src/Darcs/Repository.hs 325
>      let p = Tar.entryPath x
> -    withTemp $ \p' -> do
> -      BL.writeFile p' $ if "hashed_inventory" `isSuffixOf` p
> -        then x'
> -        else compress x'
> -      renameFile p' p
> -    writeCompressed xs
> +    ex <- doesFileExist p
> +    unless ex $ do
> +      withTemp $ \p' -> do
> +        BL.writeFile p' $ compress x'
> +        renameFile p' p
> +      writeCompressed xs

Avoids clobbering and removes the special handling of hashed_inventory (that one is now handled in writeBasic).

>    _ -> fail "Unexpected non-file tar entry"
>  writeCompressed (Tar.Fail e) = fail e
>  putInfo = when (not $ Quiet `elem` opts) . putStrLn
> hunk ./src/Darcs/Repository.hs 334
> -
> +  fetchFiles _ _ [] = return ()
> +  fetchFiles c d (f:fs) = do
> +    x <- doesFileExist $ hashedDir d </> f
> +    unless x $ do
> +      fetchFileUsingCache c d f
> +      fetchFiles c d fs
> +  hashedPatchFileName x = case extractHash x of
> +    Left _ -> fail "unexpected unhashed patch"
> +    Right h -> h

Helpers.

Yours,
Petr.

PS: I haven't had a chance to look at the performance of the patch yet. I won't apply it, at least until I hear from you regarding the above comments -- I will try to benchmark it and see whether it works at least in the usual cases. Something will nevertheless have to be done about the download of the tarball in cases where it is never used.

_______________________________________________
darcs-users mailing list
darcs-users@darcs.net
http://lists.osuosl.org/mailman/listinfo/darcs-users