Ok, let's take a step back. Here are the goals for this wad:
1) The most important and clearest goal: make the layout so that it doesn't take forever to remove the directory hierarchy (and at the same time improve rsync performance).
That is the singular goal of this wad.
Now, because I was touching code in this area, and I know we might want other layouts in the future (possible but unlikely), need to take different hash functions as input (a certainty, but the details of how we do this also aren't this wad), may want some kind of automatic cache explusion (see my arguments about why I think this is the wrong place to do it), I chose to add a second goal:
2) Centralize the insertion, lookup, and deletion of files.
Which, in my mind, eased the work of doing those other features I can see we may eventually want (but again, aren't this wad). I also thought it would add consistency to our error handling, but perhaps I'm wrong there too. If I'm wrong about this, I'm happy to remove the centralization and go back to the approach we used before. In my opinion, it'll be uglier and there will be more duplication of code, but I'm willing to do that. What I'm not willing to do is design or implement those other features in this wad. I can't see that anything I've done will make life more difficult for who ever tackles those features, in fact, I think I've made life easier for them. If you have a specific example where something I've changed will make things harder, please let me know.


[email protected] wrote:
On Tue, Sep 29, 2009 at 05:55:31PM -0700, Brock Pytlik wrote:
[snip]
Right now, I don't allow the FileManager's constructor to choose the layout. That doesn't mean it won't in the future. At one point, it was choosing the layout based on what was stored on disk, but I decided that over complicated things for no good reason at this point. Why would an image know what layouts it supports? Part of the point here is to encapsulate the file layout and not make the image deal with it, so that we can enable other functionality later if we deem it appropriate (like some form of cache clean up as you mentioned before).

I'm trying to ascertain what abstractions you're using, and determine
the modularity of the underlying design.  The constructor is part of the
FileManager object's interface.  If you're stating that the layout is
encapsulated by the FileManger, then you're telling me that layout is an
implementation detail of FileManager.  However, I don't think that's
truly the case.  The file layout is eventually going to become part of
the on-disk format, and will be used by any packaging component that
needs to interact with the filesystem.  Even if the abstraction that the
FileManager implements doesn't expose the layout, implementors who need
to make changes to the way the packaging system interacts with files
will still need to understand the interfaces of the objects that the
FileManager uses.

The image used to be the place where we kept ahold of the download
cache.  I'd like to know more about how you expect to keep layouts
modular, and properly detected at runtime.

For now, that's also not this wad. Originally, I'd planned to use a file at a known location in the file directory to specify the layout. Then I realized there was no reason to do this. If you can provide an example of why/how we'd need this, I can be convinced to put it back in. I think one thing we've tried to do is not lay track until we need it. I don't believe we'll need further layout changes anytime soon, hence I've decided not to put in lots of code to handle a case I consider unnecessary.

The two scenarios this code supports right now is that either the source
and destination are both read/write, or both read only.  I'm asking if
it's reasonable to expect a deployment scenario where the source is
readonly but the destination is read/write.  (A read/write source with
readonly destination is a degenerate case of both sides being
read-only).  In the case where the source is read-only, you can't move
the file to the new layout, but you could copy it, but I don't see
support for this case.

No, I don't think that's a reasonable scenario at this time for the file manager to support. For now, I want to keep its job narrow and constrained, not make it a general repo seeding method, or anything else.

I didn't say now. I asked if this was a reasonable deployment scenario,
irrespective of any period in time.
Ok, then no, I don't think that's ever a reasonable deployment scenario.
file_layout/layout.py:
  - Why do you allow Original to name itself "s1" by defining SHA1_160,
    but Fan256Once allows the class that instantiates it to give its
    name.  Wouldn't it be better to encode the name as a class property
    in both cases instead?
Original can only be SHA_160 since that's the old format and we're not going back to change it. Fan256Once can use SHA_256, when we add it, as well, hence it's an argument to the constructor.

  That would make it a candidate for a class propery, at least in
this case.
Yes, but the other layouts aren't amenable to a class property, so why special case this one?

One possibility is to take the approach that Shawn used with the
Publisher objects.  In his code, the class property gets the default,
but the instance property overrides the class default if the caller
specifies a different set of arguments.  This allows you to provide a
standard case, even to classes that inherit from you, but still modify
the behavior when the object is instantiated.

Again, that seems overly complex for no reason for this situation. If you really care, fine, I'll change it, but I don't think it's the right thing to do here.
[snip]
since  it's entirely internal to the file manager. It's simply there
to allow  us to know what kind of hash algorithm was used so that if
we use hash  algo 1 (which produces N bits) and hash algo 2 (which
produces N bits)  at different times, we'll be able to avoid
collisions. If you'd like to  suggest a more meaningful name, by all
means do.

I'm actually confused what you're doing with this, partially due to the
absence of documentation in the code or elsewhere.  Are you using the
label to so that if you saved the name to a file, you could re-load the
appropriate algorithm later?   Or are you using the label to identify a
function in a table of algorithms?  Or something entirely different?

I'm using the label to prevent collisions between filenames created using different hash algorithms.
It looked to me like you were going to use this as a key for a
dictionary, where the value was a hash-function.  If you're doing that,
it makes far more sense to pass an object that implements a hash
function to the layout manager.  Using the label as a enum is a C-ish
construction that we should probably avoid here.
Nope, I have no intention of using it like that.
This object doesn't check that it can use the hash type, nor is it
apparent what this heiararchy does with the value.
It doesn't check the hash input, you're correct. This is not the "support multiple hashes in our actions" wad.

I'm not talking about Actions.  I want the interface for adding hash
functions to the layout manager to be modular.

Please define modular. Let me be clear, *I am not solving the multiple hash algorithm issue.* If it's confusing the issue, I'll happily rip all notion of hash algorithm out of the implementation if that will make the scope of this wad clearer.
I'd suggest that you use a more object-oriented approach and pass an
object that knows how to perform the hash function to each layout
manager.
I don't understand.

I mean that instead of passing value "s1" to the object, you pass it a
HashSha1160 object that implements a operations described in the
AbstractHashObject.  Just as an example, it could support hash(), which
generates the secure hash, and id() which returns a identifier that
uniquely describes this type of hash, and any other operations that you
think are necessary here.
I think I've explained why this is currently overkill, and I believe it will always be overkill b/c the file manager is not ever actually computing the hash.
Moreover, it seems like changing the layout function to work more
generically would be helpful.  Instead of having it be Fan256Once, how
about FanoutUniform() with the depth and the number of directories given
as arguments to the constructor.
Because I want to keep this as small and simple as possible. If you'd like a more generic layout class, you're free to write a subclass.

I think you've missed my point.  If you write the fanout logic to be
sufficiently generic, then implementations can specify the
characteristics they desire.  The code will be resuable, and then nobody
will have to actually write the subclass you're suggesting that I write.

See previous comment about unused track. It will make the code more complex for a situation I view as unlikely to happen.
General comments:

  - You've created the file layout and file manager as part of an
    extensible architecture.  Would you provide more information about
    your design and architecture?  As an example, how would I go about
    adding a new layout to this design?
You'd add a new class to the layout file, and add it an instance of it as the first item in self.layouts in FileManager.
This isn't exactly what I'm after.  I'm more curious about your
architecture.  How are you envisioning that this will be extended?  Can
you document your design goals?  Could someone who looks at this
document figure out how to enhance the archtiecture later?  You're
checking this in as a new subsystem, but it's currently handling a
narrow set of tasks.
Yes, it's handling a narrow set of tasks, why is that an issue?

It's obvious that we'll need to add more functionality to this code
later on.  It sits in a location where many different components need to
interact with it.  The project is at the point where we need to get
things right.  The fact that you won't give me design or architecture
information, and refuse to discuss possible enhancements as, "not this
bug," makes me concerned that this hasn't been thought through
completely.
I think I've articulated my thoughts on this as much as I can.
I would assume that anyone who wanted to enhance it later would have as much clue as I do how to go about it. I'm not sure what you're after. This isn't the be-all end-all to manage the cache. At best, it's laying the ground work to allow future work to do cache management. What that'll look like? I don't know. I have real questions whether any sort of time based cache makes sense.

I'm asking that even if the code isn't complete, the design is.  That
way, when other people come along to work on this code we have artifacts
that they can use to make sure the code is consistent with our overall
design.
See above. I'm not going to design lots of hypothetical uses when all I'm providing is a VERY basic framework that centralizes access. I can't see how any part of this would be restrictive to whatever the next person wants to do. Again, if you have an example of something I'm missing, by all means present it.
You might not be implementing support for it in this bug, but if you're
designing a subsystem to replace the existing, admittedly naive,
download cache, you should think about how you plan to address these
issues.
That's not what I've signed on for. If that's necessary, I'll vastly reduce my change set and simply rewhack the bits in misc to try and be slightly better, or hand it off to someone who wants to attack the larger issue of cache management.

Don't let the idea of a design scare you.  I think that if you spent an
afternoon thinking about these issues, you'd probably have enough to
satisfy most people.  There certainly isn't any need to give up your
code and go home, especially since you have something that works.

Ok, let me try to explain my position again. I've provided some basic tools for people to use. I don't claim these tools are complete for any and all follow on work. The claim I make is that it's an improvement on what we have now.
If you haven't thought about how you expect someone to accomplish these
tasks, you should make sure that the current design doesn't preclude
intelligent cache management, because that's the very next thing that
we'll have to do with this code.
In short, I believe the right place to do this is actually in the same place we do indexing. By default, what it should do is remove the files from the download area which have been removed or changed in the new packages. Only the imageplan has this knowledge, and I don't believe there's a sane way to put that approach into the file_manager b/c it doesn't have access to the plan.

Ah, this is the kind of topic I was trying to get you to discuss before. In this example, you don't think that the imageplan would want to pass
this information to the file_manager as it completes an operation?  In
essence, giving it a list of candidates to remove if the cache is full,
or perhaps, a pre-generated list of files when a pkg houskeeping command
is run from cron?
No, I'd think like all operations, we'd want to wait to make sure we were in a known good state before doing something destructive (removing files from the cache). I think a cron approach is wrong, and represents other deficiencies in our code. I hope, if this wad ever goes back, to write a pkggc or pkg gc command which will do a cleanup of the directory. I haven't totally decided what characteristics that will have, but the current code (with the possible exception of a modification to walk I haven't decided about yet) satisfies my needs there.
In your model, the imageplan has to do something with this information,
but it probably shouldn't be managing the cache directly.  You're
suggesting that something other than the FileManager handles this?  What
problems do you see with adding this functionality to the FileManager,
perhaps in some kind of FileManager.cleanup() operation?

That's correct, I'm stating that the FileManager object is not a cache manager.
In fact, I see this work making those steps easier b/c now  there's a
central location through which well behaving code will add and  remove
files, making it easier for there to be something that has  knowledge
of the state of the layout as a whole.

Yes, I agree, and this is why I'm trying to ask more detailed questions
about the design.  Other components in the system are going to use this
code to access files, and having a well abstracted and modular interface
is important.
Yes. they're going to access the files, meaning they're going to add them, remove them, and look them up. That's it. As a bonus, you can also get a list of all the files. If you have specific components you'd like a design for, please name them.

You mentioned multiple hashes before. Briefly, this is how I'd imagine that working. The hash is self identifying. Let's say that we add SHA-256. The file_manager would be changed so that its layouts now has 3 items, starting with an instance of Fan256Once created with SHA_256 as an argument, where SHA_256 merely maps to a string like s256 (or whatever name you'd like). Woot, we're done. I think everything would then work correctly. Do I know that it would? No, because I can't make that claim without doing the work to fully design and build multiple hashes.

Brock


-j

_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss

Reply via email to