On Mon, 4 Jul 2011, Chris Travers wrote:
On Mon, Jul 4, 2011 at 12:33 PM, Luke <[email protected]> wrote:

However, if we combine our two ways of looking at this, I think we have the solution. If you store files by ID, and internally reference them by ID at all times, they can all be called "foobar.pdf" and it doesn't matter.

Ok, so which foobar.pdf attached to a sales order is which one? I'd rather have some sort of unique constraint on the file name even if it isn't global. My current thinking is a uniqueness constraint between the primary attachment point and the file name. So you can only have one "contract.doc" attached to a given sales order. However, if you have

Exactly. Enforce name uniqueness at the document level: "there is already a file of that name attached to this document."

Ok, so here's the problem as I see it. We can't enforce this in Data Definition Language (DDL) without the link being part of the file table. Consequently we have to check it procedurally. I would prefer to put whatever logic can go in DDL in DDL, because this provides more pervasive enforcement. There's another side here too: suppose we have three orders, each with a file attached to it called "Contract.doc." We consolidate them.
Hadn't thought of that.
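To make the DDL point above concrete, here is a minimal sketch (table and column names are illustrative only, not the proposed schema): if the attachment link and the file name live in the same table, the "one contract.doc per sales order" rule is a plain composite constraint; if the link lives elsewhere, DDL alone cannot express it and the check has to happen in a procedure or trigger.

    -- Hypothetical names, for illustration only.
    CREATE TABLE file_attachment_sketch (
        id serial PRIMARY KEY,
        file_name text NOT NULL,
        file_class int NOT NULL,   -- what kind of document ref_key points at
        ref_key int NOT NULL,      -- e.g. the sales order id
        -- one file of a given name per attachment point:
        UNIQUE (file_class, ref_key, file_name)
    );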
How do we want to handle the file attachments? If it were up to me, I'd be displaying something like:

Contract.doc    Attached from order 42
Contract.doc    Attached from order 43
Contract.doc    Attached from order 44
Seems reasonable enough. So instead of being unique to the filename, you'd be unique to the filename and the source attachment point?
The problem I see there is: what happens if order 42 is deleted after the consolidation? To prevent that, you probably have to snapshot the data, which is not elegant at all.
That way all three can co-exist on an order without folks being confused as to what they are or which is which. It seems to me that to support order consolidation we'd still have to do the same thing (an order-attachment to order) table or else run into the conflict of names constraint, right?
It is seeming like that.

Either that, or during a consolidation or addition operation, you do automatic pre-extension numbering.
I.e., the file from order 42 becomes contract1.doc, the one from order 43 becomes contract2.doc, and the one from order 44 becomes contract3.doc.
Or OrdNumber-contract.doc, so: 42-contract.doc, 43-contract.doc, 44-contract.doc, which is probably better.
The user can possibly rename them before saving?
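If the consolidation routine does the renaming, a sketch of the "OrdNumber-contract.doc" scheme might look like this (orders.order_number, attachments.order_id, and attachments.file_name are hypothetical names, not our schema):

    UPDATE attachments a
       SET file_name = o.order_number || '-' || a.file_name
      FROM orders o
     WHERE o.id = a.order_id
       AND a.order_id IN (42, 43, 44);   -- the orders being consolidated

The user could still be offered a chance to edit the generated names before the consolidated order is saved.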
You're talking about doing it in the database, which takes us back to some structure involving a table for each document class. The way I was thinking about it (assuming the routine logic would take care of managing it) is probably too similar to how 1.2 would have done it, but your suggestion below *seems* at first look to get the job done.

I think we are looking at this from two different angles. I think you are asking how to maximally decompose the data for greatest flexibility. That's not a bad question, but I am asking how to enforce constraints using the data definition language so that there is less of a question of enforcement or confusion later. If I understand the discussion correctly, it's an argument between normalization (in higher normal forms) and constraint dependencies.
Yes.
To be clear, the tradeoff between the two approaches is that my approach is narrower in the sorts of data it accepts up front and its DDL is quite a bit more complex, which is both a good and a bad thing.
I have no objection to the DDL complexity, if the parts I want flexible can be made so. :) I still envision most of this being black-boxed into API calls or database routines, from the third-party access perspective.
A simpler DDL (in the sense of not partitioning the file table as per your proposal) provides greater flexibility. It also reduces the representation of the data to the bare minimum, therefore allowing more things to be done with the data at the application level without changing the schema. The tradeoff is that we cannot define semantic constraints on the data through DDL, meaning that the tables cannot be considered a safe API in the future.

If we go with a maximum decomposition model like yours, I would prefer to change things as follows:

CREATE TABLE file_storage (
    content bytea,
    mime_type int,
    id serial not null unique,
    PRIMARY KEY (content, mime_type)
Really? Primary key on a blob? Didn't even know you could do that.

However, why? Why do it on content, instead of generating a checksum or similar up front, and keying on that?
You dropped that from my version, but didn't say why. Have I missed something there?
);

CREATE TABLE file_attach_base (
    file_name text not null,
    file_class int references file_class(id),
    ref_key int,
    file_id int not null references file_storage(id),
    attach_id serial not null unique,
    primary key (file_name, ref_key)
);

And then inheriting out the file_attach_base table, and creating secondary linking tables as I previously suggested. This would allow breaking out the content into a separate table, but it would probably make inserts pretty expensive, because the unique index on the content would have to be updated on each insert. I doubt it would make it more costly than the method you suggest, and the positive side would be that hard links would be supported out of the box, because file names would be separate from contents.
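To sketch what "inheriting out" and the secondary linking tables could look like under that layout (file_order_attach, file_secondary_attach, the file_class value, and the oe order table are assumed names for illustration, not an agreed schema):

    -- Primary attachment point: one table per document class.
    CREATE TABLE file_order_attach (
        CHECK (file_class = 1),                   -- assumed class for orders
        FOREIGN KEY (ref_key) REFERENCES oe(id)   -- assumed order table
    ) INHERITS (file_attach_base);

    -- Secondary links let a stored file also appear on another document
    -- (e.g. an invoice consolidated from several orders) without copying it.
    CREATE TABLE file_secondary_attach (
        file_id int NOT NULL REFERENCES file_storage(id),
        source_class int NOT NULL,
        source_ref int NOT NULL,    -- e.g. the originating order
        dest_class int NOT NULL,
        dest_ref int NOT NULL,      -- e.g. the consolidated invoice
        PRIMARY KEY (file_id, dest_class, dest_ref)
    );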
Yes, I like that, and I like the design so far, I think, except for that content primary key.

Please clarify why we have to do that. Having a checksum run on uploads, and then indexing that, is going to be slower, you think?
Using the cksum utility on a 100 MB file of random data (dd'ed from /dev/urandom), on a 1.6 GHz laptop:

real    5.62
user    0.47
sys     1.95

Given post limits and such, most uploads are going to be 1-2% of that size.
Is it just that indexing the content absolutely ensures no duplicates, whereas checksumming it only probably does?
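For comparison, here is a sketch of the checksum variant I have in mind (content_md5 and the trigger are my names, purely illustrative). The digest gets indexed instead of the raw bytea, and because a digest match is only probably a duplicate, the actual contents still have to be compared procedurally before linking instead of storing:

    CREATE TABLE file_storage (
        id serial PRIMARY KEY,
        content bytea NOT NULL,
        mime_type int NOT NULL,
        content_md5 text NOT NULL
    );

    CREATE INDEX file_storage_md5_idx
        ON file_storage (content_md5, mime_type);

    -- Keep the checksum current automatically.
    CREATE OR REPLACE FUNCTION file_storage_set_md5() RETURNS trigger AS $$
    BEGIN
        NEW.content_md5 := md5(NEW.content);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER file_storage_md5_tg
        BEFORE INSERT OR UPDATE ON file_storage
        FOR EACH ROW EXECUTE PROCEDURE file_storage_set_md5();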
It would still require a table inherited from file_attach_base for handling unattached files, but that isn't hard to do. Something like:

CREATE TABLE file_import_source (
    id serial not null unique,
    label text primary key
);

insert into file_import_source(label) values ('email');
insert into file_import_source(label)
       values ('ledgersmb: unattached file addon');

CREATE TABLE file_unattached (
    CHECK (file_class = 4),   -- or whatever
    FOREIGN KEY (ref_key) REFERENCES file_import_source(id)
) INHERITS (file_attach_base);
What would happen if a file becomes unattached by virtue of document deletion, or intentional disassociation?
Triggers?
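One possible shape for that, as a sketch (it reuses the hypothetical file_order_attach and oe names from the earlier sketches): a delete trigger on the order table that re-points the attachment rows at an import source in file_unattached instead of losing them.

    CREATE OR REPLACE FUNCTION order_file_detach() RETURNS trigger AS $$
    BEGIN
        INSERT INTO file_unattached (file_name, file_class, ref_key, file_id)
        SELECT file_name, 4,
               (SELECT id FROM file_import_source
                 WHERE label = 'ledgersmb: unattached file addon'),
               file_id
          FROM file_order_attach
         WHERE ref_key = OLD.id;

        DELETE FROM file_order_attach WHERE ref_key = OLD.id;
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER order_file_detach_tg
        BEFORE DELETE ON oe    -- assumed order table
        FOR EACH ROW EXECUTE PROCEDURE order_file_detach();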
If we go this route, the tradeoff is flexibility vs. performance. To insert a file, we have to check against the content and MIME type of all existing files. With a large number of files, I think that index would get very, very large.
Agreed, which is why I suggest checksums.
a secondary link there (to be implemented in the future) to the sales invoice, the user could see "contract.doc" linked from "sales order 12345." There could only be one "contract.doc" linked from that sales order to that invoice, though there could be additional contract.doc's from other sales orders onto the same invoice.

When a new file is uploaded, compare its CRC/checksum to the index of stored files. If there's no match, it's a new file: it gets a new ID, and we save it. If the checksum matches, compare the contents to the N files with the match. If one of them matches, create a link to the existing copy, by inserting a referential entry in whatever table is tracking where files are attached. If none of them matches, it's a new file.

I am still worried about handling naming collisions in a way that is user-transparent.

Here's what I'm envisioning... There are two public-facing database routines which insert files, and one which associates them with documents. The first inserts a file and associates it with a document:

* If the name of the uploaded file is the same as one associated with this document, throw an error.
* It does the checksum-ish comparison that I spoke of above, without reference to name. If there's a match, create a link instead of storing; if not, store the file.

If we go this route and enforce everything in the procedures, we'd have to say no third-party access to the underlying tables, right?
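Roughly, the first routine could look like this in PL/pgSQL (a sketch only; it leans on the hypothetical content_md5 column and file_order_attach table from the earlier sketches, and the function name and signature are mine, not an agreed API):

    CREATE OR REPLACE FUNCTION file__attach_to_order(
        in_file_name text,
        in_mime_type int,
        in_content bytea,
        in_order_id int
    ) RETURNS int AS $$
    DECLARE
        v_file_id int;
    BEGIN
        -- Rule 1: only one file of a given name per document.
        PERFORM 1 FROM file_order_attach
          WHERE ref_key = in_order_id AND file_name = in_file_name;
        IF FOUND THEN
            RAISE EXCEPTION
                'There is already a file named % attached to order %',
                in_file_name, in_order_id;
        END IF;

        -- Rule 2: dedupe by checksum first, then full content comparison.
        SELECT id INTO v_file_id
          FROM file_storage
         WHERE content_md5 = md5(in_content)
           AND mime_type = in_mime_type
           AND content = in_content
         LIMIT 1;

        IF v_file_id IS NULL THEN
            INSERT INTO file_storage (content, mime_type)
                 VALUES (in_content, in_mime_type)
              RETURNING id INTO v_file_id;
        END IF;

        -- Either way, link the stored file to the document.
        INSERT INTO file_order_attach (file_name, file_class, ref_key, file_id)
             VALUES (in_file_name, 1, in_order_id, v_file_id);

        RETURN v_file_id;
    END;
    $$ LANGUAGE plpgsql;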
Yes. I was assuming that all file access would be handled through the procedures, or by views.
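The lock-down that implies could be sketched as follows (role handling simplified; object names are the ones from the sketches above): third parties get a view and the function, not the base tables.

    REVOKE ALL ON file_storage, file_attach_base FROM PUBLIC;

    CREATE VIEW file_attachment_list AS
    SELECT a.attach_id, a.file_name, a.file_class, a.ref_key, s.mime_type
      FROM file_attach_base a
      JOIN file_storage s ON s.id = a.file_id;

    GRANT SELECT ON file_attachment_list TO PUBLIC;
    GRANT EXECUTE ON FUNCTION
          file__attach_to_order(text, int, bytea, int) TO PUBLIC;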
So it seems like we have three options here:

1) Semantically incomplete DDL constraints, no third-party access to the underlying tables

2) Semantically complete DDL constraints, maximally decomposed tables, performance hit, but third-party access to underlying tables
Isn't your arrangement above, and in the previous message (secondary linkages), leading us toward option 2, but with option 1's private table access?
3) Semantically complete DDL constraints, no enforced uniqueness on file data, third-party access to underlying tables
#2 seems best to me, if it can be done. I would like to preserve uniqueness of files as much as possible for several reasons which I have discussed at length. That means that files should be able to have multiple names, and should be linkable to multiple documents of various types. If we can do all that in the database without relying on procedures to pull it all together, then splendid.
> Long run I am not really sure what the operational tradeoffs of such
> decomposition are, regarding remapping on the fly and enforcing uniqueness
> of data. My personal preference is to give access to tables where possible.
I agree with that, but if we have to... Do we have other tables in 1.3 where that's the case?
Luke
_______________________________________________
Ledger-smb-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ledger-smb-devel
