On Sat, 2 Jul 2011, Chris Travers wrote:

> On Sat, Jul 2, 2011 at 1:16 PM, Luke <[email protected]> wrote:
>
>> Probably though, as I think about it, this would require globally unique
>> filenames, and a name comparison with new uploads, possibly followed by a
>> content comparison if names match.
>> I'm not sure globally unique filenames are such a bad idea anyway.
>
> There's a fairly nasty case here that you can run into.  If globally
> unique file names are required, then how do you know in advance what
> sort of names are used?  Do we want to expect the users of the system
> to all come up with naming conventions that avoid collisions?

I was expecting that, yes.  However, I shouldn't.  My recent experience is 
with reasonably disciplined corporate users, who either get files from 
sources whose names are likely to be unique (some form of the vendor name 
and vendor's ID), create files for customers/vendors with names of the same 
type, or are good at storing files with rather long, descriptive, and 
accidentally unique names.

However, if we combine our two ways of looking at this, I think we have 
the solution.

If you store files by ID, and internally reference them by ID at all 
times, they can all be called "foobar.pdf" and it doesn't matter.

When a new file is uploaded, compare its CRC/checksum to the index of 
stored files.  If there's no match, it's a new file, gets a new ID, and we 
save it.  If the checksum matches, compare the contents against each of the 
N stored files with that checksum.  If one of them matches, create a link 
to the existing copy by inserting a referential entry in whatever table is 
tracking where files are attached.  If none of them matches, it's a new 
file.
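
A minimal sketch of the steps above, in Python purely for illustration.  
Everything here is an assumption rather than anything in the codebase: the 
FileStore class, its store() method, and the in-memory dicts standing in 
for the file table and checksum index are hypothetical, and SHA-256 is 
swapped in where I said "CRC/checksum".

```python
import hashlib
from collections import defaultdict


class FileStore:
    """In-memory stand-in for the file table (hypothetical names throughout)."""

    def __init__(self):
        self._contents = {}                    # file ID -> stored bytes
        self._by_checksum = defaultdict(list)  # checksum -> list of file IDs
        self._next_id = 1

    def store(self, data):
        """Return (file_id, is_new) for the uploaded bytes."""
        # Assumption: SHA-256 plays the role of the checksum index key.
        checksum = hashlib.sha256(data).hexdigest()

        # Checksum hit: confirm with a full content comparison against
        # each stored file that shares the checksum.
        for file_id in self._by_checksum[checksum]:
            if self._contents[file_id] == data:
                return file_id, False  # link to the existing copy

        # No checksum match, or no candidate had identical contents:
        # it's a new file and gets a new ID.
        file_id = self._next_id
        self._next_id += 1
        self._contents[file_id] = data
        self._by_checksum[checksum].append(file_id)
        return file_id, True
```

Note that the filename never enters into it, so two different uploads both 
called "foobar.pdf" simply get two IDs, and a second upload of identical 
bytes gets the existing ID back.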

I'm pushing this because I think it's more extensible, and it also leads 
directly to what Erik wanted.

If you divorce the storage of files, and the way they are tracked, from 
the documents to which they are attached, you get a true virtual 
filesystem.  Any document can point to any file(s), and any file can be 
pointed to by one, several, or no documents.

Associations can be re-mapped after file storage (this assumes a file 
management UI at some point), which is necessary for Erik's suggestion.
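
To make the association idea concrete, here is a rough sketch, again in 
Python and again under assumptions: the Attachments class and its method 
names are hypothetical, and a set of (document_id, file_id) pairs stands in 
for whatever link table would actually track attachments.

```python
class Attachments:
    """Many-to-many links between documents and stored files (hypothetical)."""

    def __init__(self):
        self._links = set()  # each entry is a (document_id, file_id) pair

    def attach(self, document_id, file_id):
        self._links.add((document_id, file_id))

    def detach(self, document_id, file_id):
        self._links.discard((document_id, file_id))

    def remap(self, file_id, old_document_id, new_document_id):
        """Move one association; the stored file itself is untouched."""
        self.detach(old_document_id, file_id)
        self.attach(new_document_id, file_id)

    def files_for(self, document_id):
        return sorted(f for d, f in self._links if d == document_id)

    def documents_for(self, file_id):
        return sorted(d for d, f in self._links if f == file_id)
```

A file with no entries at all is still stored, which is the "pointed to by 
no documents" case, and remapping only ever edits the links.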

Luke

_______________________________________________
Ledger-smb-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ledger-smb-devel
