Seeing the release of codeq got me thinking about making my own 
database app for analyzing other kinds of files: text files, images, PDFs, 
etc.

There are a few issues, though, with handling the files themselves. 
Currently, I think it's best to store just a reference to each file in the 
Datomic db and keep the actual files on some computer. The problem is 
naming: files often share a name even though their content differs. It's 
also possible that a file already exists on the filesystem, and we don't 
want duplicates. Finally, we would like to keep using standard file 
management tools like scp, sftp, Nautilus, Finder, etc.

What are some good solutions to dealing with these issues?

Currently, these are my thoughts:

Solution 1: Give each file an incrementing id and use it as the 
filename. In the database, store the SHA and the filename.

Problems: The directory listing can become very large, since all the files 
are in one directory. Adding a file requires a lookup to find a free id, 
although in practice this probably isn't an issue, since I/O will probably 
be the main bottleneck.
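
To make Solution 1 concrete, here is a rough sketch in Python (picked only 
for brevity; nothing here is Datomic- or Clojure-specific). The names 
add_file and next_free_id are made up for illustration, SHA-1 is an 
assumption since I haven't settled on a hash, and the returned map simply 
stands in for whatever would be written to the database.

```python
import hashlib
import shutil
from pathlib import Path


def sha1_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never fully loaded in memory."""
    digest = hashlib.sha1()  # assumption: SHA-1; any hashlib algorithm would do
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def next_free_id(store_dir):
    """The 'lookup' step: scan the store for the highest numeric filename."""
    ids = [int(p.name) for p in Path(store_dir).iterdir() if p.name.isdecimal()]
    return max(ids, default=0) + 1


def add_file(store_dir, src):
    """Copy src into the flat store under a fresh id and return the db record."""
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    file_id = next_free_id(store)
    shutil.copyfile(src, store / str(file_id))
    # Hypothetical record shape: the id doubles as the on-disk filename.
    return {"id": file_id, "sha": sha1_of(src), "original_name": Path(src).name}
```

Two files that are both called report.txt end up stored as 1 and 2, so the 
name clash goes away. Note that this sketch does not prevent duplicates by 
itself; that would need a query on the stored SHA before copying.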

Solution 2: Compute the SHA of the file. Create a directory named after 
the first 4 digits, then a subdirectory named after the remaining digits, 
and store the file there as is, under its original filename.

Problems: More complex. The SHA must be computed.
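
A rough sketch of Solution 2, again in Python just for brevity. The 
function name store_by_sha is made up, SHA-1 is an assumption, and treating 
"the content directory already has a file in it" as "duplicate" is my own 
reading of how deduplication would work here.

```python
import hashlib
import shutil
from pathlib import Path


def sha1_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never fully loaded in memory."""
    digest = hashlib.sha1()  # assumption: SHA-1; any hashlib algorithm would do
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_by_sha(store_dir, src):
    """Store src at <store>/<first 4 digits>/<remaining digits>/<original name>.

    Returns (sha, stored_path, created). created is False when a file with
    the same content was already stored, in which case nothing is copied.
    """
    src = Path(src)
    sha = sha1_of(src)
    target_dir = Path(store_dir) / sha[:4] / sha[4:]
    if target_dir.is_dir():
        existing = sorted(target_dir.iterdir())
        if existing:
            # Same SHA means same content, whatever the file was called.
            return sha, existing[0], False
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / src.name
    shutil.copyfile(src, target)
    return sha, target, True
```

The stored file keeps its original name, so scp, a file browser and friends 
still show something readable, and the database only needs the SHA to find 
the directory again.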
