The release of codeq got me thinking about making my own database app for analyzing other types of files: text files, images, PDFs, etc.
There are a few issues with handling the files, though. Currently, I think it's best to store just a reference to each file in the Datomic db and keep the actual files on some computer. The problem is that files have a naming problem: files will often have the same name even though the content is different. It's also possible that a file already exists on the filesystem, and we don't want duplicates. Finally, we would like to keep using standard file management tools like scp, sftp, Nautilus, Finder, etc.

What are some good solutions for dealing with these issues? Currently, these are my thoughts:

Solution 1: Give each file an incrementing id and use it as the filename. In the database, store the SHA and the filename.

Problems: Since all the files are in one directory, listing that directory can get very large. Adding a file requires a lookup to find a free id, although in practice this probably isn't an issue since I/O will probably be the main bottleneck.

Solution 2: Compute the SHA of the file. Create a directory named after the first 4 digits, then a subdirectory named after the remaining digits, and store the file there as is, under its original filename.

Problems: More complex. The SHA must be computed.
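To make Solution 2 concrete, here is a rough sketch of what the storage step might look like. It is written in Python purely for illustration. The names `file_sha` and `store_file` are made up, and SHA-256 is an assumption on my part (the scheme works the same with any hash). Writing the SHA and path into the database is left out.

```python
import hashlib
import shutil
from pathlib import Path


def file_sha(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_file(src, store_root):
    """Copy src to store_root/<first 4 digits>/<remaining digits>/<original name>.

    Returns (sha, stored_path, is_new). If a file with the same content
    is already stored, nothing is copied and the existing path is returned.
    """
    src = Path(src)
    sha = file_sha(src)
    target_dir = Path(store_root) / sha[:4] / sha[4:]

    # Same content hashes to the same directory, so duplicates are
    # detected without consulting the database.
    if target_dir.is_dir():
        existing = sorted(p for p in target_dir.iterdir() if p.is_file())
        if existing:
            return sha, existing[0], False

    target_dir.mkdir(parents=True, exist_ok=True)
    dest = target_dir / src.name  # keep the original filename
    shutil.copy2(src, dest)
    return sha, dest, True
```

Two files that share a name but differ in content end up in different directories, while a second copy of the same content is recognized as a duplicate even if it arrives under a different name. Since the store is just ordinary directories and files, tools like scp and sftp still work on it.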
