Greg seems to have caught the main points made so far (about cvs, binary files, and scale) and asks some questions, so i'm choosing to continue from Greg's message:

Quoth [EMAIL PROTECTED] (Greg A. Woods):
> How important is tracking the actual changes made to the TIFFs or the
> auxiliary images? Can you get by with simply replacing them (perhaps
> with a remark made about this replacement in the metadata file(s))?

an interesting question. in my tenure with this project, we've only been interested in "the state of this TIFF or .txt on DATE" a handful of times, and it was usually to check what was visible at the time. i'm thinking that having all the historical versions of the TIFF files may not be as important as having all the historical versions of the metadata binding all this together.

> Is the OCR'ed text and metadata kept in ASCII (or other diff-able text)
> form?

yes, the ocr'd text is plain old text, in theory containing characters as ambitious as iso-8859-1. the metadata is also text, and can contain utf8. each "page" of the ocr'd text is a separate file at this time.

> Are you able to deal with making changes only to individual top-level
> chunks at any one time?

not quite sure i understand this question, but perhaps it helps if i explain: we can make a change to any part of the corpus at any time. it might be a single text file, it could be a single TIFF file (and possibly a new version of the text to go with a vastly improved TIFF), or it could be the binding metadata.

> How important is it to allow concurrent editing of the text/metadata?

not very: there are a small number of people who operate on this data, and they are assigned pieces and parcels to work on exclusively until done. we don't currently have a system-based locking mechanism; the only "locking" is how the assignments are made (a social process amongst staff).

> CVS is clearly not suitable for tracking changes to binary data,
> especially not in the scale of your corpus. However the other parts may
> be maintained with CVS, depending on how well you can break the entire
> corpus into manageable chunks, and perhaps depending on how much you can
> afford to manipulate several copies of all these files.

can folks speculate on what makes for the largest manageable chunk for cvs? if i make each of my 3,000 top-level chunks into a cvs module, in what senses might we call that manageable or unmanageable? is 10,000 objects in a cvs module too much, or just fine?

...

Donald Sharp suggests that i keep looking. any ideas about where else to look? are there known individuals or outfits that might have expertise here?

i do really appreciate the comments made, and i get the sense that i'm not crazy in thinking that this is a large and not-simple problem.

thanks much!

cheers,
nigel

> (sounds a lot like what the guys at catalogues.google.com are doing,
> though with more exacting detail than would be necessary for searching
> scanned catalogue pages)

tangentially, i've recently been asked if we can have a "highlight the term on the page image" feature like the google catalog service has. the bar never stays in one place, and it never moves lower...

_______________________________________________
Info-cvs mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/info-cvs
