Re: [fossil-users] some questions about fossil-as-document-repo
On Fri, Jan 18, 2013 at 09:33:05AM +0800, David Baxendale (GMail - Singapore) wrote:

> I don't think Fossil is the right tool for this, take a look at Calibre (http://calibre-ebook.com/) as an Open Source document management system, not just an e-book reader.

Calibre can't handle several 100 k documents, and it can't do full text search on the document body -- at least the versions I used couldn't.

> Calibre manages your e-book/book/PDF collection and can sort the books in your library by: Title, Author, Date added, Date published, Size, Rating, Series, etc. In addition, it supports extra searchable metadata:
>
> * Tags: A flexible system for categorizing your collection however you like
> * Comments: A long form entry that you can use for book description, notes, reviews, etc
> * User fields, so you can have a revision code, or you could include the revision code in the title (probably better), for example

Only an option for small, hand-curated document stores. Imagine having to deal with 100s of millions or billions of documents. You can only process such volumes automatically.

> You can easily search your collection for a particular book. Calibre supports searching any and all of the fields mentioned above. You can construct advanced search queries by clicking the helpful Advanced search button to the left of the search bar. You can export arbitrary subsets of your collection to your hard disk arranged in a fully customizable folder structure.
>
> For group access Calibre has a built-in web server that allows you to access your collection using a simple browser from any computer anywhere in the world. It can also email your books and downloaded news to you automatically. It has support for mobile devices, so you can browse your collection and download books from your smartphone, Kindle, etc.
>
> One point to note is that the system files the documents by Author/Title on the hard disk; this is fixed and you cannot change this. However, this is not as inflexible as it sounds, because the Author could be a Client, Journal, or whatever you wish.

A good way to organize documents, short of using a real database, is to name them by the cryptographic hash of their content, and to store them in directories named by the first octet (subdirectories by the second octet, more for extremely large assemblies). You would still use a real database to find the documents.

> I use Calibre for my technical library with over 8000 technical papers and have found it an indispensable tool for managing and finding information.

Library Genesis (both content and source code freely available) currently has 0.85+ Mvolumes, and will probably be at several Mvolumes before very long. It would be a good idea if somebody would extend the libgen codebase with full-text index search of the document body.

___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
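The hash-named layout described in this message can be sketched roughly as follows. This is only an illustration, not anything Fossil, Calibre or Library Genesis is known to ship: SHA-256, the two-level default depth, and the names `content_path` and `store_document` are all assumptions made for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def content_path(store_root: Path, data: bytes, depth: int = 2) -> Path:
    """Return where a blob would live in the store.

    The file is named by the SHA-256 hex digest of its content (an assumed
    choice of hash) and placed in nested directories named after the leading
    octets of that digest, two hex characters per level.
    """
    digest = hashlib.sha256(data).hexdigest()
    levels = [digest[2 * i: 2 * i + 2] for i in range(depth)]
    return store_root.joinpath(*levels, digest)


def store_document(store_root: Path, source: Path, depth: int = 2) -> Path:
    """Copy `source` into the store and return its content-derived path."""
    data = source.read_bytes()  # fine for a sketch; stream large files instead
    target = content_path(store_root, data, depth)
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        scan = Path(tmp) / "scan-0001.pdf"  # hypothetical input file
        scan.write_bytes(b"%PDF-1.4 example bytes")
        stored = store_document(Path(tmp) / "store", scan)
        print(stored.relative_to(tmp))
```

Because the path depends only on the content, storing the same file twice is a no-op, and a separate database (as the message says) would still be needed to map titles, authors or case numbers to digests.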
Re: [fossil-users] some questions about fossil-as-document-repo
Tomek Kott tkott.li...@outlook.com writes:

> Might I suggest the following two tools as better suited for this sort of endeavor?
>
> 1) Zotero - http://www.zotero.org/

This looks very interesting, and I can see where I might find a use for it myself in my personal life. Unfortunately, I don't think it will be a solution to my original problem since, as I mentioned, the documents I'm dealing with are being retained for legal reasons--which would be problematic for a service using a third-party server. In addition, whatever repository ends up being in place, more than a dozen people will need (read-only) access to it, and installing Zotero on everybody's PC is just one more thing for an already-stressed IT staff to keep up with (as opposed to one fossil binary on one server).

> 2) PDF XChange for free OCR -

Fortunately, OCR is not an issue for us: the copiers/scanners we already have on contract have a fairly good OCR function built in.
Re: [fossil-users] some questions about fossil-as-document-repo
C. Thomas Stover c...@thomasstover.com writes:

> Well if hardcopy means scanned paper (no ocr) then it sounds like a very large binary file set.

I'm showing my ignorance, but does OCR matter in this case? We already have OCR capabilities, and I had intended to scan in the documents using it--because, why not, if you can? I didn't think to mention it in my original post to the list because I didn't think it would change the average file size significantly.
Re: [fossil-users] some questions about fossil-as-document-repo
On Thu, 17 Jan 2013 07:55:09 -0600 Carson Chittom car...@wistly.net wrote:

> C. Thomas Stover c...@thomasstover.com writes:
>
>> Well if hardcopy means scanned paper (no ocr) then it sounds like a very large binary file set.
>
> I'm showing my ignorance, but does OCR matter in this case? We already have OCR capabilities, and I had intended to scan in the documents using it--because, why not, if you can? I didn't think to mention it in my original post to the list because I didn't think it would change the average file size significantly.

Well, think about it like this. In order to get good enough detail for most purposes, these document scanners have somewhere around 600x600 dpi resolution. At first you might think monochrome would work great (and it is still used sometimes with very high-res modes), but in practice gray scale (or color) is really needed for handwriting, old paper, charts, and all sorts of applications. So the uncompressed bitmap for a single page can be quite big.

So what about image/raster data compression? Well, you either have lossless (PNG), which works great for rendered vector graphics (diagrams, screen shots, etc.), or lossy (JPEG), which uses the characteristics of the way human vision processes colors to work really well for photographs. Neither one of these works that well for generic pieces of paper. What ends up happening is that people just resize the image to a smaller resolution, which (especially for handwriting) can be self-defeating.

On the other hand, think how much space it takes for a page of UTF-8 text. Not much. So perfect OCR (which is a virtual impossibility) would take a 10+ MB bitmap and convert it into a 2 KB text file. The solution today's technology uses is a container format like PDF, in which both images and text can be stored: the scanner software/firmware will OCR what it can and then mix that with little cropped images. This of course leads to "your mileage may vary" file sizes.

-- 
C. Thomas Stover
www.thomasstover.com
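The size argument in this message can be checked with some quick arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches), which the message does not specify, and compares uncompressed bitmaps at 600 dpi with the 2 KB text file mentioned above; `bitmap_bytes` is a hypothetical helper written for this example.

```python
def bitmap_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Size in bytes of an uncompressed page bitmap."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


if __name__ == "__main__":
    # Assumed page size: US Letter, 8.5 x 11 inches, scanned at 600 dpi.
    mono = bitmap_bytes(8.5, 11, 600, 1)    # 1-bit monochrome, about 4.2 MB
    gray = bitmap_bytes(8.5, 11, 600, 8)    # 8-bit gray scale, about 33.7 MB
    color = bitmap_bytes(8.5, 11, 600, 24)  # 24-bit color, about 101 MB
    text = 2 * 1024                         # the "2 KB" page of text

    for label, size in [("monochrome", mono), ("gray scale", gray), ("color", color)]:
        print(f"{label:10s} {size / 1e6:6.1f} MB, {size // text} times the text")
```

Under these assumptions even the monochrome bitmap is roughly two thousand times larger than the text, which is why the OCR layer adds almost nothing to the file size while the image layer dominates it.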
Re: [fossil-users] some questions about fossil-as-document-repo
On Thu, 17 Jan 2013 18:53:43 +0100, C. Thomas Stover c...@thomasstover.com wrote:

> [...]
> The solution today's technology uses is a container format like PDF, in which both images and text can be stored: the scanner software/firmware will OCR what it can and then mix that with little cropped images. This of course leads to "your mileage may vary" file sizes.

just my 2c: there's also djvu (http://djvu.org/), which provides astonishingly good compression for scanned documents, separation of layers, OCR etc., and there are converters from pdf to djvu around.

otherwise I don't think that an SCM is really the suitable tool for your intended purpose (which I perceive as maintaining/backing up a list of versioned binary files): all SCMs that I know are not really good at handling big binary data sets (and delta-compression sure will not work that great...). so the repo will get real big in no time (and, for a DVCS, be copied to each and every user's account). but all the things the SCM offers (diffing, branching, merging) will _not_ work with binary data in a sensible way (I believe), and this also seems not to be what you need anyway.

so the question is: why put it under revision control at all? the meta-information provided by the checkin-messages in the timeline alone would not be a sufficient reason in my view. I could imagine that a very basic solution (use the file system and maintain a logfile (or a (sqlite?) database, or a fossil repo) of the metadata: "file xyz.version_123 has this or that content and can be found here:", plus, possibly, as already suggested, an OCR dump) is more sensible.

as said: just my 2c.
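The "very basic solution" suggested here (files on disk plus a small metadata database) could look something like the sketch below. It uses Python's standard `sqlite3` module; the table layout and the names `open_catalog`, `add_version` and `search` are assumptions invented for illustration, not an existing tool.

```python
import sqlite3

# Hypothetical schema: one row per stored version of a document.
SCHEMA = """
CREATE TABLE IF NOT EXISTS document_version (
    name      TEXT NOT NULL,
    version   INTEGER NOT NULL,
    location  TEXT NOT NULL,             -- where the binary file lives on disk
    note      TEXT NOT NULL DEFAULT '',  -- plays the role of a check-in message
    ocr_text  TEXT NOT NULL DEFAULT '',  -- optional OCR dump for searching
    added_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (name, version)
)
"""


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the metadata catalog."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_version(conn, name, location, note="", ocr_text="") -> int:
    """Record a new version of `name` and return its version number."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM document_version WHERE name = ?",
            (name,),
        ).fetchone()
        version = row[0] + 1
        conn.execute(
            "INSERT INTO document_version (name, version, location, note, ocr_text) "
            "VALUES (?, ?, ?, ?, ?)",
            (name, version, location, note, ocr_text),
        )
    return version


def search(conn, phrase):
    """Return (name, version, location) rows whose note or OCR text contains `phrase`."""
    pattern = f"%{phrase}%"
    return conn.execute(
        "SELECT name, version, location FROM document_version "
        "WHERE ocr_text LIKE ? OR note LIKE ? ORDER BY name, version",
        (pattern, pattern),
    ).fetchall()


if __name__ == "__main__":
    catalog = open_catalog()
    add_version(catalog, "xyz", "/archive/xyz.version_1.pdf", "initial scan", "case Y, action X")
    add_version(catalog, "xyz", "/archive/xyz.version_2.pdf", "rescan at higher resolution")
    print(search(catalog, "case y"))
```

A plain `LIKE` scan is enough for a few thousand documents; for larger collections a proper full-text index would be the obvious next step.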
Re: [fossil-users] some questions about fossil-as-document-repo
On Thu, 17 Jan 2013 19:48:20 +0100 Stephan Beal sgb...@googlemail.com wrote:

> FWIW: if the documents are having to be archived for legal reasons then the OCR versions are essentially only useful for convenience in searching, and not for legal purposes.

that's good information to know

On Thu, 17 Jan 2013 19:51:58 +0100 j. van den hoff veedeeh...@googlemail.com wrote:

> just my 2c: there's also djvu http://djvu.org/ which provides astonishingly good compression for scanned documents, separation of layers, OCR etc.

always good to find new things

> otherwise I don't think that a SCM is really the suitable tool for your intended purpose (which I perceive as maintaining/backing up a list of versioned binary files): all SCMs that I know are not really good at handling big binary data sets (and delta-compression sure will not work that great...). so the repo will get real big in no time

yep. I've tried this a number of ways with photos, and it just didn't work out. Although I have stored a large number of mostly text-based PDFs in an SCM before, for lack of a better tool, and it wasn't the end of the world.

Someday someone will create a tool to fill in the gap: sort of a DVCS-style metadata logging and control facility added to an rsync-style technology. Kind of like some of the interpretations of a distributed file system back in the Plan 9 lineage of thought, for instance.

C. Thomas Stover
www.thomasstover.com
Re: [fossil-users] some questions about fossil-as-document-repo
I don't think Fossil is the right tool for this, take a look at Calibre (http://calibre-ebook.com/) as an Open Source document management system, not just an e-book reader.

Calibre manages your e-book/book/PDF collection and can sort the books in your library by: Title, Author, Date added, Date published, Size, Rating, Series, etc. In addition, it supports extra searchable metadata:

* Tags: A flexible system for categorizing your collection however you like
* Comments: A long form entry that you can use for book description, notes, reviews, etc
* User fields, so you can have a revision code, or you could include the revision code in the title (probably better), for example

You can easily search your collection for a particular book. Calibre supports searching any and all of the fields mentioned above. You can construct advanced search queries by clicking the helpful Advanced search button to the left of the search bar. You can export arbitrary subsets of your collection to your hard disk arranged in a fully customizable folder structure.

For group access Calibre has a built-in web server that allows you to access your collection using a simple browser from any computer anywhere in the world. It can also email your books and downloaded news to you automatically. It has support for mobile devices, so you can browse your collection and download books from your smartphone, Kindle, etc.

One point to note is that the system files the documents by Author/Title on the hard disk; this is fixed and you cannot change this. However, this is not as inflexible as it sounds, because the Author could be a Client, Journal, or whatever you wish.

I use Calibre for my technical library with over 8000 technical papers and have found it an indispensable tool for managing and finding information.
-- 
Regards,
David Baxendale

> Message: 5
> Date: Wed, 16 Jan 2013 19:31:59 -0500
> From: Tomek Kott tkott.li...@outlook.com
> To: Fossil SCM user's discussion fossil-users@lists.fossil-scm.org
> Subject: Re: [fossil-users] some questions about fossil-as-document-repo
> Message-ID: col002-w50fb60fc6525b15acfa82af3...@phx.gbl
> Content-Type: text/plain; charset=iso-8859-1
>
> Might I suggest the following two tools as better suited for this sort of endeavor?
>
> 1) Zotero - http://www.zotero.org/
> 2) PDF XChange for free OCR - http://www.tracker-software.com/product/pdf-xchange-viewer
>
> The first is a good pdf sorter that can work in stand alone mode. You can also tag things with metadata / tags / years etc. The second is a free PDF reader that I use instead of Adobe, and recently it was updated with free OCR. In my use the OCR has actually been very good. It can place the text of the PDF behind the image, so you can select the text while viewing the original scanned copy. I do this for bills and such at home.
>
> I personally don't see fossil as the right tool for a document repo.
>
> Tomek
>
>> Date: Wed, 16 Jan 2013 16:33:09 -0600
>> From: c...@thomasstover.com
>> To: fossil-users@lists.fossil-scm.org
>> Subject: Re: [fossil-users] some questions about fossil-as-document-repo
>>
>> On Wed, 16 Jan 2013 16:11:49 -0600 Carson Chittom car...@wistly.net wrote:
>>
>>> Yes, basically, it's the probably should save for later need--mostly for legal reasons. Currently all this is in hardcopy, as I mentioned, the volume of which has reached such a level as to be simply impenetrable; part of the reason for putting them as images into a repository is simply to organize them.
>>
>> Well if hardcopy means scanned paper (no ocr) then it sounds like a very large binary file set. That sort of thing quickly gets up larger than most photo collections. The logic of the concept is sound. Report back on how it goes in practice.
>>
>> -- 
>> C. Thomas Stover
>> www.thomasstover.com
Re: [fossil-users] some questions about fossil-as-document-repo
On 01/17/2013 02:29 PM, Carson Chittom wrote:

> But these are not legal documents in the sense I think you mean--contracts, etc. Our lawyer keeps those. Our use case is more of a question of one of our staff being able to find something that documents that previously we did x in case y, so if we get case z we should also do x if y = z,

Then OCR is what you want and any OCR typos can be caught by the reader. Don't store the images. Then the diffs and compressions *will* work in the SCM.

On 01/17/2013 07:33 PM, David Baxendale (GMail - Singapore) wrote:

> Calibre manages your e-book/book/PDF collection and can sort the books in your library by: Title, Author, Date added, Date published, Size, Rating, Series, etc. In addition, it supports extra searchable metadata:
>
> * Tags: A flexible system for categorizing your collection however you like
> * Comments: A long form entry that you can use for book description, notes, reviews, etc
> * User fields, so you can have a revision code, or you could include the revision code in the title (probably better), for example

Calibre does sound good. I'm going to look into it for managing datasheets used in electronics designs.
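To illustrate the earlier point in this message that diffs work once only the OCR text is stored, here is a small sketch using Python's standard `difflib`. The two text revisions and the helper name `text_diff` are made up for the example; an SCM such as Fossil would use its own diff and delta machinery, not this code.

```python
import difflib


def text_diff(old: str, new: str, name: str = "document.txt") -> list[str]:
    """Return a unified diff between two revisions of OCR'd text."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=f"{name} (old)",
            tofile=f"{name} (new)",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical OCR output of two scans of the same memo.
    first = "In case Y we did X.\nApproved by the board."
    second = "In case Y we did X.\nApproved by the board in 2012."
    print("\n".join(text_diff(first, second)))
```

The changed line shows up directly in the diff, whereas two scanned images of the same pages would differ in nearly every byte.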
Re: [fossil-users] some questions about fossil-as-document-repo
C. Thomas Stover c...@thomasstover.com writes:

> On Tue, 15 Jan 2013 16:37:46 -0600 Carson Chittom car...@wistly.net wrote:
>
>> While I realize that this is a somewhat different emphasis than fossil's usual orientation, I have suggested to my work superiors that fossil may be usable for us as a document repository, given its (lack of) cost and its additional features we could leverage. Almost exclusively, the documents we have are binary files, primarily PDFs, as we are largely scanning in paper documents.
>
> This hits close to home. My 2¢ (or your choice of 3.5% tip) is to start with the question "Why are you doing this?". Specifically, if you are trying to have a dumping ground for random PDFs in your group that are valued around "probably should save this for later", then this approach (up to a point) is probably feasible.

Yes, basically, it's the "probably should save for later" need--mostly for legal reasons. Currently all this is in hardcopy, as I mentioned, the volume of which has reached such a level as to be simply impenetrable; part of the reason for putting them as images into a repository is simply to organize them.

> On the other hand, if you are trying to do actual collaborative work on documents, then it is absolutely critical that, in addition to a SCM system (fossil would be great), you move to a text/source based document generation technology. That is generally a much harder pill to swallow for most non-developer users expecting wysiwyg editing with magic sauce. Regardless of what efforts are expended otherwise, the result will always be failure. This I have learned is just a reality of the universe in which we live.

Unfortunately, and much to my dislike, what collaborative work we do will probably end up being done with Microsoft Word's Track Changes feature. As bad as that is, it's still better than sending what are effectively different documents back and forth, and keeping track manually of which is the latest version.
Re: [fossil-users] some questions about fossil-as-document-repo
On Wed, 16 Jan 2013 16:11:49 -0600 Carson Chittom car...@wistly.net wrote:

> Yes, basically, it's the probably should save for later need--mostly for legal reasons. Currently all this is in hardcopy, as I mentioned, the volume of which has reached such a level as to be simply impenetrable; part of the reason for putting them as images into a repository is simply to organize them.

Well if hardcopy means scanned paper (no ocr) then it sounds like a very large binary file set. That sort of thing quickly gets up larger than most photo collections. The logic of the concept is sound. Report back on how it goes in practice.

-- 
C. Thomas Stover
www.thomasstover.com
Re: [fossil-users] some questions about fossil-as-document-repo
Might I suggest the following two tools as better suited for this sort of endeavor?

1) Zotero - http://www.zotero.org/
2) PDF XChange for free OCR - http://www.tracker-software.com/product/pdf-xchange-viewer

The first is a good pdf sorter that can work in stand alone mode. You can also tag things with metadata / tags / years etc. The second is a free PDF reader that I use instead of Adobe, and recently it was updated with free OCR. In my use the OCR has actually been very good. It can place the text of the PDF behind the image, so you can select the text while viewing the original scanned copy. I do this for bills and such at home.

I personally don't see fossil as the right tool for a document repo.

Tomek

> Date: Wed, 16 Jan 2013 16:33:09 -0600
> From: c...@thomasstover.com
> To: fossil-users@lists.fossil-scm.org
> Subject: Re: [fossil-users] some questions about fossil-as-document-repo
>
> On Wed, 16 Jan 2013 16:11:49 -0600 Carson Chittom car...@wistly.net wrote:
>
>> Yes, basically, it's the probably should save for later need--mostly for legal reasons. Currently all this is in hardcopy, as I mentioned, the volume of which has reached such a level as to be simply impenetrable; part of the reason for putting them as images into a repository is simply to organize them.
>
> Well if hardcopy means scanned paper (no ocr) then it sounds like a very large binary file set. That sort of thing quickly gets up larger than most photo collections. The logic of the concept is sound. Report back on how it goes in practice.
>
> -- 
> C. Thomas Stover
> www.thomasstover.com
Re: [fossil-users] some questions about fossil-as-document-repo
C. Thomas Stover wrote:

> On the other hand, if you are trying to do actual collaborative work on documents, then it is absolutely critical that in addition to a SCM system (fossil would be great), that you move to a text/source based document generation technology.

The real problem is a lack of diff and merge tools for the particular document format. Given those, any file format can be used well in a SCM.

[The lack of such tools for formats such as MS Word and PDF simply boggles the mind. You would think that such formats were never used for serious documentation...]

Graeme Gill.