Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-18 Thread Eugen Leitl
On Fri, Jan 18, 2013 at 09:33:05AM +0800, David Baxendale (GMail - Singapore) 
wrote:
 I don't think Fossil is the right tool for this; take a look at Calibre
 (http://calibre-ebook.com/) as an open-source document management
 system, not just an e-book reader.

Calibre can't handle several hundred thousand documents, and it can't
do full-text search on the document body -- at least the
versions I used couldn't.

 Calibre manages your e-book/book/PDF collection and can sort the books  
 in your library by: Title, Author, Date added, Date published, Size,  
 Rating, Series, etc. In addition, it supports extra searchable metadata:

  * Tags: A flexible system for categorizing your collection however you
like
  * Comments: A long form entry that you can use for book description,
notes, reviews, etc
  * User fields, so you can have a revision code, or you could include
the revision code in the title (probably better), for example

Only an option for small, hand-curated document stores.
Imagine having to deal with 100s of millions or billions
of documents. You can only process such volumes automatically.

 You can easily search your collection for a particular book. Calibre  
 supports searching any and all of the fields mentioned above. You can  
 construct advanced search queries by clicking the helpful Advanced  
 search button to the left of the search bar.

 You can export arbitrary subsets of your collection to your hard disk  
 arranged in a fully customizable folder structure.

 For group access Calibre has a built-in web server that allows you to  
 access your collection using a simple browser from any computer anywhere  
 in the world. It can also email your books and downloaded news to you  
 automatically. It has support for mobile devices, so you can browse your  
 collection and download books from your smartphone, Kindle, etc.

 One point to note is that the system files the documents by Author/Title on
 the hard disk; this is fixed and you cannot change it. However, this
 is not as inflexible as it sounds, because the Author could be a Client,
 Journal, or whatever you wish.

A good way to organize documents, short of using a real database,
is to name them by the cryptographic hash of their content, and
to store them in directories named by the first octet (with subdirectories
named by the second octet, and more levels for extremely large collections).

You would still use a real database to find the documents.
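
The layout described above could be sketched roughly as follows. This is only an illustration under assumptions not stated in the message: SHA-256 as the hash, two directory levels, and the hypothetical helper names `content_path` and `store`.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def content_path(root: Path, data: bytes) -> Path:
    """Return the storage path for `data` under `root`.

    The SHA-256 hex digest is the file name; the first and second
    octets (two hex characters each) name the two directory levels.
    """
    digest = hashlib.sha256(data).hexdigest()
    return root / digest[:2] / digest[2:4] / digest


def store(root: Path, source: Path) -> Path:
    """Copy `source` into the store and return where it landed."""
    target = content_path(root, source.read_bytes())
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(source, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        doc = Path(tmp) / "scan.pdf"
        doc.write_bytes(b"example document body")
        print(store(Path(tmp) / "store", doc))
```

Because the name is derived from the content, storing the same document twice yields the same path, so duplicates collapse for free. Titles, authors and other metadata would still live in the separate database.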

 I use Calibre for my technical library with over 8000 technical papers  

Library Genesis (both content and source code freely available)
currently holds more than 0.85 million volumes, and will probably be at
several million volumes before very long.

It would be a good idea if somebody extended the libgen codebase
with full-text indexed search of the document body.

 and have found it an indispensable tool for managing and finding  
 information.
___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users


Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-17 Thread Carson Chittom
Tomek Kott tkott.li...@outlook.com writes:

 Might I suggest the following two tools as better suited for this sort
 of endeavor?

 1) Zotero - http://www.zotero.org/ 

This looks very interesting, and I can see where I might find a use for
it myself in my personal life.  Unfortunately, I don't think it will be
a solution to my original problem since, as I mentioned, the documents
I'm dealing with are being retained for legal reasons--which would be
problematic for a service using a third-party server.  In addition,
whatever repository ends up being in place, more than a dozen people
will need (read-only) access to it, and installing Zotero on everybody's
PC is just one more thing for an already-stressed IT staff to keep up
with (as opposed to one fossil binary on one server).

 2) PDF XChange for free OCR -

Fortunately, OCR is not an issue for us: the copiers/scanners we already
have on contract have a fairly good OCR function built in.



Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-17 Thread Carson Chittom
C. Thomas Stover c...@thomasstover.com writes:

 Well if hardcopy means scanned paper (no ocr) then it sounds like a
 very large binary file set. 

I'm showing my ignorance, but does OCR matter in this case?  We already
have OCR capabilities, and I had intended to scan in the documents using
it--because, why not, if you can?  I didn't think to mention it in my
original post to the list because I didn't think it would change the
average file size significantly.




Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-17 Thread C. Thomas Stover
On Thu, 17 Jan 2013 07:55:09 -0600
Carson Chittom car...@wistly.net wrote:

 C. Thomas Stover c...@thomasstover.com writes:
 
  Well if hardcopy means scanned paper (no ocr) then it sounds like a
  very large binary file set. 
 
 I'm showing my ignorance, but does OCR matter in this case?  We
 already have OCR capabilities, and I had intended to scan in the
 documents using it--because, why not, if you can?  I didn't think to
 mention it in my original post to the list because I didn't think it
 would change the average file size significantly.
 
 

Well, think about it like this. In order to get good enough detail for
most purposes, these document scanners have somewhere around 600x600 dpi
resolution. At first you might think monochrome would work great (and
it is still used sometimes with very high-res modes), but in practice
grayscale (or color) is really needed for handwriting, old paper,
charts, and all sorts of applications. So the uncompressed bitmap for
a single page can be quite big.
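
To put rough numbers on "quite big", here is a small back-of-the-envelope calculation. The page size (US Letter, 8.5 x 11 inches) and the bit depths are assumptions chosen for illustration; only the 600 dpi figure comes from the message.

```python
def raw_page_bytes(width_in: float, height_in: float,
                   dpi: int, bits_per_pixel: int) -> int:
    """Size in bytes of one uncompressed scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


if __name__ == "__main__":
    # Assumed page: US Letter at 600 dpi, i.e. 5100 x 6600 pixels.
    for label, bpp in [("monochrome", 1), ("grayscale", 8), ("color", 24)]:
        size = raw_page_bytes(8.5, 11, 600, bpp)
        print(f"{label:10s} {size / 1e6:6.1f} MB")
```

Under these assumptions a single grayscale page is about 33.7 MB uncompressed and a color page about 101 MB, before any compression is applied.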

So what about image/raster data compression? Well, you either have
lossless (PNG), which works great for rendered vector graphics
(diagrams, screenshots, etc.), or lossy (JPEG), which uses the
characteristics of the way human vision processes colors to really
work great for photographs. Neither one of these works that well for
generic pieces of paper. What ends up happening is people just do an
image resize to a smaller resolution, which (especially for
handwriting) can be self-defeating.

On the other hand, think how much space it takes for a page of UTF-8
text. Not much. So perfect OCR (which is a virtual impossibility) would
take a 10+ MB bitmap and convert it into a 2 KB text file. The solution
today's technology uses is a container format like PDF, where
both images and text can be stored: the scanner software/firmware will
OCR what it can and then mix that with little cropped images. This of
course leads to "your mileage may vary" file sizes.

-- 
C. Thomas Stover
www.thomasstover.com




Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-17 Thread j. van den hoff
just my 2c:
there's also djvu (http://djvu.org/), which provides astonishingly good
compression for scanned documents, separation of layers, OCR etc., and
there are converters from pdf to djvu around.

otherwise I don't think that an SCM is really the suitable tool for your
intended purpose (which I perceive as maintaining/backing up a list of
versioned binary files): all SCMs that I know are not really good at
handling big binary data sets (and delta-compression sure will not work
that great...). so the repo will get real big in no time (and, for a DVCS,
be copied to each and every user's account). but all the things the SCM
offers (diffing, branching, merging) will _not_ work with binary data in a
sensible way (I believe), and this also seems not to be what you need
anyway. so the question is: why put it under revision control at all?
the meta-information provided by the checkin messages in the timeline
alone would not be a sufficient reason in my view.

I could imagine that a very basic solution is more sensible: use the file
system and maintain a logfile (or a (sqlite?) database, or a fossil repo)
of the metadata (file xyz.version_123 has this or that content and can be
found here, plus possibly, as already suggested, an OCR dump).
as said: just my 2c.
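
The "file system plus metadata database" idea could look roughly like the following sketch. The table layout, the column names and the helper functions `open_catalog`, `record` and `search` are hypothetical, invented here for illustration; the files themselves stay on disk and only their description goes into SQLite.

```python
import sqlite3

# Hypothetical schema: one row per stored version of a document.
SCHEMA = """
CREATE TABLE IF NOT EXISTS document (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    version  INTEGER NOT NULL,
    path     TEXT NOT NULL,   -- where the file lives on disk
    note     TEXT,            -- free-form description of the content
    ocr_text TEXT,            -- optional OCR dump, for searching only
    UNIQUE (name, version)
);
"""


def open_catalog(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the metadata catalog."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def record(conn, name, version, path, note="", ocr_text=""):
    """Log one file version; the file itself is not touched."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO document (name, version, path, note, ocr_text) "
            "VALUES (?, ?, ?, ?, ?)",
            (name, version, path, note, ocr_text),
        )


def search(conn, phrase):
    """Return (name, version, path) rows whose note or OCR text match."""
    pattern = f"%{phrase}%"
    rows = conn.execute(
        "SELECT name, version, path FROM document "
        "WHERE ocr_text LIKE ? OR note LIKE ? ORDER BY name, version",
        (pattern, pattern),
    )
    return rows.fetchall()
```

A plain `LIKE` scan is enough for a small archive; a much larger one would presumably want a proper full-text index instead.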




Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-17 Thread C. Thomas Stover
On Thu, 17 Jan 2013 19:48:20 +0100
Stephan Beal sgb...@googlemail.com wrote:

 FWIW: if the documents are having to be archived for legal reasons
 then the OCR versions are essentially only useful for convenience in
 searching, and not for legal purposes.

that's good information to know

On Thu, 17 Jan 2013 19:51:58 +0100
j. van den hoff veedeeh...@googlemail.com wrote:


 just my 2c:
 there's also djvu http://djvu.org/ which provides astonishingly good  
 compression for scanned documents, separation of layers, OCR etc.

always good to find new things


 otherwise I don't think that a SCM is really the suitable tool for
 your intended purpose (which I perceive as maintaining/backing up a
 list of versioned binary files): all SCMs that I know are not really
 good at handling big binary data sets (and delta-compression sure
 will not work that great...). so the repo will get real big in no time

yep. I've tried this a number of ways with photos, and it just didn't
work out. Although I have stored a large number of mostly text-based PDFs
in an SCM before for lack of a better tool, and it wasn't the end of the
world.

Someday someone will create a tool to fill in the gap. Sort of a DVCS-style
metadata logging and control facility attached to an rsync-style
technology. Kind of like some of the interpretations of a distributed
file system back in the Plan 9 lineage of thought, for instance.

C. Thomas Stover
www.thomasstover.com




Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-17 Thread David Baxendale (GMail - Singapore)
I don't think Fossil is the right tool for this; take a look at Calibre
(http://calibre-ebook.com/) as an open-source document management
system, not just an e-book reader.


Calibre manages your e-book/book/PDF collection and can sort the books 
in your library by: Title, Author, Date added, Date published, Size, 
Rating, Series, etc. In addition, it supports extra searchable metadata:


 * Tags: A flexible system for categorizing your collection however you
   like
 * Comments: A long form entry that you can use for book description,
   notes, reviews, etc
 * User fields, so you can have a revision code, or you could include
   the revision code in the title (probably better), for example

You can easily search your collection for a particular book. Calibre 
supports searching any and all of the fields mentioned above. You can 
construct advanced search queries by clicking the helpful Advanced 
search button to the left of the search bar.


You can export arbitrary subsets of your collection to your hard disk 
arranged in a fully customizable folder structure.


For group access Calibre has a built-in web server that allows you to 
access your collection using a simple browser from any computer anywhere 
in the world. It can also email your books and downloaded news to you 
automatically. It has support for mobile devices, so you can browse your 
collection and download books from your smartphone, Kindle, etc.


One point to note is that the system files the documents by Author/Title on
the hard disk; this is fixed and you cannot change it. However, this
is not as inflexible as it sounds, because the Author could be a Client,
Journal, or whatever you wish.


I use Calibre for my technical library with over 8000 technical papers 
and have found it an indispensable tool for managing and finding 
information.


--


Regards,



David Baxendale

 





Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-17 Thread John Griessen

On 01/17/2013 02:29 PM, Carson Chittom wrote:

But these are not legal documents in the sense I
think you mean--contracts, etc.  Our lawyer keeps those.  Our use case is more
of a question of one of our staff being able to find something that
documents that previously we did x in case y, so if we get case z we
should also do x if y = z,


Then OCR is what you want, and any OCR typos can be caught by the reader.
Don't store the images.
Then the diffs and compression *will* work in the SCM.
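
As a rough illustration of why plain OCR text behaves well under revision control, the sketch below diffs two revisions of a text file with Python's standard `difflib`. The helper name `text_diff` and the sample text are made up for the example; an SCM would produce a comparable line-based diff on its own.

```python
import difflib


def text_diff(old: str, new: str, name: str = "document.txt") -> str:
    """Return a unified diff between two revisions of a text document."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"{name} (rev 1)",
        tofile=f"{name} (rev 2)",
    )
    return "".join(lines)


if __name__ == "__main__":
    rev1 = "In case y we did x.\nFiled 2012.\n"
    rev2 = "In case y we did x.\nFiled 2013.\n"
    print(text_diff(rev1, rev2))
```

Only the changed line shows up in the diff, which is also what lets delta compression keep the repository small; two scans of the same page would differ almost everywhere at the byte level.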

On 01/17/2013 07:33 PM, David Baxendale (GMail - Singapore) wrote:

 Calibre manages your e-book/book/PDF collection and can sort the books
 in your library by: Title, Author, Date added, Date published, Size,
 Rating, Series, etc. In addition, it supports extra searchable metadata:

   * Tags: A flexible system for categorizing your collection however you
     like
   * Comments: A long form entry that you can use for book description,
     notes, reviews, etc
   * User fields, so you can have a revision code, or you could include
     the revision code in the title (probably better), for example



Calibre does sound good.  I'm going to look into it for managing
datasheets used in electronics designs.


Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-16 Thread Carson Chittom
C. Thomas Stover c...@thomasstover.com writes:

 On Tue, 15 Jan 2013 16:37:46 -0600
 Carson Chittom car...@wistly.net wrote:

 While I realize that this is a somewhat different emphasis than
 fossil's usual orientation, I have suggested to my work superiors
 that fossil may be usable for us as a document repository, given its
 (lack of) cost and its additional features we could leverage.  Almost
 exclusively, the documents we have are binary files, primarily PDFs,
 as we are largely scanning in paper documents.

 This hits close to home. My 2¢ (or your choice of 3.5% tip) is to start
 with the question "Why are you doing this?". Specifically, if you are
 trying to have a dumping ground for random PDFs in your group that are
 valued around "probably should save this for later", then this approach
 (up to a point) is probably feasible.

Yes, basically, it's the "probably should save for later" need--mostly
for legal reasons.  Currently all this is in hardcopy, as I mentioned,
the volume of which has reached such a level as to be simply
impenetrable; part of the reason for putting them as images into a
repository is simply to organize them.

 On the other hand, if you are trying to do actual collaborative work on
 documents, then it is absolutely critical that in addition to a SCM
 system (fossil would be great), that you move to a text/source based
 document generation technology. That is generally a much harder pill to
 swallow for most non-developer users expecting wysiwyg editing with
 magic sauce. Regardless of what efforts are expended otherwise, the
 result will always be failure. This I have learned is just a reality of
 the universe in which we live.

Unfortunately, and much to my dislike, what collaborative work we do
will probably end up being done with Microsoft Word's Track Changes
feature.  As bad as that is, it's still better than sending what are
effectively different documents back and forth, and keeping track
manually of which is the latest version.




Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-16 Thread C. Thomas Stover
On Wed, 16 Jan 2013 16:11:49 -0600
Carson Chittom car...@wistly.net wrote:

 Yes, basically, it's the "probably should save for later" need--mostly
 for legal reasons.  Currently all this is in hardcopy, as I mentioned,
 the volume of which has reached such a level as to be simply
 impenetrable; part of the reason for putting them as images into a
 repository is simply to organize them.

Well if hardcopy means scanned paper (no ocr) then it sounds like a
very large binary file set. That sort of thing quickly gets up larger
than most photo collections. The logic of the concept is sound. Report
back on how it goes in practice.

-- 
C. Thomas Stover
www.thomasstover.com




Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-16 Thread Tomek Kott
Might I suggest the following two tools as better suited for this sort of 
endeavor?

1) Zotero - http://www.zotero.org/ 
2) PDF XChange for free OCR - 
http://www.tracker-software.com/product/pdf-xchange-viewer 

The first is a good pdf sorter that can work in stand alone mode. You can 
also tag things with metadata / tags / years etc. 

The second is a free PDF reader that I use instead of Adobe, and recently it 
was updated with free OCR. In my use the OCR has actually been very good. It 
can place the text of the PDF behind the image, so you can select the text 
while viewing the original scanned copy. I do this for bills and such at home. 

I personally don't see fossil as the right tool for a document repo.

Tomek



Re: [fossil-users] some questions about fossil-as-document-repo

2013-01-16 Thread Graeme Gill
C. Thomas Stover wrote:
 On the other hand, if you are trying to do actual collaborative work on
 documents, then it is absolutely critical that in addition to a SCM
 system (fossil would be great), that you move to a text/source based
 document generation technology. 

The real problem is a lack of diff & merge tools for the particular
document format. Given those, any file format can be used well in
a SCM. [The lack of such tools for formats such as MS Word and PDF
simply boggles the mind. You would think that such formats were never
used for serious documentation...]

Graeme Gill.
