Re: [Jchat] Fwd: Archive j

Skip Cave Sun, 15 Dec 2013 07:29:27 -0800

Brian,

The handy thing about Acrobat is that when you scan and OCR a document, the
scanned image is kept as part of the new document. This means that when
you view the doc in Acrobat, you will see the image of the original
document, with all signatures, markups, scribbled notes, etc. Acrobat
then OCRs the image and places the OCR'd text as an invisible text layer
on top of the scanned image. Acrobat tries to align the invisible OCR text
with the appropriate text in the scanned image. In this way you can select,
copy, and paste text from the PDF document while viewing the scanned image
in Acrobat. In addition, you can index the document for subsequent keyword
searches.


Tabular *data* is accurately recognized by Acrobat's OCR engine. However,
the tabular *formatting* is typically not recognized, so tabular control
characters such as tabs are not automatically inserted in the OCR'd tables
in a PDF document. Thus, if you try to copy and paste tabular data from an
OCR'd PDF doc directly to a spreadsheet, the data will be correct, but the
columnar formatting will likely be missing. A mass copy and paste of
tabular data from an OCR'd table to a spreadsheet will therefore not do
exactly what you would like: rows of numbers will probably be placed in a
single cell in the spreadsheet. You will typically need to pre-process the
copied tabular data from an OCR'd PDF to add formatting characters before
pasting into a spreadsheet.
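
As a rough illustration of that pre-processing step, here is a minimal
Python sketch. It is not something Acrobat provides, and the function name
`tabify` is made up for this example. It assumes that no cell contains a
space, so every run of whitespace in a pasted line marks a column boundary;
row labels such as "Net income" would break that assumption and need extra
handling.

```python
def tabify(pasted: str) -> str:
    """Join the whitespace-separated fields of each line with tabs.

    Assumption: no cell contains a space, so any run of whitespace
    is treated as a column boundary.
    """
    rows = []
    for line in pasted.splitlines():
        fields = line.split()              # split on any run of whitespace
        if fields:                         # drop lines that are empty or blank
            rows.append("\t".join(fields))
    return "\n".join(rows)


if __name__ == "__main__":
    # Hypothetical text as it might come out of a copy from an OCR'd PDF.
    sample = "1,204.50   310.00  98.75\n  2,010.00 415.25   101.10\n"
    print(tabify(sample))
```

Pasting the tab-separated result into a spreadsheet should then put each
field in its own cell, since most spreadsheets treat a tab as a column
separator on paste.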

The Omnipage OCR program does a bit better with tabular data, in that it at
least makes an attempt at recognizing tables, and it tries to insert the
formatting controls that will replicate any table it detects.  Omnipage
even attempts to recognize fonts, as well as styles such as italics and
bold, though it isn't always successful. Omnipage can often produce a
fairly accurate pure-text Word Document or Excel spreadsheet from a scanned
image, without keeping the scanned image as a background. Each release of
Omnipage gets a bit better at converting scanned images of a document into
a pure-text Word document, but a converted doc still usually needs a bit of
touch-up with the current version (I believe Omnipage is on V19).

Skip.






On Sat, Dec 14, 2013 at 12:47 PM, Brian Schott <[email protected]>wrote:

> Skip,
>
> I have investigated the OCR that comes with Fujitsu's scanner, called
> ScanSnap, I believe. It was pretty unsuccessful at financial docs that were
> mostly numeric, especially at maintaining the columnar format of the data.
> Do you have different experiences with Acrobat?
>
> Thanks,
>
>
> On Sat, Dec 14, 2013 at 12:51 PM, Skip Cave <[email protected]>
> wrote:
>
> > Just a comment..
> >
> > In my work I occasionally need to convert old paper documents to
> > electronic documents. I convert the documents to PDF, mainly because
> > Adobe Acrobat can automate the scanning process, and it can also OCR the
> > docs as part of the conversion process. Thus the electronic document
> > archive can be keyword searched, which can be quite useful in many cases.
> >
> > Skip
> >
> >
> > On Sat, Dec 14, 2013 at 11:21 AM, Dan Bron <[email protected]> wrote:
> >
> > > Brian - your diagnosis is spot-on.  I neglected to test the verb in a
> > > clean J session before I posted it.  I apologize.
> > >
> > > The names 'prefix' and 'suffix' on the line that assigns 'fn' should
> > > be spelled 'pfx' and 'sfx' respectively.  Corrected (and tested!) verb
> > > below.
> > >
> > > -Dan
> > >
> > > PS:  For those interested, the stack error was a cascade effect of my
> > > oversight.
> > >
> > > Since prefix and suffix were undefined, when fn was assigned, it got
> > > defined as a (meaningless) verb.  In turn, this caused pn to be defined
> > > as a verb, and subsequently redefined in terms of itself.
> > >
> > > So, when the final line was executed, J attempted to call pn as a verb
> > > on the argument ,.<'.',sfx, causing an infinite recursion, and
> > > ultimately the stack error.
> > >
> > >
> > > flipOddPages=:verb define
> > >         'comb png' flipOddPages y
> > > :
> > >         NB.y=directory to scan, x=file prefix and suffix
> > >         'pfx sfx'=.2 {. (;:^:(0=L.) x),<'png'
> > >         dir =. (,'\'-.{:)y
> > >
> > >         NB. Scan directory
> > >         fn  =. {. |: 1!:0 jpath dir,pfx,'*',sfx
> > >
> > >         NB. Zero-fill here will convert comb.png to comb 0.png
> > >         NB. The increments account for the space after comb and
> > >         NB. the dot before png respectively.
> > >         pn=.0 ".&> (1+#pfx) }.&.> (-1+#sfx)}.&.> fn
> > >
> > >         NB. Reverse order of odds
> > >         pn=.(+ 2&| * #-+:) pn
> > >
> > >         NB. Table of old-filename, new-filename
> > >         fn ,. <@;"1 (<pfx,' '),.(":&.> pn),.<'.',sfx
> > > )
> > >
> > > flipOddPages '~temp\archive1'
> > >
> > >
> > > -----Original Message-----
> > > From: [email protected]
> > > [mailto:[email protected]] On Behalf Of Brian Schott
> > > Sent: Saturday, December 14, 2013 11:56 AM
> > > To: Chat forum
> > > Subject: Re: [Jchat] Fwd: Archive j
> > >
> > > Dan,
> > >
> > > Thank you very much. That looks terrific.
> > >
> > > I am having a little trouble making it work on my Mac. Do you see the
> > > problem based on the error message below?
> > >
> > > Btw, I am suspicious of the noun `suffix` because it doesn't look like
> > > it is defined. I can debug further myself, but I thought you would like
> > > to know my progress.
> > >
> > >     'combined png' flipOddPages '/Users/brian/Documents/combined\ not'
> > > |stack error: pn
> > > |   fn,.<@;"1(<pfx,' '),.    (":&.>pn),.<'.',sfx
> > >
> > >
> > >
> > > On Sat, Dec 14, 2013 at 11:28 AM, Dan Bron <[email protected]> wrote:
> > >
> > > > Saw your message come in and have just been noodling around with the
> > > > idea of reversing only the odd pages.  Does the following help?
> > > >
> > > > flipOddPages=:verb define
> > > >         'comb png' flipOddPages y
> > > > :
> > > >         NB.y=directory to scan, x=file prefix and suffix
> > > >         'pfx sfx'=.2 {. (;:^:(0=L.) x),<'png'
> > > >         dir =. (,'\'-.{:)y
> > > >
> > > >         NB. Scan directory
> > > >         fn  =. {. |: 1!:0 jpath dir,prefix,'*',suffix
> > > >
> > > >         NB. Zero-fill here will convert comb.png to comb 0.png
> > > >         NB. The increments account for the space after comb and
> > > >         NB. the dot before png respectively.
> > > >         pn=.0 ".&> (1+#pfx) }.&.> (-1+#sfx)}.&.> fn
> > > >
> > > >         NB. Reverse order of odds
> > > >         pn=.(+ 2&| * #-+:) pn
> > > >
> > > >         NB. Table of old-filename, new-filename
> > > >         fn ,. <@;"1 (<pfx,' '),.(":&.> pn),.<'.',sfx
> > > > )
> > > >
> > > > Here, if you had a subdirectory in your J temp folder named
> > > > 'archive1', with
> > > > 10 files in it ('comb.png' plus 'comb 1.png' through 'comb 9.png'),
> > > > you'd call it like this:
> > > >
> > > >    flipOddPages '~temp\archive1'
> > > > +----------+----------+
> > > > |comb 1.png|comb 9.png|
> > > > +----------+----------+
> > > > |comb 2.png|comb 2.png|
> > > > +----------+----------+
> > > > |comb 3.png|comb 7.png|
> > > > +----------+----------+
> > > > |comb 4.png|comb 4.png|
> > > > +----------+----------+
> > > > |comb 5.png|comb 5.png|
> > > > +----------+----------+
> > > > |comb 6.png|comb 6.png|
> > > > +----------+----------+
> > > > |comb 7.png|comb 3.png|
> > > > +----------+----------+
> > > > |comb 8.png|comb 8.png|
> > > > +----------+----------+
> > > > |comb 9.png|comb 1.png|
> > > > +----------+----------+
> > > > |comb.png  |comb 0.png|
> > > > +----------+----------+
> > > >
> > > > The resulting table is a mapping of original filenames to corrected
> > > > filenames, where the order of the odd pages has been reversed (and
> > > > the first page has been made consistent by adding an index, 0).
> > > > Actually renaming the files is left as an exercise for the reader
> > > > (but isn't difficult).
> > > >
> > > > While packaging this up for re-use, I noticed it had a very familiar
> > > > structure.
> > > >
> > > >         read filenames
> > > >                 parse filenames  (drop pfx+' ', sfx+'.')
> > > >                 multiply page number by two
> > > >                                 subtract count
> > > >                         residue page number by two
> > > >                 format filenames (tack pfx+' ',sfx+'.')
> > > >         write filenames
> > > >
> > > > Now, I don't think it'd be worthwhile to rewrite the verb using
> > > > under, especially since it's got side-effects and the central &. is
> > > > a bit iffy, but I thought it interesting nonetheless.
> > > >
> > > > (Or maybe if one spends too much time with J, he begins to see unders
> > > > everywhere :)
> > > >
> > > > -Dan
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: [email protected]
> > > > [mailto:[email protected]] On Behalf Of Brian Schott
> > > > Sent: Saturday, December 14, 2013 10:00 AM
> > > > To: Chat forum
> > > > Subject: [Jchat] Fwd: Archive j
> > > >
> > > > I have been studying ways to digitally archive old paper reports
> > > > using a one-sided scanner with an automatic paperfeed. Two options
> > > > have occurred to me.
> > > >
> > > > One. Scan to a single PDF file,
> > > > Two. Scan to a folder containing multiple PNG files.
> > > >
> > > > The options are very similar except that the PDF mode seems to take
> > > > considerably more disk space.
> > > >
> > > > Both methods suffer slightly when the original documents are
> > > > two-sided because either the pages are out of order (in the PDF) or
> > > > the numbering of the individual files in the folder is out of order.
> > > >
> > > > I would like to implement a routine for the second option that would
> > > > rename the multiple PNG files. So I have developed some simple J
> > > > verbs to create the indexes of the filenames, but I do not know how
> > > > to associate those results with the original files and rename the
> > > > original files correctly in the case of two-sided originals. That is
> > > > where I would like some help, especially in knowing what part of the
> > > > task can be done in J and what part is done in unix batch files.
> > > >
> > > > Notice in the "typical" example below of four pages that the pages of
> > > > the second side are in the reverse of the correct order and that the
> > > > very first page does not have a number associated with it.
> > > >
> > > > server:Documents brian$ ls -l combined\ not
> > > > total 872
> > > > -rw-r--r--  1 brian  staff   88517 Dec 14 09:13 combined 1.png
> > > > -rw-r--r--  1 brian  staff  119690 Dec 14 09:13 combined 2.png
> > > > -rw-r--r--  1 brian  staff  129123 Dec 14 09:13 combined 3.png
> > > > -rw-r--r--  1 brian  staff  100494 Dec 14 09:13 combined.png
> > > > server:Documents brian$ ls -l combined.pdf
> > > > -rw-r--r--@ 1 brian  staff  971300 Dec 12 08:22 combined.pdf
> > > >
> > > >
> > > > Below are the J verbs for creating indexes based on the number of
> > > > files in a directory.
> > > >
> > > > nodd=: >.@-:
> > > > neven=: <.@-:
> > > > odd=: >:@+:@i.@nodd
> > > > even=: 2+|.@:+:@i.@neven
> > > > zero=: '0'&,@":"0
> > > >
> > > > Note 'some test cases'
> > > > <@:even"0] 0 1 2 3 4
> > > > <@:odd"0] 0 1 2 3 4
> > > > zero odd 4
> > > > zero even 3
> > > > zero even 0
> > > > 1&|. zero (odd,even) 4   NB. this would be most like
> > > > NB. a production case because it would
> > > > NB. match the order of the file list in the original ls
> > > > )
> > > >
> > > > Thanks,
> > > >
> > > > ---
> > > > (B=)
> > > >
> > > > ----------------------------------------------------------------------
> > > > For information about J forums see http://www.jsoftware.com/forums.htm
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > (B=) <-----my sig
> > > Brian Schott
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > >
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > >
> >
> >
> >
> > --
> > Skip Cave
> > Cave Consulting LLC
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
>
>
>
> --
> (B=) <-----my sig
> Brian Schott
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>



-- 
Skip Cave
Cave Consulting LLC
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
