Re: [MediaWiki-l] [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

Brion Vibber Wed, 17 Jul 2013 08:38:57 -0700

I'm not sure his attitude will encourage people to work with him to his
specifications.


-- brion




On Wed, Jul 17, 2013 at 8:12 AM, David Cuenca <[email protected]> wrote:

> I'm forwarding this message by George Orwell III on en-ws [1]. I think it
> is extremely important as it offers an insight about what is wrong with
> Djvu handling on Wikisource.
>
>
> "We/you are losing the X-min, Y-min, X-Max & Y-max (mapping coordinates)
> because the original PHP contributing a-hole for the DjVu routine on our
> servers never bothered to finish the part where the internal DjVu text
> layer is converted to a (coordinate rich) XML file using the existing
> DjVuLibre software package because, at the time, the software had issues.
>
> "That faulty DjVuLibre version was the equivalent of 4,317 versions ago and
> the issue has been long fixed now EXCEPT that the .DTD file needed to base
> the plain-text to XML conversion on still has the wrong 'folder path' on
> local DjVuLibre installs (if this is true on server installs as well, I
> cannot say for sure). Once I copied the folder to the [wrong] folder path,
> I was able to generate the XMLs all day long. These XMLs are just like the
> ones IA generates during their process (in addition to the XML that ABBYY
> generates for them).
>
> "So it's not that we as a community decided not to follow through with
> (coordinate rich) XML generation but got stuck with the plain-text dump
> workaround due to a DjVuLibre problem that no longer exists. Plus, the guy
> who created the beginnings of this fabulous disaster was like a tick with
> an attention span deficit and moved on to conjuring up some other blasted
> thing or another instead of following up on his own workaround & finishing
> the XML coding portion once the DjVuLibre glitch was fixed." -- 15:16, 15
> July 2013 (UTC)
>
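
The coordinates described in the quoted message can be illustrated with a small sketch. It assumes the lisp-like text layer form that DjVuLibre's `djvused -e print-txt` prints, where each entry reads `(word xmin ymin xmax ymax "text")`. The sample string and the function name `words_with_boxes` are made up for illustration; this is not the code used by the extension.

```python
import re

# One '(word xmin ymin xmax ymax "text")' entry of the lisp-like text layer.
WORD_RE = re.compile(
    r'\(word\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+"((?:[^"\\]|\\.)*)"\s*\)'
)


def words_with_boxes(layer):
    """Return (text, (xmin, ymin, xmax, ymax)) for every word entry.

    Backslash escapes inside the quoted text are left as they are.
    """
    result = []
    for match in WORD_RE.finditer(layer):
        xmin, ymin, xmax, ymax = (int(g) for g in match.groups()[:4])
        result.append((match.group(5), (xmin, ymin, xmax, ymax)))
    return result


# Hypothetical sample, shaped like a one-line page.
SAMPLE = '''(page 0 0 2550 3300
 (line 376 3036 1100 3119
  (word 376 3036 520 3119 "Hello")
  (word 560 3036 1100 3119 "world")))'''

for text, box in words_with_boxes(SAMPLE):
    print(text, box)
```

A plain-text dump keeps only the strings; the four integers per word are what gets lost.
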
>
> [1]
>
> http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
>
> On Wed, Jul 17, 2013 at 6:57 AM, Alex Brollo <[email protected]>
> wrote:
>
> > Just a brief comment about the djvu text layer, using IA files to dig
> > deeper into the topic.
> >
> > FineReader OCR stores incredibly detailed information in a proprietary
> > format; then, various FineReader versions export part of this
> > extremely rich set of information into different outputs - one of them
> > being the djvu text layer. It's worth noting that even if any information
> > stored in the djvu text layer can be extracted and used, the set of
> > information wrapped into the djvu text layer (whether in lisp-like format
> > or in xml format) is only a minor subset of the original OCR information.
> >
> > If someone is interested in getting much more information, they can find
> > it in the abbyy.xml output; Internet Archive provides it as abbyy.gz in
> > the list of exportable files. It's a very heavy and complex xml structure
> > but it is possible to parse it, and to extract from it any information
> > wrapped into the djvu text layer and much more - most interestingly,
> > wordPenalty, that is, word by word, a summary of the degree of
> > uncertainty of the OCR recognition of the whole word.
> >
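
A minimal sketch of the abbyy.xml parsing described above. It assumes that each character sits in a `charParams` element carrying `wordStart` and `wordPenalty` attributes, nested under `line` elements; the namespace in the sample is a placeholder and the sample data is invented. Tags are matched by local name so the real namespace does not matter.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def words_with_penalty(xml_text):
    """Return (word, wordPenalty) pairs, taking the penalty from the
    first character of each word (assumed to be the same for the whole word)."""
    root = ET.fromstring(xml_text)
    words = []
    for line in (e for e in root.iter() if local(e.tag) == "line"):
        current = None
        for ch in (e for e in line.iter() if local(e.tag) == "charParams"):
            text = ch.text or ""
            if text.strip() == "":
                current = None  # whitespace ends the current word
                continue
            if current is None or ch.get("wordStart") == "true":
                current = {"text": "", "penalty": int(ch.get("wordPenalty", "0"))}
                words.append(current)
            current["text"] += text
    return [(w["text"], w["penalty"]) for w in words]


# Hypothetical sample; "urn:example:abbyy" is a placeholder namespace.
SAMPLE = (
    '<document xmlns="urn:example:abbyy">'
    '<page><block><text><par><line><formatting>'
    '<charParams l="10" t="5" r="20" b="30" wordStart="true" wordPenalty="0">T</charParams>'
    '<charParams l="21" t="5" r="30" b="30" wordStart="false" wordPenalty="0">o</charParams>'
    '<charParams l="31" t="5" r="40" b="30" wordStart="false" wordPenalty="0"> </charParams>'
    '<charParams l="41" t="5" r="50" b="30" wordStart="true" wordPenalty="36">b</charParams>'
    '<charParams l="51" t="5" r="60" b="30" wordStart="false" wordPenalty="36">c</charParams>'
    '</formatting></line></par></text></block></page></document>'
)

print(words_with_penalty(SAMPLE))
```

For a real abbyy.gz download, the text would first have to be decompressed, for instance with Python's `gzip` module, assuming the file is gzip-compressed XML as its extension suggests.
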
> > We (Aarti and I) are digging into this mess, with fast preliminary
> > results; in [[it:w:Utente:Alex brollo/Sandbox]] you can see some brief
> > pieces of text extracted from abbyy.gz, where doubtful words (in the
> > opinion of the OCR software) are red. They can be easily managed by
> > VisualEditor - they come from a simple span tag.
> >
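
The red marking of doubtful words can be sketched as follows. The threshold of 20, the inline style, and the function name `mark_doubtful` are assumptions for illustration, as is the reading that a higher wordPenalty means a less certain word; the actual markup on the sandbox page may differ.

```python
from html import escape


def mark_doubtful(words, threshold=20):
    """Join (word, penalty) pairs into text, wrapping words whose penalty
    exceeds the (hypothetical) threshold in a red span."""
    parts = []
    for text, penalty in words:
        safe = escape(text)
        if penalty > threshold:
            parts.append('<span style="color:red">' + safe + '</span>')
        else:
            parts.append(safe)
    return " ".join(parts)


print(mark_doubtful([("The", 0), ("qnick", 36), ("fox", 3)]))
```
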
> > Now, I'm waiting for Aarti's work; as soon as a VisualEditor for nsPage
> > is running, it will be possible to extract text by bot from abbyy.gz (if
> > the work comes from IA) and to upload such text as OCR.
> >
> > Alex
> >
> >
> >
> > 2013/7/16 David Cuenca <[email protected]>
> >
> >> Hi Aubrey,
> >> Thanks for the heads-up, I have CC'ed Sébastien from fr-ws, he worked on
> >> the djvu text extraction/merging and he was interested in following up
> >> on that. Maybe he has some fresh ideas about it.
> >>
> >> Micru
> >>
> >> On Tue, Jul 16, 2013 at 10:24 AM, Andrea Zanni <[email protected]> wrote:
> >>
> >>> Hi David, Aarti, thibaud and Tpt,
> >>> please look at this thread:
> >>>
> >>>
> >>> http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
> >>> especially the last message.
> >>>
> >>> It seems George Orwell III knows his stuff about Djvu and Proofread
> >>> extension,
> >>> and it's probably worth digging into this "layer text" djvu thing.
> >>>
> >>> Even if I might dream of an ideal solution (a "layered structure" for
> >>> wikisource, in which text can be marked up several times in different
> >>> layers), that is probably very far away.
> >>>
> >>> But it's still important to pave the way for further improvements, I
> >>> guess:
> >>> losing all the information from a formatted, mapped IA djvu is not a
> >>> good thing to do, IMHO.
> >>> And the Visual Editor could help us, in the future, to keep some of
> >>> that information (italics, bold, etc.)
> >>>
> >>> I know Aarti spoke with Alex about abbyy.xml: is it possible to do
> >>> something with it?
> >>>
> >>> Aubrey
> >>>
> >>
> >>
> >>
> >> --
> >> Etiamsi omnes, ego non
> >> _______________________________________________
> >> Wikisource-l mailing list
> >> [email protected]
> >> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
> >>
> >>
> >
>
>
> --
> Etiamsi omnes, ego non
> _______________________________________________
> MediaWiki-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
>
