I'm not sure his attitude will encourage people to work with him to his specifications.
-- brion

On Wed, Jul 17, 2013 at 8:12 AM, David Cuenca <[email protected]> wrote:

> I'm forwarding this message by George Orwell III on en-ws [1]. I think it
> is extremely important, as it offers insight into what is wrong with
> DjVu handling on Wikisource.
>
> "We/you are losing the X-min, Y-min, X-max & Y-max (mapping coordinates)
> because the original PHP contributing a-hole for the DjVu routine on our
> servers never bothered to finish the part where the internal DjVu text
> layer is converted to a (coordinate-rich) XML file using the existing
> DjVuLibre software package because, at the time, the software had issues.
>
> "That faulty DjVuLibre version was the equivalent of 4,317 versions ago and
> the issue has long been fixed now EXCEPT that the .DTD file needed to base
> the plain-text to XML conversion on still has the wrong 'folder path' on
> local DjVuLibre installs (if this is true on server installs as well, I
> cannot say for sure). Once I copied the folder to the [wrong] folder path,
> I was able to generate the XMLs all day long. These XMLs are just like the
> ones IA generates during their process (in addition to the XML that ABBYY
> generates for them).
>
> "So it's not that we as a community decided not to follow through with
> (coordinate-rich) XML generation but got stuck with the plain-text dump
> workaround due to a DjVuLibre problem that no longer exists. Plus, the guy
> who created the beginnings of this fabulous disaster was like a tick with an
> attention span deficit and moved on to conjuring up some other blasted
> thing or another instead of following up on his own workaround & finishing
> the XML coding portion once the DjVuLibre glitch was fixed.
> -- 15:16, 15 July 2013 (UTC)
>
> [1]
> http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
>
> On Wed, Jul 17, 2013 at 6:57 AM, Alex Brollo <[email protected]> wrote:
>
> > Just a brief comment about the DjVu text layer, using IA files to dig
> > deeper into the topic.
> >
> > FineReader OCR stores incredibly detailed information in a proprietary
> > format; various FineReader versions then export parts of this
> > extremely rich set of information into different outputs, one of them
> > being the DjVu text layer. It's worth noting that even if any information
> > stored in the DjVu text layer can be extracted and used, the set of
> > information wrapped into the DjVu text layer (in either Lisp-like format
> > or XML format) is only a minor subset of the original OCR information.
> >
> > Anyone interested in getting much more information can find it in the
> > abbyy.xml output; Internet Archive offers it as abbyy.gz in the list
> > of exportable files. It's a very heavy and complex XML structure, but it
> > is possible to parse it and to extract from it any information wrapped
> > into the DjVu text layer and much more. Most interesting is wordPenalty,
> > that is, word by word, a summary of the degree of uncertainty of the OCR
> > recognition of the whole word.
> >
> > We (Aarti and I) are digging into this mess, with fast preliminary
> > results; you can see in [[it:w:Utente:Alex brollo/Sandbox]] some brief
> > pieces of text extracted from abbyy.gz, where doubtful words (in the
> > opinion of the OCR software) are red. They can be easily managed by
> > VisualEditor, coming simply from a span tag.
> >
> > Now I'm waiting for Aarti's work; as soon as a VisualEditor for nsPage
> > runs, it would be possible to extract text by bot from abbyy.gz (if the
> > work comes from IA) and to upload such text as OCR.
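
For readers who want to try the abbyy.xml idea described above, here is a minimal sketch in Python. It is not the code Alex and Aarti are using. It assumes that each character sits in a charParams element carrying wordStart and wordPenalty attributes, and that a higher wordPenalty means a less certain word. The SAMPLE snippet is made up for illustration (it is not real IA data), and the function names and the threshold of 10 are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Illustrative, hand-written snippet; a real abbyy.xml is far larger and
# normally uses an XML namespace, which the code below ignores.
SAMPLE = """<document><page><block><text><par><line><formatting>
<charParams wordStart="true" wordPenalty="0">T</charParams>
<charParams wordStart="false" wordPenalty="0">h</charParams>
<charParams wordStart="false" wordPenalty="0">e</charParams>
<charParams wordStart="false"> </charParams>
<charParams wordStart="true" wordPenalty="36">q</charParams>
<charParams wordStart="false" wordPenalty="36">n</charParams>
<charParams wordStart="false" wordPenalty="36">i</charParams>
<charParams wordStart="false" wordPenalty="36">c</charParams>
<charParams wordStart="false" wordPenalty="36">k</charParams>
</formatting></line></par></text></block></page></document>"""


def words_with_penalty(xml_text):
    """Return a list of (word, penalty) pairs rebuilt from charParams elements."""
    root = ET.fromstring(xml_text)
    words, chars, penalty = [], [], 0
    for el in root.iter():
        # Compare only the local tag name so namespaced files also match.
        if el.tag.rsplit("}", 1)[-1] != "charParams":
            continue
        ch = el.text or ""
        starts_word = el.get("wordStart") == "true"
        # A new word start or a whitespace character closes the current word.
        if chars and (starts_word or not ch.strip()):
            words.append(("".join(chars), penalty))
            chars, penalty = [], 0
        if ch.strip():
            chars.append(ch)
            penalty = max(penalty, int(el.get("wordPenalty", "0")))
    if chars:
        words.append(("".join(chars), penalty))
    return words


def mark_doubtful(words, threshold=10):
    """Wrap words whose penalty exceeds the (hypothetical) threshold in a red span."""
    out = []
    for text, penalty in words:
        safe = html.escape(text)
        if penalty > threshold:
            safe = '<span style="color:red">' + safe + "</span>"
        out.append(safe)
    return " ".join(out)


print(mark_doubtful(words_with_penalty(SAMPLE)))
```

For a file downloaded from IA, the gzip module from the standard library can decompress abbyy.gz before the text is handed to words_with_penalty. For very large books an incremental parser such as ET.iterparse would probably be a better fit than loading the whole tree.
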
> >
> > Alex
> >
> > 2013/7/16 David Cuenca <[email protected]>
> >
> >> Hi Aubrey,
> >> Thanks for the heads-up. I have CC'ed Sébastien from fr-ws; he worked on
> >> the DjVu text extraction/merging and was interested in following up on
> >> that. Maybe he has some fresh ideas about it.
> >>
> >> Micru
> >>
> >> On Tue, Jul 16, 2013 at 10:24 AM, Andrea Zanni <[email protected]> wrote:
> >>
> >>> Hi David, Aarti, thibaud and Tpt,
> >>> please look at this thread:
> >>>
> >>> http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
> >>> especially the last message.
> >>>
> >>> It seems George Orwell III knows his stuff about DjVu and the Proofread
> >>> extension,
> >>> and it's probably worth digging into this "layer text" DjVu thing.
> >>>
> >>> Even if I might dream of an ideal solution (a "layered structure" for
> >>> Wikisource, in which text can be marked up several times in different
> >>> layers), that is probably very far away.
> >>>
> >>> But it's still important to pave the way for further improvements, I
> >>> guess:
> >>> losing all the information from a formatted, mapped IA DjVu is not a
> >>> good thing to do, IMHO.
> >>> And the Visual Editor could help us, in the future, to keep some of that
> >>> information (italics, bold, etc.).
> >>>
> >>> I know Aarti spoke with Alex about abbyy.xml: is it possible to do
> >>> something with it?
> >>>
> >>> Aubrey
> >>
> >> --
> >> Etiamsi omnes, ego non
> >> _______________________________________________
> >> Wikisource-l mailing list
> >> [email protected]
> >> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>
> --
> Etiamsi omnes, ego non
> _______________________________________________
> MediaWiki-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
