Andre, this is a great summary -- I've linked to it from the English Wikisource Scriptorium.
Do you see opportunities for the two projects to coordinate their workflows better?

SJ

On Thu, Jun 24, 2010 at 11:13 PM, Andre Engels wrote:
> On Thu, Jun 24, 2010 at 4:37 PM, Samuel Klein wrote:
>> I love those proofreading features, and the new default layout for a
>> book's pages and TOC. Wikisource is becoming AWESOME.
>>
>> Do we have PGDP contributors who can weigh in on how similar the
>> processes are? Is there a way for us to actually merge workflows with
>> them?
>
> I am quite active on PGDP, but not on Wikisource, so I can describe
> how things work there, but not how similar it is to Wikisource.
>
> Typical of the PGDP workflow are an emphasis on quality over
> quantity (exemplified by running not 1 or 2 but 3 rounds of human
> checking of the OCR result - correctness in copying is well above
> 99.99% for most books) and work being done in page-size chunks rather
> than whole books, chapters, paragraphs, sentences, words or whatever
> else one could think of.
>
> There are a number of people involved, although people can and often do
> fill several roles for one book.
>
> First, there is the Content Provider (CP).
>
> He or she first contacts Project Gutenberg to get a clearance. This is
> basically a statement from PG that they believe the work is out of
> copyright. In general, US copyright is what is taken into account for
> this, although there are also servers in other countries (Canada and
> Australia as far as I know), which publish some material that is out
> of copyright in those countries even if it is not in the US. Such
> works do not go through PGDP, but may go through its sister projects
> DPCanada or DPEurope.
>
> Next, the CP will scan the book, or harvest the scans from the web,
> and run OCR on them. They will usually also write a description of the
> book for the proofreaders, so they can see whether they are
> interested.
> The scans and the OCR output are uploaded to the PGDP servers,
> and the project is handed over to the Project Manager (PM) (although
> in most cases CP and PM are the same person).
>
> The Project Manager is responsible for the project in the next stages.
> This means:
> * specifying the rules and guidelines that are to be followed when
>   proofreading the book, at least where those differ from the
>   standard guidelines
> * answering questions from proofreaders
> * keeping the good and bad word lists up to date. These are used in
>   wordcheck (a kind of spellchecker) so that it treats words as
>   correct or incorrect
>
> The project then goes through a number of rounds. The standard number
> is 5 rounds, of which 3 are proofreading and 2 are formatting, but it
> is possible for the PM to make a request to skip one or more rounds or
> go through a round twice.
>
> In the first three rounds, the proofreading rounds, a proofreader
> requests one page at a time, compares the OCR output (or the previous
> proofreader's output) with the scan, and changes the text to
> correspond to the scan. In the first round (P1) anyone can do this;
> the second round (P2) is only accessible to those who have been at the
> site for some time and done a certain number of pages (21 days and 300
> pages, if I recall correctly); for the third round (P3) one has to
> qualify. For qualification, one's P2 pages are checked (using the
> subsequent edits made in P3). The norm is that one should not leave
> more than one error per five pages.
>
> After the three (or two or four) rounds of proofing, the foofing
> (formatting) rounds are gone through. In these, again a proofreader
> (now called formatter) requests and edits one page at a time, but
> where the proofreaders dealt with copying the text as precisely as
> possible, the formatter will deal with all other aspects of the work.
> They mark when text is italic, bold or otherwise in a special
> format, which texts are chapter headers, how tables are laid out,
> etcetera. Here there are two rounds, although the second one can be
> skipped or a round duplicated, like before. The first formatting round
> (F1) has the same entrance restrictions as P2; F2 has a qualification
> system comparable to P3.
>
> After this, the PM passes the book on to the Post-Processor (PP).
> Again, this is often the same person, but not always. In some
> cases, the PP has already been appointed; in others the book will sit
> in a pool until picked up by a willing PP. The PP does all that is
> needed to get from the F2 output to something that can be put on
> Project Gutenberg: they recombine the pages into one work, move stuff
> around where needed, change the formatters' mark-up into something
> that's more appropriate for reading, in most cases generate an HTML
> version, etcetera.
>
> A PP who has already post-processed several books well can
> then send the book to PG. In other cases, the book will then go to the
> PPV (Post-Processing Verifier), an experienced PP, who checks the PP's
> work, and gives them hints on what should be improved or makes those
> improvements themselves.
>
> Finally, when the PP or PPV sends the book to PG, there is a whitewasher
> who checks the book once again; however, that is outside the scope of
> this (already too long) description, because it belongs to PG's
> process rather than PGDP's.
>
> To stop the rounds from overcrowding with books, there are queues for
> each round, containing books that are ready to enter the round, but
> have not yet done so. To keep some variety, there are different queues
> by language and/or subject type. A problem with this has been that the
> later rounds, having less manpower because of the higher standards
> required, could not keep up with P1 and F1.
> There has been work to address
> this, and the P2 queues have been brought down to a decent
> size, but in P3 and F2 books can literally sit in the queues for
> years, and PP is still a bottleneck as well.
>
>
> --
> André Engels
>
> _______________________________________________
> foundation-l mailing list
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
>

-- 
Samuel Klein
identi.ca:sj    w:user:sj
