A friend has sent me ten large text files, in the hopes that I can munge
them into shape for import into a database. Normally, I'd use a combination
of BBEdit and cubic hours of hand-editing on files such as these, but the
sheer volume of text and the overwhelming variations in the original
documents have gotten me thinking I may be overlooking something.
The files are listings that look something like this:
N.Y. Times
Feb., 1933 Page
5 NATIONALISM RISES UNDER HITLER RULE E-3
ANTI-HITLER CARTOON E-5
HOOVER LOOKS BACK - AND AHEAD - ANNE O'HARE MC CORMICK Mag. 1
HITLER AT THE TOP OF HIS DIZZY PATH - EMIL LENGYEL Mag. 3
OUR FLEET PLAYS A FAR-FLUNG WAR GAME - HANSON BALDWIN Mag. 7
FARM MORTGAGES: A PRESSING NATIONAL ISSUE XX-l
6 NAZI TROOP MARCH WITH EMPIRE FLAGS AS VIOLENCE MOUNTS -
HITLER HEADS PROCESSION 1
ITALY IS EXPECTING NEW TIE WITH REICH - ARNALDO CORTESI,
ROME 4
***** DENIES W. C. BULLITT TALKS FOR COL. (EDWARD MANDELL) HOUSE
(BULLITT HAS BEEN IN FRANCE & VIENNA RECENTLY) 6
(Here's a link to a partially-edited version of one of the
files: https://dl.dropboxusercontent.com/u/10003869/test.txt)
Each entry has a date field that may or may not be filled in, a headline
field, and a page number field. Some of the date fields hold asterisks
where the original compiler wanted to call attention to the entry, and some
of the page numbers have section prefixes.
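One direction I can picture is a small script instead of hand-editing. Below is a rough Python sketch, not a tested solution. It assumes the header lines (paper name, month) have already been removed, that an entry ends on the line whose last token looks like a page reference, and that an entry with no day inherits the last day seen. The names parse_listing, PAGE_RE and START_RE are made up for illustration.

```python
import re

# Assumed page formats at the end of a line: "1", "E-3", "XX-1", "Mag. 7".
PAGE_RE = re.compile(r"\s(?P<page>(?:[A-Z]{1,3}-|Mag\.\s)?\d+)$")
# Assumed entry start: an optional day number or a run of asterisks.
START_RE = re.compile(r"^(?:(?P<day>\d{1,2})|(?P<stars>\*+))\s+")


def parse_listing(lines):
    """Yield (day, flagged, headline, page) tuples from listing lines."""
    day = None       # last day seen; carried forward to undated entries
    flagged = False  # True when the entry started with asterisks
    parts = []       # headline fragments of the entry being built
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not parts:  # first line of a new entry
            flagged = False
            m = START_RE.match(line)
            if m:
                if m.group("day"):
                    day = int(m.group("day"))
                else:
                    flagged = True
                line = line[m.end():]
        m = PAGE_RE.search(line)
        if m:  # page reference found, so the entry is complete
            parts.append(line[:m.start()])
            yield day, flagged, " ".join(parts), m.group("page")
            parts = []
        else:  # headline wraps onto the next line
            parts.append(line)
```

Feeding it the lines of one file should give one tuple per entry, which could then be written out with Python's csv module. It would be fooled by a headline that itself starts with a one- or two-digit number, or by a wrapped line that happens to end in a number, and a page entry like XX-l in the sample above would need correcting (or the pattern loosening) first.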
I'm wide open to any brainstorms.