On Thu, 24 Jan 2008 11:53:04 -0500 Bernardo Rechea <[EMAIL PROTECTED]> wrote:
BR> It does. I've been bit by linebreaks within a CSV field before. People add
BR> line breaks in spreadsheets and resize spreadsheet columns until it looks
BR> good on their screen/computer/spreadsheet, and don't realize it almost
BR> assuredly won't look anything like it anywhere else... It can be an insidious
BR> and frustrating problem until you realize that the CSV format allows for
BR> that. Quoting from Text::CSV_XS: "The CSV file format does not require a
BR> specific character encoding, byte order, or line terminator format.
BR> Each record is one line terminated by a line feed (ASCII/LF=0x0A) or a
BR> carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A), however,
BR> line-breaks can be embedded."

Hi, Alex,

It may make sense to write a pre-processing scanner that will go through
the file and eliminate (or mark up) line breaks embedded in fields. You need
to beware of CSV quote escapes and all the other junk CSV/TSV formats out
there, but maybe you have a few hundred of these feeds that you can use for
testing... It should be easier to write this scanner than a general CSV
parsing module; after it's done you can just fire up the usual CSV
processing.

Essentially it's a simple FSM with three states: inside_field,
inside_quoted_field, and outside_field. FSMs can be annoying to write in
general because you end up with a lot of states and transitions, but 3
states is not too bad, and the code will be fast.

Ted
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
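
For illustration, here is a minimal sketch of the three-state scanner described in the message above. It is written in Python for brevity, though the same loop ports directly to Perl. Everything in it is an assumption rather than part of the original post: the function name mark_embedded_linebreaks, the marker/sep/quote parameters, and the choice to treat a doubled quote ("") as the only escape inside a quoted field. Feeds that use backslash escapes or other dialects would need extra transitions.

```python
# The three states named in the message above.
OUTSIDE_FIELD, INSIDE_FIELD, INSIDE_QUOTED_FIELD = range(3)


def mark_embedded_linebreaks(text, marker=" ", sep=",", quote='"'):
    """Replace line breaks inside quoted fields with `marker`.

    Hypothetical helper: assumes doubled-quote escaping only.
    Line breaks outside quoted fields are kept as record terminators.
    """
    out = []
    state = OUTSIDE_FIELD
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if state == INSIDE_QUOTED_FIELD:
            if ch == quote:
                if i + 1 < n and text[i + 1] == quote:
                    # Doubled quote: an escaped literal quote, stay in the field.
                    out.append(quote * 2)
                    i += 2
                    continue
                # Closing quote: the rest of the field (if any) is unquoted.
                state = INSIDE_FIELD
                out.append(ch)
            elif ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                # Embedded CRLF collapses to a single marker.
                out.append(marker)
                i += 2
                continue
            elif ch in "\r\n":
                # Embedded bare LF or CR.
                out.append(marker)
            else:
                out.append(ch)
        elif ch == sep or ch in "\r\n":
            # A separator or record terminator ends the current field.
            state = OUTSIDE_FIELD
            out.append(ch)
        elif ch == quote and state == OUTSIDE_FIELD:
            # A quote at the very start of a field opens a quoted field.
            state = INSIDE_QUOTED_FIELD
            out.append(ch)
        else:
            state = INSIDE_FIELD
            out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    sample = 'id,note\r\n1,"line one\r\nline two"\r\n'
    print(repr(mark_embedded_linebreaks(sample)))
```

After this pass every record sits on one physical line, so the output can be handed to ordinary line-oriented CSV processing. Passing a visible marker (for example marker="\\n") instead of a space would let a later step restore the original breaks.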

