On Thu, 24 Jan 2008 11:53:04 -0500 Bernardo Rechea <[EMAIL PROTECTED]> wrote: 

BR> It does. I've been bit by linebreaks within a CSV field before. People add
BR> line breaks in spreadsheets and resize spreadsheet columns until it looks
BR> good on their screen/computer/spreadsheet, and don't realize it almost
BR> assuredly won't look anything like it anywhere else... It can be an
BR> insidious and frustrating problem until you realize that the CSV format
BR> allows for that. Quoting from Text::CSV_XS: "The CSV file format does not
BR> require a specific character encoding, byte order, or line terminator
BR> format. Each record is one line terminated by a line feed (ASCII/LF=0x0A)
BR> or a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A), however,
BR> line-breaks can be embedded."

Hi, Alex,

It may make sense to write a pre-processing scanner that goes through
the file and eliminates (or marks up) the line breaks embedded in fields.
You need to beware of CSV quote escapes and all the other junk CSV/TSV
variants out there, but maybe you have a few hundred of these feeds that
you can use for testing...

It should be easier to write this scanner than a general CSV parsing
module; once it has run, you can just fire up the usual CSV processing.
Essentially it's a simple finite state machine (FSM) with three states:
inside_field, inside_quoted_field, and outside_field.  FSMs can be
annoying to write in general because you end up with a lot of states and
transitions, but three states is not too bad, and the code will be fast.
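
Here is a rough sketch of that three-state scanner, written in Python only
to keep it short; the same structure should port directly to Perl. It is
not a tested production tool. The function name mark_embedded_linebreaks
and the marker argument are made up for illustration, and it assumes a
comma delimiter, double-quote quoting, and doubled quotes ("") as the only
escape. Feeds that use backslash escapes or other delimiters would need
extra transitions.

```python
# The three states named above.
OUTSIDE_FIELD, INSIDE_FIELD, INSIDE_QUOTED_FIELD = range(3)


def mark_embedded_linebreaks(text, marker="\\n", delimiter=",", quote='"'):
    """Replace line breaks inside quoted fields with `marker`.

    Hypothetical helper: assumes doubled-quote escapes ("") only.
    The default marker is the two characters backslash + n.
    """
    out = []
    state = OUTSIDE_FIELD
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if state == INSIDE_QUOTED_FIELD:
            if ch == quote:
                if i + 1 < n and text[i + 1] == quote:
                    # Doubled quote is an escaped literal quote; stay put.
                    out.append(quote + quote)
                    i += 2
                    continue
                # Closing quote: the rest of the field is unquoted.
                state = INSIDE_FIELD
                out.append(ch)
            elif ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                # Embedded CRLF collapses to a single marker.
                out.append(marker)
                i += 2
                continue
            elif ch in "\r\n":
                # Embedded bare LF or CR.
                out.append(marker)
            else:
                out.append(ch)
        else:
            if ch == delimiter or ch in "\r\n":
                # A delimiter or a real record terminator ends the field.
                state = OUTSIDE_FIELD
            elif ch == quote and state == OUTSIDE_FIELD:
                # A quote only opens a quoted field at the field start.
                state = INSIDE_QUOTED_FIELD
            else:
                state = INSIDE_FIELD
            out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    sample = 'a,"line1\nline2",c\nd,e,f\n'
    print(mark_embedded_linebreaks(sample))
```

After this pass every remaining line terminator ends a record, so the
output can be read line by line and handed to the usual CSV processing;
the marker can be turned back into a real line break afterwards if the
data needs it.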

Ted
 
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
