Re: Trying to fix Invalid CSV File

Ryan Rosario Tue, 05 Aug 2008 22:26:39 -0700

On Aug 4, 1:56 pm, Larry Bates <[EMAIL PROTECTED]> wrote:
> Ryan Rosario wrote:
> > On Aug 4, 8:30 am, Emile van Sebille <[EMAIL PROTECTED]> wrote:
> >> John Machin wrote:
> >>> On Aug 4, 6:15 pm, Ryan Rosario <[EMAIL PROTECTED]> wrote:
> >>>> On Aug 4, 1:01 am, John Machin <[EMAIL PROTECTED]> wrote:
> >>>>> On Aug 4, 5:49 pm, Ryan Rosario <[EMAIL PROTECTED]> wrote:
> >>>>>> Thanks Emile! Works almost perfectly, but is there some way I can
> >>>>>> adapt this to quote fields that contain a comma in them?
> >> <snip>
>
> >>> Emile's snippet is pushing it through the csv reading process, to
> >>> demonstrate that his series of replaces works (on your *sole* example,
> >>> at least).
> >> Exactly -- just print out the results of the passed argument:
>
> >> rec.replace(',"',",'''").replace('",',"''',").replace('"','""').replace("'''",'"')
>
> >> '123,"Here is some, text ""and some quoted text"" where the quotes
> >> should have been doubled",321'
>
> >> Where it won't work is if any of the field embedded quotes are next to
> >> commas.
>
> >> I'd run it against the file.  Presumably, you've got a consistent field
> >> count expectation per record.  Any resulting record not matching is
> >> suspect and will identify records this approach won't address.
>
> >> There's probably better ways, but sometimes it's fun to create
> >> executable line noise.  :)
>
> >> Emile
>
> > Thanks for your responses. I think John may be right that I am reading
> > it a second time. I will take a look at the CSV reader documentation
> > and see if that helps. Then once I run it I can see if I need to worry
> > about the comma-next-to-quote issue.
>
> This is a perfect demonstration of why tab delimited files are so much better
> than comma and quote delimited.  Virtually all software can handle tab
> delimited as well as comma and quote delimited, but you would have none of
> these problems if you had used tab delimited.  The chance of tabs being
> embedded in most data is virtually nil.
>
> -Larry


Thank you for all the help. I wasn't using Emile's code correctly. It
fixed 99% of the problem, reducing 30,000 bad lines to about 300. The
remaining cases were too irregular to pin a pattern on, so I just
spent an hour fixing those lines by hand. The fix was typically just
adding one more " next to one that was already there.
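
A minimal sketch of that workflow, assuming one record per line and a
known field count. The replace chain is Emile's from earlier in the
thread; the names fix_quotes, split_good_bad and expected_fields are
made up for illustration, and this is Python 3 rather than what was
current when the thread was written. Records whose field count comes
out wrong are set aside for hand fixing instead of being guessed at.

```python
import csv


def fix_quotes(rec):
    """Apply Emile's chain of replaces to one raw record (no newline)."""
    return (rec.replace(',"', ",'''")    # protect opening field quotes
               .replace('",', "''',")    # protect closing field quotes
               .replace('"', '""')       # double every remaining quote
               .replace("'''", '"'))     # restore the protected quotes


def split_good_bad(lines, expected_fields):
    """Return (parsed_rows, suspect_lines) based on the field count."""
    good, bad = [], []
    for line in lines:
        fixed = fix_quotes(line.rstrip("\r\n"))
        try:
            row = next(csv.reader([fixed]))
        except (csv.Error, StopIteration):
            bad.append(line)
            continue
        if len(row) == expected_fields:
            good.append(row)
        else:
            # Typically an embedded quote sitting next to a comma.
            bad.append(line)
    return good, bad
```

A record with an embedded quote directly before a comma still ends up
in the suspect list, which matches the caveat Emile gave above.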

Next time I am going to be much more careful. Tab delimited is
probably better for my purpose, but I can definitely see there being
issues with invisible tab characters and other weirdness.
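
For the tab delimited route, a rough sketch of how the data could be
checked and written, assuming the rows are already parsed into lists of
strings. The helper names rows_with_tabs and to_tsv are hypothetical.
The csv module's writer takes a delimiter argument and, with its
default quoting, quotes any field that contains the delimiter, so a
stray tab would not break the record; the check is only there to make
the invisible characters visible before they cause surprises elsewhere.

```python
import csv
import io


def rows_with_tabs(rows):
    """Return the indices of rows in which some field contains a tab."""
    return [i for i, row in enumerate(rows)
            if any("\t" in field for field in row)]


def to_tsv(rows):
    """Serialize rows as tab delimited text using the csv module."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def from_tsv(text):
    """Parse tab delimited text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))
```

Reading the output back with from_tsv should give the original rows,
including fields that contain commas, quotes, or tabs.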

Ryan
--
http://mail.python.org/mailman/listinfo/python-list
