On Aug 4, 1:56 pm, Larry Bates <[EMAIL PROTECTED]> wrote:
> Ryan Rosario wrote:
> > On Aug 4, 8:30 am, Emile van Sebille <[EMAIL PROTECTED]> wrote:
> >> John Machin wrote:
> >>> On Aug 4, 6:15 pm, Ryan Rosario <[EMAIL PROTECTED]> wrote:
> >>>> On Aug 4, 1:01 am, John Machin <[EMAIL PROTECTED]> wrote:
> >>>>> On Aug 4, 5:49 pm, Ryan Rosario <[EMAIL PROTECTED]> wrote:
> >>>>>> Thanks Emile! Works almost perfectly, but is there some way I can
> >>>>>> adapt this to quote fields that contain a comma in them?
> >> <snip>
>
> >>> Emile's snippet is pushing it through the csv reading process, to
> >>> demonstrate that his series of replaces works (on your *sole* example,
> >>> at least).
> >> Exactly -- just print out the results of the passed argument:
>
> >> rec.replace(',"',",'''").replace('",',"''',").replace('"','""').replace("'''",'"')
>
> >> '123,"Here is some, text ""and some quoted text"" where the quotes
> >> should have been doubled",321'
>
> >> Where it won't work is if any of the field embedded quotes are next to
> >> commas.
>
> >> I'd run it against the file. Presumably, you've got a consistent field
> >> count expectation per record. Any resulting record not matching is
> >> suspect and will identify records this approach won't address.
>
> >> There's probably better ways, but sometimes it's fun to create
> >> executable line noise. :)
>
> >> Emile
>
> > Thanks for your responses. I think John may be right that I am reading
> > it a second time. I will take a look at the CSV reader documentation
> > and see if that helps. Then once I run it I can see if I need to worry
> > about the comma-next-to-quote issue.
>
> This is a perfect demonstration of why tab delimited files are so much better
> than comma and quote delimited. Virtually all software can handle table
> delimited as well as comma and quote delimited, but you would have none of
> these problems if you had used tab delimited. The chances of tabs being
> embedded in most data is virtually nil.
>
> -Larry
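
For anyone finding this thread later, here is a rough sketch of the approach quoted above: Emile's replace chain, followed by the field-count check he suggested for flagging records it cannot repair. The function names, the sample input line (reconstructed from the output Emile printed), and the expected count of 3 fields are assumptions made for illustration; they are not from the original file.

```python
import csv

def repair_quotes(rec):
    """Apply the replace chain quoted in the thread, unchanged.

    Quotes adjacent to a comma are treated as field delimiters and are
    protected with a ''' placeholder; every remaining quote is doubled.
    """
    return (rec.replace(',"', ",'''")
               .replace('",', "''',")
               .replace('"', '""')
               .replace("'''", '"'))

def find_suspects(lines, expected_fields):
    """Return (line number, original line) for every record whose field
    count after repair differs from expected_fields (a hypothetical
    parameter; use whatever count the file is supposed to have)."""
    suspects = []
    for lineno, line in enumerate(lines, start=1):
        fixed = repair_quotes(line.rstrip("\n"))
        row = next(csv.reader([fixed]))
        if len(row) != expected_fields:
            suspects.append((lineno, line))
    return suspects

# Assumed sample input: embedded quotes that were never doubled.
good = ('123,"Here is some, text "and some quoted text" where the quotes '
        'should have been doubled",321')
# An embedded quote next to a comma, the case Emile says will not work.
bad = '1,"say "hi", ok",2'

print(repair_quotes(good))
print(find_suspects([good, bad], expected_fields=3))
```

The second sample line comes back as a suspect because the repaired text still splits into the wrong number of fields, which is the kind of record that has to be fixed by hand.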
Thank you for all the help. I wasn't using Emile's code correctly. It fixed 99% of the problem, reducing 30,000 bad lines to about 300. The remaining cases were too difficult to pin a pattern on, so I just spent an hour fixing those lines. It was typically just adding one more " to one that was already there.

Next time I am going to be much more careful. Tab delimited is probably better for my purpose, but I can definitely see there being issues with invisible tab characters and other weirdness.

Ryan
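
On the tab-delimited point, a small sketch of how that could look with the standard csv module, assuming the data can be re-exported rather than repaired. The rows below are made up for illustration. Letting csv.writer produce the file means a field that does contain a tab or a quote character is quoted on output, so the round trip still holds.

```python
import csv
import io

# Hypothetical rows; the second one contains an embedded tab on purpose.
rows = [
    ["123", 'Here is some, text "and some quoted text"', "321"],
    ["456", "a field with an embedded\ttab", "654"],
]

# Write tab-delimited text to an in-memory buffer (a real script would
# open a file with newline="" instead).
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
text = buf.getvalue()

# Read it back with the same delimiter and confirm nothing was lost.
parsed = list(csv.reader(io.StringIO(text), delimiter="\t"))
print(parsed == rows)
```

Commas need no special handling here because they are no longer the delimiter; only fields containing a tab or a quote character get quoted by the writer.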