Correct me if I am wrong, but with the current code, when an unescaped quote is encountered the code doesn't always blow up. Rather, it gets out of sequence with the open/close quotes, which will often lead to scenarios where both the field delimiters and eol markers are treated as if they are inside a quoted field (i.e. ignored). The result is a CSV record that is potentially huge: its size is determined by where the code finds the next unescaped quote, at which point the code will honor the next field delimiter/eol marker.
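To illustrate the failure mode described above, here is a minimal sketch of a record splitter that simply flips its quote state on every quote character. It is a hypothetical stand-in, not the actual Crunch code: the class name `NaiveQuoteSplitter`, the `\n` record terminator, and the lack of any escape handling are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the Crunch CSVInputFormat code.
final class NaiveQuoteSplitter {

    // Splits text into records at '\n', skipping any newline seen while
    // the splitter believes it is inside a quoted field.
    static List<String> split(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // Every quote flips the state, whether it is stray or not.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // unterminated trailing record
        }
        return records;
    }
}
```

With a line such as `1,"a 5" pipe",x` (three quotes) followed by well-formed lines with two quotes each, the state stays inverted at every line end, so all of the lines fold into a single record until another odd-quote line or the end of the input is reached.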
So in our bad file examples, CSV records were created with individual record sizes in the tens of MB rather than the expected ~3-4KB.

What we were looking to do was:
- when currently in a quoted field
  - if we find another quote
    - look to the next character and see:
      - if it's a delimiter, eol or eof marker
        - then close the quote and keep processing normally
      - if not a delimiter, eol or eof marker
        - then we have a bad record, so ignore all quotes & delimiters and simply look for the next eol/eof and break the record there

The thought being that the one bad record won't corrupt the entire file or corrupt the record splits. The consumers of each record would then encounter errors when trying to parse the bad record, but they can determine the best course of action: ignore it, reject it, reject the whole file, etc.

Nathan

On 5/5/15, 9:30 AM, "Champion,Mac" <mac.champ...@cerner.com> wrote:

>Some users of the CSV Input Format at Cerner had some issues with CSV
>files from clients where there were stray, unescaped double-quotes inside
>of fields (ostensibly representing inches). Some bureaucratic stuff
>prevented us from getting those files reliably cleaned up, so we
>brainstormed and figured out a way to make the CSV Input Format able to
>ignore the stray quotes and pass them forward to be handled by whatever
>parsing solution comes later. We are working on implementing this into
>our copy of the input format and it seems to be working so far.
>
>My question is, is this something that we should log a JIRA for and
>submit our work to Crunch as well? It's handy in our case, but the files
>are truly malformed and not following the CSV standards. Should the
>CSVInputFormat have configurable options to be able to handle malformed
>files and pass bad records forward, or is the current behavior (blow up
>and give some info about where the bad records start) the way it truly
>should behave?
>
>Thanks for your input,
>Mac
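For reference, the look-ahead rule Nathan proposes above could be sketched roughly as follows. This is an illustration under stated assumptions, not the patch under discussion: the class name `LenientRecordSplitter` is hypothetical, the delimiter is fixed to `,` and the eol marker to `\n`, and escape sequences are deliberately not handled (as written, an escaped quote inside a quoted field would be flagged as a bad record).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed recovery rule; NOT the Crunch implementation.
final class LenientRecordSplitter {

    static List<String> split(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean badRecord = false; // once set, only eol/eof ends the record
        int n = text.length();
        for (int i = 0; i < n; i++) {
            char c = text.charAt(i);
            if (c == '\n' && (badRecord || !inQuotes)) {
                // Break the record here and start the next one with clean state.
                records.add(current.toString());
                current.setLength(0);
                inQuotes = false;
                badRecord = false;
                continue;
            }
            current.append(c);
            if (badRecord || c != '"') {
                continue; // bad records ignore quotes and delimiters entirely
            }
            if (!inQuotes) {
                inQuotes = true; // opening quote
            } else if (i + 1 == n || text.charAt(i + 1) == ',' || text.charAt(i + 1) == '\n') {
                inQuotes = false; // quote followed by delimiter, eol or eof: close it
            } else {
                badRecord = true; // stray quote inside a quoted field
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // record terminated by eof
        }
        return records;
    }
}
```

With the input `1,"a 5" pipe",x`, `2,"b",y`, `3,"c",z` (one per line), this sketch yields three records, the first passed through unmodified for the consumer to reject or repair. A legitimately quoted field containing a newline still stays within one record. Note that the rule only triggers inside a quoted field: a stray quote in an unquoted field still opens a quote, so in that case the damage is bounded to the following line rather than avoided.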