On Wed, May 6, 2015 at 3:04 PM, Barry,Nathan <nba...@cerner.com> wrote:
> Correct me if I am wrong, but with the current code, when an unescaped
> quote is encountered the code doesn't always blow up; rather, it becomes
> out of sequence with the open/close quotes, which will often lead to
> scenarios where both the field delimiters and eol markers are treated as
> if they are inside a quoted attribute (i.e. ignored), resulting in a CSV
> record that is potentially huge, with the size being determined by where
> the code finds the next unescaped quote, at which point the code will
> honor the next field delimiter/eol marker.

That seems like the sort of thing worth fixing in the core, IMHO.

> So in our bad file examples, CSV records were created with individual
> record sizes in the 10s of MB rather than the expected ~3-4KB.
>
> What we were looking to do was:
> - when currently in a quoted field
>   - if we find another quote
>     - look to the next character and see:
>       - if it's a delimiter, eol or eof marker, then close the quote and
>         keep processing normally
>       - if it's not a delimiter, eol or eof marker, then we have a bad
>         record, so ignore all quotes & delimiters and simply look for the
>         next eol/eof and break the record there
>
> The thought being that the 1 bad record won't corrupt the entire file or
> corrupt the record splits; the consumers of each record would then
> encounter errors when trying to parse the record, but they can then
> determine the best course of action: ignore it, reject it, reject the
> whole file, etc.
>
> Nathan
>
> On 5/5/15, 9:30 AM, "Champion,Mac" <mac.champ...@cerner.com> wrote:
>
> > Some users of the CSV Input Format at Cerner had some issues with CSV
> > files from clients where there were stray, unescaped double-quotes inside
> > of fields (ostensibly representing inches).
> > Some bureaucratic stuff prevented us from getting those files reliably
> > cleaned up, so we brainstormed and figured out a way to make the CSV
> > Input Format able to ignore the stray quotes and pass them forward to be
> > handled by whatever parsing solution comes later. We are working on
> > implementing this into our copy of the input format and it seems to be
> > working so far.
> >
> > My question is, is this something that we should log a JIRA for and
> > submit our work to Crunch as well? It's handy in our case, but the files
> > are truly malformed and not following the CSV standards. Should the
> > CSVInputFormat have configurable options to be able to handle malformed
> > files and pass bad records forward, or is the current behavior (blow up
> > and give some info about where the bad records start) the way it truly
> > should behave?
> >
> > Thanks for your input,
> > Mac

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
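
For anyone following the thread, the recovery rule Nathan describes can be sketched roughly as below. This is only an illustration in Python, not the Crunch CSVInputFormat code (which is Java); the function name split_records, its parameters, and the tolerant flag are all hypothetical. It also assumes that a doubled quote ("") inside a quoted field is the escape form, which the thread does not specify. With tolerant=False the splitter simply toggles quote state, which reproduces the runaway-record behavior described above.

```python
def split_records(text, delimiter=",", quote='"', eol="\n", tolerant=True):
    """Split raw CSV text into record strings (hypothetical helper).

    tolerant=True applies the rule from the thread: a quote seen inside a
    quoted field only closes the field when followed by a delimiter, eol,
    or eof; otherwise the record is marked bad and is cut at the next eol.
    tolerant=False treats every unescaped quote as an open/close toggle.
    """
    records = []
    start = 0           # index where the current record begins
    in_quotes = False   # currently inside a quoted field
    bad = False         # current record is malformed; only scan for eol
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if bad:
            if c == eol:
                records.append(text[start:i])
                start = i + 1
                bad = False
        elif in_quotes:
            if c == quote:
                nxt = text[i + 1] if i + 1 < n else None  # None means eof
                if nxt == quote:
                    i += 1  # assumed escape form: "" is a literal quote
                elif not tolerant or nxt is None or nxt in (delimiter, eol):
                    in_quotes = False  # legitimate (or naively accepted) close
                else:
                    in_quotes = False  # stray quote: give up on this record
                    bad = True
        elif c == quote:
            in_quotes = True
        elif c == eol:
            records.append(text[start:i])
            start = i + 1
        i += 1
    if start < n:
        records.append(text[start:])  # trailing record without a final eol
    return records


# First record has a stray quote standing for inches; second is well formed.
sample = '1,"12" pipe",ok\n2,"fine",ok\n'
tolerant_records = split_records(sample)
naive_records = split_records(sample, tolerant=False)
```

On this sample the tolerant pass yields two records, with the malformed one passed through untouched for the downstream parser to accept or reject, while the naive pass swallows both lines into a single record. Whether bad records should be passed forward or cause a failure would presumably be a configuration choice, as Mac suggests.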