We’ll get a Crunch JIRA logged soon, but at the moment we are debating how the CSV Input should attempt to recover from the unescaped quote, should the input
1) look for a proper end quote (where the next char is a delimiter, EOL or EOF) as recover at that point, mid-line; else fall back to the next EOL Or 2) look for the next EOL and recover at that point Example scenario, invalid input text: --------------------------------------------- this,is,a,"line with 1" problems","in it" this,is,another,line of,text Output using 1 (With ---- as the record breaks) --------------------------------------------- this,is,a,"line with 1" problems","in it” --------------------------------------------- this,is,another,line of,text --------------------------------------------- Output using 2 --------------------------------------------- this,is,a,"line with 1" problems",”in --------------------------------------------- it" this,is,another,line of,text --------------------------------------------- On 5/6/15, 9:12 AM, "Josh Wills" <jwi...@cloudera.com> wrote: >On Wed, May 6, 2015 at 3:04 PM, Barry,Nathan <nba...@cerner.com> wrote: > >> Correct me if I am wrong, but with the current code when an unescaped >> quote is encountered the code doesn¹t always blow up, rather it become >>out >> of sequence with the open/close quotes which will often lead to >>scenarios >> where both the field delimiters and eol markers are now treated as if >>they >> are inside a quoted attribute (i.e. ignored) resulting in a CSV record >> that is potentially huge, with the size being determined when the code >> finds the next unescaped quote, at which point the code will honor the >> next field delimited/eol marker. >> > >That seems like the sort of thing worth fixing in the core, IMHO. > > >> >> So in our bad file examples CSV records were created with individual >> record sizes in the 10s of MB rather than the expected ~3-4KB. >> >> What we were looking to do was to: >> - when currently in a quoted field >> - if we find another quote >> - look to the next character and see: >> - if it¹s a delimiter, eol or eof marker - then close the quote and keep >> processing normally >> - if not a delimited, eol or eof marker - then we have a bad record, so >> ignore all quotes & delimiters and simply look for the next eol/eof and >> break the record there >> >> The thought being that the 1 bad record won¹t corrupt the entire file or >> corrupt the record splits; though the consumers of each record would >>then >> encounter errors when trying to parse the record, but they can then >> determine the best course of action: ignore it, reject it, reject the >> whole file, etc. >> >> Nathan >> >> On 5/5/15, 9:30 AM, "Champion,Mac" <mac.champ...@cerner.com> wrote: >> >> >Some users of the CSV Input Format at Cerner had some issues with CSV >> >files from clients where there were stray, unescaped double-quotes >>inside >> >of fields (ostensibly representing inches). Some bureaucratic stuff >> >prevented us from getting those files reliably cleaned up, so we >> >brainstormed and figured out a way to make the CSV Input Format able to >> >ignore the stray quotes and pass them forward to be handled by whatever >> >parsing solution comes later. We are working on implementing this into >> >our copy of the input format and it seems to be working so far. >> > >> >My question is, is this something that we should log a JIRA for and >> >submit our work to Crunch as well? It¹s handy in our case, but the >>files >> >are truly malformed and not following the CSV standards. Should the >> >CSVInputFormat have configurable options to be able to handle malformed >> >files and pass bad records forward, or is the current behavior (blow up >> >and give some info about where the bad records start) the way it truly >> >should behave? >> > >> >Thanks for your input, >> >Mac >> > >> >CONFIDENTIALITY NOTICE This message and any included attachments are >>from >> >Cerner Corporation and are intended only for the addressee. The >> >information contained in this message is confidential and may >>constitute >> >inside or non-public information under international, federal, or state >> >securities laws. Unauthorized forwarding, printing, copying, >> >distribution, or use of such information is strictly prohibited and may >> >be unlawful. If you are not the addressee, please promptly delete this >> >message and notify the sender of the delivery error by e-mail or you >>may >> >call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) >> >(816)221-1024. >> >> > > >-- >Director of Data Science >Cloudera <http://www.cloudera.com> >Twitter: @josh_wills <http://twitter.com/josh_wills>