Re: Enhancement to CSV input format?

Barry,Nathan Thu, 07 May 2015 09:18:13 -0700

We’ll get a Crunch JIRA logged soon, but at the moment we are debating how
the CSV Input should attempt to recover from an unescaped quote. Should
the input:


1) look for a proper end quote (where the next char is a delimiter, EOL or
EOF) and recover at that point, mid-line; else fall back to the next EOL
Or
2) look for the next EOL and recover at that point

Example scenario, invalid input text:
---------------------------------------------

this,is,a,"line with 1" problems","in
it"
this,is,another,line of,text

Output using 1 (With ---- as the record breaks)
---------------------------------------------

this,is,a,"line with 1" problems","in
it"
---------------------------------------------
this,is,another,line of,text
---------------------------------------------



Output using 2
---------------------------------------------

this,is,a,"line with 1" problems","in
---------------------------------------------

it"
this,is,another,line of,text
---------------------------------------------
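To make the two options concrete, here is a rough Python sketch of a record splitter that supports both. It is not the Crunch code: the function name `split_records` and its `strategy` parameter are made up for illustration, it only treats "\n" as EOL, and it ignores escape characters entirely.

```python
def split_records(text, strategy, delimiter=",", quote='"'):
    """Split raw CSV text into records, recovering from stray quotes.

    strategy 1: after a stray quote, scan the rest of the line for a proper
                end quote and resume there; otherwise resume at the EOL.
    strategy 2: after a stray quote, resume at the next EOL.
    (Hypothetical helper for illustration; escape characters are not handled.)
    """
    records = []
    start = 0          # index where the current record begins
    in_quote = False   # are we inside a quoted field?
    i, n = 0, len(text)

    def is_end_quote(pos):
        # A proper end quote is followed by a delimiter, EOL or EOF.
        return pos + 1 >= n or text[pos + 1] in (delimiter, "\n")

    while i < n:
        ch = text[i]
        if not in_quote:
            if ch == quote:
                in_quote = True
            elif ch == "\n":
                records.append(text[start:i])
                start = i + 1
        elif ch == quote:
            in_quote = False
            if not is_end_quote(i):
                # Stray quote inside a quoted field: pick the recovery point.
                eol = text.find("\n", i)
                eol = n if eol == -1 else eol
                resume = eol
                if strategy == 1:
                    for j in range(i + 1, eol):
                        if text[j] == quote and is_end_quote(j):
                            resume = j + 1  # continue just past that quote
                            break
                i = resume
                continue
        i += 1

    if start < n:
        records.append(text[start:])
    return records


sample = ('this,is,a,"line with 1" problems","in\n'
          'it"\n'
          'this,is,another,line of,text')

for strategy in (1, 2):
    print(strategy, split_records(sample, strategy))
```

Run against the invalid input above, strategy 1 yields the two records shown under "Output using 1" and strategy 2 yields the two records shown under "Output using 2". With option 2 the second record still swallows the following line, because the quote in `it"` opens a new quoted section.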





On 5/6/15, 9:12 AM, "Josh Wills" <jwi...@cloudera.com> wrote:

>On Wed, May 6, 2015 at 3:04 PM, Barry,Nathan <nba...@cerner.com> wrote:
>
>> Correct me if I am wrong, but with the current code, when an unescaped
>> quote is encountered the code doesn't always blow up; rather, it gets
>> out of sequence with the open/close quotes. That will often lead to
>> scenarios where both the field delimiters and EOL markers are treated as
>> if they are inside a quoted attribute (i.e. ignored), resulting in a CSV
>> record that is potentially huge. Its size is determined by where the code
>> finds the next unescaped quote, at which point the code will honor the
>> next field delimiter/EOL marker.
>>
>
>That seems like the sort of thing worth fixing in the core, IMHO.
>
>
>>
>> So in our bad file examples CSV records were created with individual
>> record sizes in the 10s of MB rather than the expected ~3-4KB.
>>
>> What we were looking to do was:
>> - when currently in a quoted field
>>   - if we find another quote
>>     - look to the next character and see:
>>       - if it's a delimiter, EOL or EOF marker, then close the quote and
>>         keep processing normally
>>       - if it's not a delimiter, EOL or EOF marker, then we have a bad
>>         record, so ignore all quotes & delimiters and simply look for the
>>         next EOL/EOF and break the record there
>>
>> The thought being that the one bad record won't corrupt the entire file
>> or corrupt the record splits. The consumers of each record would then
>> encounter errors when trying to parse the record, but they can then
>> determine the best course of action: ignore it, reject it, reject the
>> whole file, etc.
>>
>> Nathan
>>
>> On 5/5/15, 9:30 AM, "Champion,Mac" <mac.champ...@cerner.com> wrote:
>>
>> >Some users of the CSV Input Format at Cerner had some issues with CSV
>> >files from clients where there were stray, unescaped double-quotes
>> >inside of fields (ostensibly representing inches). Some bureaucratic stuff
>> >prevented us from getting those files reliably cleaned up, so we
>> >brainstormed and figured out a way to make the CSV Input Format able to
>> >ignore the stray quotes and pass them forward to be handled by whatever
>> >parsing solution comes later. We are working on implementing this into
>> >our copy of the input format and it seems to be working so far.
>> >
>> >My question is, is this something that we should log a JIRA for and
>> >submit our work to Crunch as well? It's handy in our case, but the files
>> >are truly malformed and not following the CSV standards. Should the
>> >CSVInputFormat have configurable options to be able to handle malformed
>> >files and pass bad records forward, or is the current behavior (blow up
>> >and give some info about where the bad records start) the way it truly
>> >should behave?
>> >
>> >Thanks for your input,
>> >Mac
>>
>>
>
>
>-- 
>Director of Data Science
>Cloudera <http://www.cloudera.com>
>Twitter: @josh_wills <http://twitter.com/josh_wills>
