Re: Enhancement to CSV input format?

Barry,Nathan Wed, 06 May 2015 07:05:49 -0700

Correct me if I am wrong, but with the current code, when an unescaped
quote is encountered the code doesn't always blow up. Rather, it gets out
of sequence with the open/close quotes, which often leads to scenarios
where both the field delimiters and EOL markers are treated as if they
were inside a quoted field (i.e. ignored). The result is a CSV record
that is potentially huge: its size is determined by where the code
finds the next unescaped quote, at which point it will honor the
next field delimiter/EOL marker again.


So in our bad-file examples, CSV records were created with individual
record sizes in the tens of MB rather than the expected ~3-4 KB.
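
To illustrate the failure mode, here is a minimal Python sketch of a
record splitter that simply toggles an in-quotes flag on every quote
character. It is not the actual Crunch code; the function name
naive_split and the sample data are made up for illustration.

```python
def naive_split(text, quote='"', eol="\n"):
    """Split text into records, treating any eol inside quotes as data.

    Hypothetical helper: every quote character just flips in_quotes.
    """
    records, current, in_quotes = [], [], False
    for ch in text:
        if ch == quote:
            in_quotes = not in_quotes
        if ch == eol and not in_quotes:
            # Only an eol seen outside quotes ends the record.
            records.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        records.append("".join(current))
    return records


# The stray quote after "5" opens a quoted section that is only closed
# by the next stray quote two lines later, so three lines become one record.
bad = 'a,5" pipe\nc,d\ne,6" pipe\ng,h\n'
print(naive_split(bad))
# ['a,5" pipe\nc,d\ne,6" pipe', 'g,h']
```

In a real file the next stray quote may be megabytes away, which would
explain the record sizes we saw.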

What we were looking to do was:
- when currently in a quoted field
- if we find another quote
- look at the next character and see:
  - if it's a delimiter, EOL or EOF marker, then close the quote and keep
    processing normally
  - if it's not a delimiter, EOL or EOF marker, then we have a bad record,
    so ignore all quotes & delimiters, simply look for the next EOL/EOF,
    and break the record there
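
The steps above can be sketched roughly as follows in Python. This is
only an illustration under stated assumptions, not the Crunch
implementation: the name split_records is made up, and treating a
doubled quote ("") inside a quoted field as an escaped quote is my
assumption about what "unescaped" means here.

```python
def split_records(text, delim=",", quote='"', eol="\n"):
    """Split text into records using the look-ahead rule (hypothetical sketch)."""
    records, current = [], []
    in_quotes = bad = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        nxt = text[i + 1] if i + 1 < n else None  # None stands for EOF
        if ch == eol and (bad or not in_quotes):
            # Normal end of record, or forced break of a bad record.
            records.append("".join(current))
            current, in_quotes, bad = [], False, False
        elif bad:
            # Bad record: ignore quotes and delimiters until the next EOL.
            current.append(ch)
        elif ch == quote and not in_quotes:
            in_quotes = True
            current.append(ch)
        elif ch == quote and nxt == quote:
            # Assumption: a doubled quote is an escaped quote, keep both.
            current.append(ch + nxt)
            i += 1
        elif ch == quote:
            if nxt is None or nxt in (delim, eol):
                in_quotes = False  # legitimate closing quote
            else:
                bad = True  # stray quote: mark the record as bad
            current.append(ch)
        else:
            current.append(ch)
        i += 1
    if current:
        records.append("".join(current))
    return records


data = '"1","5" pipe","x"\n"2","ok","y"\n'
print(split_records(data))
# ['"1","5" pipe","x"', '"2","ok","y"']
```

The bad first line is passed through as-is, and the second record is
still split at the right place.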

The thought is that one bad record won't corrupt the entire file or
the record splits. The consumers of each record would still
encounter errors when trying to parse the bad record, but they can then
determine the best course of action: ignore it, reject it, reject the
whole file, etc.

Nathan

On 5/5/15, 9:30 AM, "Champion,Mac" <[email protected]> wrote:

>Some users of the CSV Input Format at Cerner had some issues with CSV
>files from clients where there were stray, unescaped double-quotes inside
>of fields (ostensibly representing inches). Some bureaucratic stuff
>prevented us from getting those files reliably cleaned up, so we
>brainstormed and figured out a way to make the CSV Input Format able to
>ignore the stray quotes and pass them forward to be handled by whatever
>parsing solution comes later. We are working on implementing this into
>our copy of the input format and it seems to be working so far.
>
>My question is, is this something that we should log a JIRA for and
>submit our work to Crunch as well? It's handy in our case, but the files
>are truly malformed and not following the CSV standards. Should the
>CSVInputFormat have configurable options to be able to handle malformed
>files and pass bad records forward, or is the current behavior (blow up
>and give some info about where the bad records start) the way it truly
>should behave?
>
>Thanks for your input,
>Mac
