[ 
https://issues.apache.org/jira/browse/CRUNCH-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044739#comment-14044739
 ] 

Brandon Inman commented on CRUNCH-414:
--------------------------------------

I feel that having two outputs from the do/while loop (i.e., the string buffer 
and the inputText, only one of which is valid at a time) adds extra 
complexity to the code, and the CSV code is already necessarily a little 
complicated to begin with.  Unless there is a demonstrable performance concern 
with using the string buffer for the cases that don't need it (understanding 
that records without embedded newlines may not be the majority of cases), I 
would make the string buffer the single source of truth. 

Also, is the last part of the record accounted for in this algorithm?  That is, 
once we are out of the quoted section, currentlyInQuotes is false, so the 
final value of inputText appears to get lost rather than appended to the 
string buffer.
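
To make the suggestion concrete, here is a rough sketch of a loop in which the 
string buffer is the only output. It is not the actual CSVLineReader code: the 
class name RecordReaderSketch, the readRecord helper, and the use of 
BufferedReader are assumptions made purely for illustration, and quote 
handling is reduced to toggling on every '"' character. Every line is 
appended before the quote state is checked, so the last part of the record 
cannot be dropped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Illustrative sketch only; names here are hypothetical, not Crunch APIs.
final class RecordReaderSketch {

    // Reads one logical CSV record, which may span several physical lines.
    // Returns null when the input is exhausted.
    static String readRecord(BufferedReader reader) throws IOException {
        StringBuilder record = new StringBuilder(); // single source of truth
        boolean currentlyInQuotes = false;
        boolean readAnything = false;
        String inputText;
        do {
            inputText = reader.readLine();
            if (inputText == null) {
                break; // end of input (possibly inside an unterminated quote)
            }
            if (readAnything) {
                record.append('\n'); // restore the newline readLine() stripped
            }
            readAnything = true;
            // Always append, whether or not we are inside quotes, so the
            // final segment after the closing quote is never lost.
            record.append(inputText);
            for (int i = 0; i < inputText.length(); i++) {
                if (inputText.charAt(i) == '"') {
                    // An escaped quote ("") toggles twice, so the state is unchanged.
                    currentlyInQuotes = !currentlyInQuotes;
                }
            }
        } while (currentlyInQuotes);
        return readAnything ? record.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        String csv = "a,\"b\nc\",d\ne,f\n";
        BufferedReader reader = new BufferedReader(new StringReader(csv));
        String record;
        while ((record = readRecord(reader)) != null) {
            System.out.println("[" + record + "]");
        }
    }
}
```

The cost is one extra copy for records that fit on a single line, which is 
the performance question raised above; the benefit is that callers only ever 
look at one value.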

> The CSV file source needs to be a little more robust when handling multi-line 
> CSV files
> ---------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-414
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-414
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: mac champion
>            Assignee: mac champion
>            Priority: Minor
>              Labels: csv, csvparser
>             Fix For: 0.8.4
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Brandon Inman recently reported an undesirable behavior in the CSV file 
> source group of files. Currently, the CSVLineReader, if reading a malformed 
> CSV file, can enter a state where it is perpetually waiting for an end-quote 
> character. As he put it, "Malformed files are malformed files and should 
> probably fail in some regard, but a hang is obviously undesirable." 
> Essentially, the CSVLineReader needs to be tweaked in such a way that an 
> informative exception is thrown after some threshold is reached, instead of 
> basically just hanging. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
