[ 
https://issues.apache.org/jira/browse/CRUNCH-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044088#comment-14044088
 ] 

mac champion commented on CRUNCH-414:
-------------------------------------

Not sure I understand about moving the .set/stringBuilder stuff. Each execution 
of the loop is another in-quotes line which needs to have a newline appended to 
it. I think it still needs to be appended inside of the loop, right?

I added the EOF check to the do-while, you're definitely right about that. 

What do you think about using the split size as the threshold? 64 MB is the default, 
but it can be configured (check CSVInputFormat's getSplits method to see how 
it's used). That seems like a decently logical place to abort if the end of the 
record hasn't been found. Can you think of a situation where one CSV record 
would be larger than the size of the pieces the CSV file should be split into?

As for evenly-malformed files, you're right, they won't trigger an exception 
here, but will have to be dealt with either manually or by more detailed parsing 
after these lines are read. 
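
A rough sketch of the loop being discussed, to make the three points above 
concrete: the newline appended inside the loop, the EOF check, and the abort 
threshold. This is not the actual CSVLineReader code. The class name, the 
fromString helper, and the maxRecordChars parameter are all hypothetical, and 
the limit is counted in characters here purely for simplicity, whereas a split 
size is a byte count.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical illustration only; not Crunch's CSVLineReader.
class QuotedRecordReader {
    private final BufferedReader in;
    private final long maxRecordChars; // assumed stand-in for the split size

    QuotedRecordReader(BufferedReader in, long maxRecordChars) {
        this.in = in;
        this.maxRecordChars = maxRecordChars;
    }

    // Convenience helper for trying the sketch on an in-memory string.
    static QuotedRecordReader fromString(String text, long maxRecordChars) {
        return new QuotedRecordReader(
                new BufferedReader(new StringReader(text)), maxRecordChars);
    }

    /** Returns the next logical CSV record, or null at a clean end of input. */
    String readRecord() throws IOException {
        String line = in.readLine();
        if (line == null) {
            return null;
        }
        StringBuilder record = new StringBuilder();
        boolean inQuotes = false;
        while (true) {
            // Every quote character toggles the state; an escaped quote ("")
            // toggles twice and so leaves the state unchanged.
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') {
                    inQuotes = !inQuotes;
                }
            }
            record.append(line);
            if (!inQuotes) {
                return record.toString();
            }
            // Still inside quotes: the line break belongs to the field, so it
            // has to be appended on every pass through the loop.
            record.append('\n');
            if (record.length() > maxRecordChars) {
                throw new IOException("No closing quote found within "
                        + maxRecordChars + " characters; file is probably malformed");
            }
            line = in.readLine();
            if (line == null) {
                // EOF check: stop instead of waiting for a quote that never comes.
                throw new IOException("End of input reached inside a quoted field");
            }
        }
    }
}
```

Note that an input with an even number of stray quotes passes through this 
loop without an exception, which is the evenly-malformed case mentioned above.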

> The CSV file source needs to be a little more robust when handling multi-line 
> CSV files
> ---------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-414
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-414
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: mac champion
>            Assignee: mac champion
>            Priority: Minor
>              Labels: csv, csvparser
>             Fix For: 0.8.4
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Brandon Inman recently reported an undesirable behavior in the CSV file 
> source group of files. Currently, the CSVLineReader, if reading a malformed 
> CSV file, can enter a state where it is perpetually waiting for an end-quote 
> character. As he put it, "Malformed files are malformed files and should 
> probably fail in some regard, but a hang is obviously undesirable." 
> Essentially, the CSVLineReader needs to be tweaked in such a way that an 
> informative exception is thrown after some threshold is reached, instead of 
> basically just hanging. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
