[ 
https://issues.apache.org/jira/browse/CRUNCH-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043885#comment-14043885
 ] 

Brandon Inman commented on CRUNCH-414:
--------------------------------------

I see what you are getting at.  

Regardless of this particular change, you might mitigate some memory concerns 
by moving the StringBuilder and the call to inputText.set() outside the loop, 
and then moving the threshold check inside the loop as you suggested.  
You still need the endOfFile check even with a threshold check, in case there 
isn't enough file left to trigger the threshold.
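
As a rough sketch of the shape I have in mind (this is not the actual 
CSVLineReader code; the class name, the maxRecordChars parameter, and the 
simple quote counting are all made up for illustration, and throwing at end 
of file is just one possible choice):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical stand-in for the reader discussed here, not Crunch code.
class BoundedCsvRecordReader {
    private final BufferedReader in;
    private final int maxRecordChars; // assumed configurable threshold
    // Allocated once, outside the read loop, and reused for every record.
    private final StringBuilder buffer = new StringBuilder();

    BoundedCsvRecordReader(Reader source, int maxRecordChars) {
        this.in = new BufferedReader(source);
        this.maxRecordChars = maxRecordChars;
    }

    /** Returns the next record, or null at end of input. */
    String readRecord() throws IOException {
        buffer.setLength(0); // reset the shared buffer instead of reallocating
        boolean inQuotes = false;
        boolean readAnything = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (readAnything) {
                buffer.append('\n'); // keep the line break inside the quoted field
            }
            readAnything = true;
            buffer.append(line);
            // Threshold check inside the loop: fail fast instead of buffering forever.
            if (buffer.length() > maxRecordChars) {
                throw new IOException("CSV record exceeds " + maxRecordChars
                        + " characters; probably an unterminated quote");
            }
            // Each quote character flips the state; a doubled quote flips it twice.
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') {
                    inQuotes = !inQuotes;
                }
            }
            if (!inQuotes) {
                return buffer.toString(); // set the output once, after the loop work
            }
        }
        // End-of-file check: still needed when the rest of the file is too
        // short to trip the threshold.
        if (!readAnything) {
            return null;
        }
        throw new IOException("End of input reached inside a quoted field");
    }
}
```

The point is only that the buffer lives outside the loop, the threshold is 
checked on every line read, and end of file is handled separately.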

I agree that max int is pretty big, possibly too big to be useful, because by 
that point we've probably exhausted resources anyway.  Making it configurable 
would be a good way to go, with the default tuned on the assumption that a few 
gigabytes of memory are available. Optimize for stability rather than 
performance, since this is an error case.  

For what it's worth, the files that I've seen are malformed with a relatively 
random distribution, so the largest unintentionally escaped chunks are 
generally 10k or so. I don't have evidence either way, but I suspect this 
generalizes to a lot of data sources.  And since these files won't process 
correctly anyway, even if they don't trigger an exception, any occurrence of 
them will have to be corrected.

> The CSV file source needs to be a little more robust when handling multi-line 
> CSV files
> ---------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-414
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-414
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: mac champion
>            Assignee: mac champion
>            Priority: Minor
>              Labels: csv, csvparser
>             Fix For: 0.8.4
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Brandon Inman recently reported an undesirable behavior in the CSV file 
> source group of files. Currently, the CSVLineReader, if reading a malformed 
> CSV file, can enter a state where it is perpetually waiting for an end-quote 
> character. As he put it, "Malformed files are malformed files and should 
> probably fail in some regard, but a hang is obviously undesirable." 
> Essentially, the CSVLineReader needs to be tweaked in such a way that an 
> informative exception is thrown after some threshold is reached, instead of 
> basically just hanging. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
