Matt Sun updated CSV-196:
    Priority: Major  (was: Minor)

> Store the info of whether a field is enclosed by quotes
> -------------------------------------------------------
>                 Key: CSV-196
>                 URL: https://issues.apache.org/jira/browse/CSV-196
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>    Affects Versions: 1.4
>            Reporter: Matt Sun
>              Labels: easyfix, features, patch
>   Original Estimate: 48h
>  Remaining Estimate: 48h
> It would be good for the CSVParser class to store whether a field was 
> enclosed by quotes in the original source file.
> For example, for this data sample:
> A, B, C
> a1, "b1", c1
> CSVParser gives us the record a1, b1, c1, which is helpful because it parsed 
> the double quotes, but we also lose information about the original data at 
> the same time. We cannot tell from the returned CSVRecord whether the 
> original data was enclosed by double quotes or not.
> In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV 
> is one kind of Hadoop input, and Hadoop needs to split its input data. To 
> accurately split a CSV file into pieces, the program needs to count the bytes 
> of data CSVParser actually read. CSVParser has no accurate information about 
> whether a field was enclosed by quotes, nor does it store the raw data of 
> the original source. Downstream users of the Commons CSV CSVParser are not 
> able to get that information.
> To suggest a fix: Extend the token/CSVRecord to have a field indicating 
> whether the column was enclosed by quotes. While Lexer is doing getNextToken, 
> set the flag if a field is encapsulated and successfully parsed.
> I found another issue reported, but it was marked as resolved: [CSV-91] 
> https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22
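
The suggested fix can be illustrated with a minimal, self-contained sketch. It is not Commons CSV code and does not use the real Lexer, Token, or CSVRecord classes: the names QuoteAwareLine, Field, and parse are hypothetical. It assumes a comma delimiter, a double-quote quote character, doubled quotes as the escape, and that spaces around fields are ignored, as the a1, b1, c1 result in the description implies. The flag is only set once the closing quote has been seen, roughly matching "encapsulated and successfully parsed".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Commons CSV.
final class QuoteAwareLine {

    /** One parsed field plus a flag recording whether it was quoted in the source. */
    public static final class Field {
        public final String value;
        public final boolean quoted;

        Field(String value, boolean quoted) {
            this.value = value;
            this.quoted = quoted;
        }
    }

    /** Splits one CSV line on commas, ignoring spaces around each field. */
    public static List<Field> parse(String line) {
        List<Field> fields = new ArrayList<>();
        final int n = line.length();
        int i = 0;
        while (true) {
            // Skip spaces before the field (assumed behaviour, see lead-in).
            while (i < n && line.charAt(i) == ' ') {
                i++;
            }
            StringBuilder sb = new StringBuilder();
            if (i < n && line.charAt(i) == '"') {
                i++; // consume the opening quote
                boolean closed = false;
                while (i < n) {
                    char c = line.charAt(i);
                    if (c == '"' && i + 1 < n && line.charAt(i + 1) == '"') {
                        sb.append('"'); // doubled quote is an escaped quote
                        i += 2;
                    } else if (c == '"') {
                        i++; // consume the closing quote
                        closed = true;
                        break;
                    } else {
                        sb.append(c);
                        i++;
                    }
                }
                if (!closed) {
                    throw new IllegalArgumentException("Unterminated quoted field");
                }
                // Ignore anything between the closing quote and the next comma.
                while (i < n && line.charAt(i) != ',') {
                    i++;
                }
                fields.add(new Field(sb.toString(), true)); // flag set after a successful parse
            } else {
                while (i < n && line.charAt(i) != ',') {
                    sb.append(line.charAt(i));
                    i++;
                }
                fields.add(new Field(sb.toString().trim(), false));
            }
            if (i >= n) {
                break;
            }
            i++; // consume the comma
        }
        return fields;
    }

    public static void main(String[] args) {
        for (Field f : parse("a1, \"b1\", c1")) {
            System.out.println(f.value + " quoted=" + f.quoted);
        }
    }
}
```

For the sample row above, this prints a1 quoted=false, b1 quoted=true, c1 quoted=false, one per line. A per-field flag alone does not recover the exact byte count of the source (escaped quotes and skipped spaces also matter), so a caller splitting input by bytes would still need the raw length or offset of each record.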

This message was sent by Atlassian JIRA
