[
https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gary D. Gregory resolved CSV-196.
---------------------------------
Fix Version/s: 1.13.0
Resolution: Fixed
> Store the information of raw data read by lexer
> -----------------------------------------------
>
> Key: CSV-196
> URL: https://issues.apache.org/jira/browse/CSV-196
> Project: Commons CSV
> Issue Type: Improvement
> Components: Parser
> Affects Versions: 1.4
> Reporter: Matt Sun
> Priority: Major
> Labels: patch
> Fix For: 1.13.0
>
> Original Estimate: 48h
> Time Spent: 40m
> Remaining Estimate: 47h 20m
>
> It would be good for the CSVParser class to record whether a field was
> enclosed in quotes in the original source file.
> For example, for this data sample:
> A, B, C
> a1, "b1", c1
> CSVParser gives us the record a1, b1, c1. That is helpful because it has
> parsed the double quotes, but it also discards information about the
> original data. We can't tell from the returned CSVRecord whether a field in
> the original data was enclosed in double quotes.
> In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV
> is one kind of input to Hadoop jobs, which should support splitting input
> data. To split a CSV file into pieces accurately, we need to count the bytes
> of data that CSVParser actually read. CSVParser has no accurate information
> about whether a field was enclosed in quotes, nor does it store the raw data
> of the original source, so downstream users of CSVParser cannot get this
> information.
> Suggested fix: extend the token/CSVRecord with a boolean field indicating
> whether the column was enclosed in quotes. While the Lexer is running
> getNextToken, set the flag if a field is encapsulated and successfully parsed.
> I found another reported issue with a similar request, but it was marked as
> resolved: [CSV-91]
> https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22
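
The suggested fix in the description above (a per-field flag set while tokenizing) can be illustrated with a minimal, self-contained sketch. This is not the Commons CSV implementation and not the change that resolved this issue: QuotedFieldSketch, Field, and parseLine are hypothetical names, and the tokenizer assumes a ',' delimiter, a '"' quote character with doubled quotes as the escape, and that spaces before a field are skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names are part of Commons CSV.
final class QuotedFieldSketch {

    /** A parsed field plus a flag recording whether the source text quoted it. */
    static final class Field {
        final String value;
        final boolean quoted;

        Field(String value, boolean quoted) {
            this.value = value;
            this.quoted = quoted;
        }
    }

    /** Parses one line, assuming ',' as the delimiter and '"' as the quote character. */
    static List<Field> parseLine(String line) {
        List<Field> fields = new ArrayList<>();
        final int n = line.length();
        int i = 0;
        while (true) {
            // Assumption: spaces before a field are not part of the field.
            while (i < n && line.charAt(i) == ' ') {
                i++;
            }
            StringBuilder value = new StringBuilder();
            boolean quoted = false;
            if (i < n && line.charAt(i) == '"') {
                quoted = true; // the flag the issue asks for
                i++;           // skip the opening quote
                while (i < n) {
                    char c = line.charAt(i);
                    if (c == '"' && i + 1 < n && line.charAt(i + 1) == '"') {
                        value.append('"'); // doubled quote is an escaped quote
                        i += 2;
                    } else if (c == '"') {
                        i++;               // closing quote
                        break;
                    } else {
                        value.append(c);
                        i++;
                    }
                }
                // Ignore anything between the closing quote and the delimiter.
                while (i < n && line.charAt(i) != ',') {
                    i++;
                }
            } else {
                while (i < n && line.charAt(i) != ',') {
                    value.append(line.charAt(i));
                    i++;
                }
            }
            String text = quoted ? value.toString() : value.toString().trim();
            fields.add(new Field(text, quoted));
            if (i >= n) {
                break;
            }
            i++; // consume the delimiter
        }
        return fields;
    }

    public static void main(String[] args) {
        // The second data row from the issue description.
        for (Field f : parseLine("a1, \"b1\", c1")) {
            System.out.println(f.value + " quoted=" + f.quoted);
        }
    }
}
```

Run against the sample row, the sketch reports b1 as quoted and a1 and c1 as unquoted, which is the distinction the reporter says is lost in the returned CSVRecord. Counting the bytes actually read would need separate bookkeeping on the input stream and is not covered here.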