[
https://issues.apache.org/jira/browse/CSV-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962105#comment-16962105
]
Lars Bruun-Hansen edited comment on CSV-253 at 10/29/19 3:44 PM:
-----------------------------------------------------------------
[~ggregory] Sorry, but the whole point of PR-51 is that {{nullString}} cannot
handle the issue at hand. The {{nullString}} feature serves a different
purpose. Something else is required.
Example:
The aim is to parse the following CSV:
{noformat}
"John",,""
{noformat}
What happens when using the {{nullString}} feature to tackle the problem is
summarized below:
||Setting||element1||element2||element3||
|<expected result>|"John"|null|""|
|with nullString = null|"John"|""|""|
|with nullString = ""|"John"|null|null|
As can be seen, no {{nullString}} setting achieves the desired result. This is
essentially because Apache CSV currently has no concept of what I call an
_absent value_. To the Lexer, element2 and element3 have the same value. They
don't!
With the PR the parser becomes aware of the difference between element2 and
element3.
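To make the distinction concrete, here is a minimal, self-contained sketch of the idea. It is not the Commons CSV Lexer and not the code from the PR; the class {{AbsentValueSketch}} and the method {{parseLine}} are hypothetical names, and the way {{nullString}} is modelled is an assumption that simply mirrors the table above. The only point it illustrates is that once the lexer remembers whether a field was quoted, an absent value and a quoted empty string can be told apart.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the Commons CSV implementation. */
final class AbsentValueSketch {

    /** A field's text plus whether it was enclosed in quotes in the input. */
    private static final class Token {
        final String text;
        final boolean quoted;

        Token(String text, boolean quoted) {
            this.text = text;
            this.quoted = quoted;
        }
    }

    /** Splits one line on commas and records whether each field was quoted. */
    private static List<Token> lex(String line) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        boolean inQuotes = false;
        boolean quoted = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    buf.append('"'); // doubled quote inside a quoted field
                    i++;
                } else if (c == '"') {
                    inQuotes = false; // closing quote
                } else {
                    buf.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
                quoted = true;
            } else if (c == ',') {
                tokens.add(new Token(buf.toString(), quoted));
                buf.setLength(0);
                quoted = false;
            } else {
                buf.append(c);
            }
        }
        tokens.add(new Token(buf.toString(), quoted));
        return tokens;
    }

    /**
     * Parses one line. With absentIsNull set, a field with zero characters and
     * no quotes becomes null. nullString is modelled as in the table above:
     * any field whose text equals it becomes null, quoted or not.
     */
    static List<String> parseLine(String line, String nullString, boolean absentIsNull) {
        List<String> values = new ArrayList<>();
        for (Token t : lex(line)) {
            boolean absent = !t.quoted && t.text.isEmpty();
            if (absentIsNull && absent) {
                values.add(null);
            } else if (t.text.equals(nullString)) {
                values.add(null);
            } else {
                values.add(t.text);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "\"John\",,\"\"";
        System.out.println(parseLine(line, null, false)); // row "nullString = null"
        System.out.println(parseLine(line, "", false));   // row "nullString = \"\""
        System.out.println(parseLine(line, null, true));  // the expected result
    }
}
```

In this sketch only the third call yields {{"John"}}, {{null}}, {{""}}, because it is the only one that uses the quoting information kept by the lexing step.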
See also [this
question|https://stackoverflow.com/questions/34734125/apache-common-csvparser-csvrecord-to-return-null-for-empty-fields]
on SO. One of the answers criticizes the Apache CSV library for
not being able to handle this situation. Unfortunately, that criticism is correct.
h3. Why two settings?
Of course there is some conceptual overlap between the proposed new setting
on the formatter, {{absentIsNull}}, and the existing {{nullString}}, and if the
library were designed again from scratch they could probably be merged.
But we have the history, and the way {{nullString}} works cannot be changed
without breaking backwards compatibility. I also believe 99.9% of
users of the library would actually want an absent value parsed as null, but
I don't dare propose that as the new default, because it too would break
backwards compatibility. Hence I propose a new setting on the formatter, as
an opt-in feature.
> Handle absent values in input (null)
> ------------------------------------
>
> Key: CSV-253
> URL: https://issues.apache.org/jira/browse/CSV-253
> Project: Commons CSV
> Issue Type: Improvement
> Components: Parser
> Reporter: Lars Bruun-Hansen
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The parser must be able to handle absent values in input and translate that
> into {{null}} as required. I see several tickets on this matter in the
> history, but none seem to have addressed the issue, at least not for parsing.
> For this problem, I see a need to introduce a new term:
> Definition: _Absent value_ is when there are zero characters between field
> delimiters.
> Specifically the aim is to be able to parse the following:
> {noformat}
> "John",,"Doe" // 2nd element is absent
> ,"AA",123 // 1st element is absent
> "John",90, // 3rd element is absent
> "",,90 // 2nd element is absent (1st element isn't)
> {noformat}
>
> See also CSV-93 which I think never addressed the issue, probably because the
> reporter was happy with having the issue fixed for CSV output, not for
> parsing.
> A PR is coming...