[jira] [Comment Edited] (CSV-253) Handle absent values in input (null)

Lars Bruun-Hansen (Jira) Wed, 30 Oct 2019 12:34:29 -0700


    [ 
https://issues.apache.org/jira/browse/CSV-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962105#comment-16962105
 ]


Lars Bruun-Hansen edited comment on CSV-253 at 10/30/19 7:33 PM:
-----------------------------------------------------------------

[~ggregory]  Sorry, but the whole point of PR-51 is that {{nullString}} cannot 
handle the issue at hand. The {{nullString}} feature serves a different 
purpose, so something else is required. 

Example:

The aim is to parse the following CSV:
{noformat}
"John",,""{noformat}
 

What happens when using the {{nullString}} feature to tackle the problem is 
summarized below:
||Setting||element1||element2||element3||
|<expected result>|"John"|null|""|
|with nullString = null|"John"|""|""|
|with nullString = ""|"John"|null|null|

As can be seen, no {{nullString}} setting achieves the desired result. This is 
essentially because Apache CSV currently has no concept of what I call an 
_absent value_. To the Lexer, element2 and element3 have the same value, but 
they don't: element2 is absent, while element3 is an explicitly quoted empty 
string.

With the PR the parser becomes aware of the difference between element2 and 
element3.

You can also see [this 
question|https://stackoverflow.com/questions/34734125/apache-common-csvparser-csvrecord-to-return-null-for-empty-fields]
 on SO. In one of the answers, the Apache CSV library is criticized for 
not being able to handle this situation. Unfortunately, that criticism is 
correct.

 
h3. Why two settings?

Of course there is some conceptual overlap between the proposed new setting 
on the formatter, {{absentIsNull}}, and the existing {{nullString}}, and if the 
library were designed again from scratch they could probably be merged. 
But the history is what it is: the way {{nullString}} works cannot be changed 
without breaking backwards compatibility. I also believe 99.9% of users of the 
library would actually want an absent value parsed as null, but I don't dare 
propose that as a new default, for the same compatibility reason. Hence I 
propose a new setting on the formatter, as an opt-in feature.
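
To make the distinction concrete, here is a minimal, self-contained sketch of a line splitter that remembers whether a field was quoted. It is not the Commons CSV Lexer and not the code from the PR: the class {{AbsentValueSketch}} and the method {{parseLine}} are hypothetical names, and the boolean parameter only borrows the name of the proposed {{absentIsNull}} setting to show the opt-in behaviour. It assumes a comma delimiter, a double-quote character, and single-line input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the Commons CSV Lexer and not the PR's code. */
public class AbsentValueSketch {

    /**
     * Splits one CSV line on commas, honoring double-quoted fields
     * (a doubled quote inside a quoted field is an escaped quote).
     * When absentIsNull is true, a field with zero characters between
     * delimiters becomes null; a quoted empty field ("") stays "".
     */
    static List<String> parseLine(String line, boolean absentIsNull) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean sawQuote = false; // did the current field contain an opening quote?
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
                sawQuote = true;
            } else if (c == ',') {
                fields.add(finish(current, sawQuote, absentIsNull));
                current.setLength(0);
                sawQuote = false;
            } else {
                current.append(c);
            }
        }
        fields.add(finish(current, sawQuote, absentIsNull));
        return fields;
    }

    /** An absent value has no characters and was never quoted. */
    private static String finish(StringBuilder current, boolean sawQuote, boolean absentIsNull) {
        boolean absent = current.length() == 0 && !sawQuote;
        return (absent && absentIsNull) ? null : current.toString();
    }

    public static void main(String[] args) {
        String line = "\"John\",,\"\"";
        System.out.println(parseLine(line, false)); // [John, , ]
        System.out.println(parseLine(line, true));  // [John, null, ]
    }
}
```

With the flag off, the sketch returns empty strings for both element2 and element3, which mirrors the backwards-compatible behaviour. With the flag on, only the unquoted, zero-length element2 becomes null. The key point is that the quoted/unquoted information has to be captured while tokenizing; once both fields have been reduced to empty strings, no later mapping such as {{nullString}} can tell them apart.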

 

What it looks like in the Javadoc (which also explicitly references 
{{nullString}}):

  !2019-10-30 20_31_39-Apache Commons CSV 1.8-SNAPSHOT API.png!

 



> Handle absent values in input (null)
> ------------------------------------
>
>                 Key: CSV-253
>                 URL: https://issues.apache.org/jira/browse/CSV-253
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>            Reporter: Lars Bruun-Hansen
>            Priority: Major
>         Attachments: 2019-10-30 20_31_39-Apache Commons CSV 1.8-SNAPSHOT 
> API.png, Parser-setting-absentIsNull-Javadoc.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The parser must be able to handle absent values in input and translate that 
> into {{null}} as required. I see several tickets on this matter in the 
> history, but none seem to have addressed the issue, at least not for parsing. 
> For this problem, I see a need to introduce a new term:
> Definition: _Absent value_ is when there are zero characters between field 
> delimiters.
> Specifically the aim is to be able to parse the following:
> {noformat}
>     "John",,"Doe"    // 2nd element is absent
>     ,"AA",123        // 1st element is absent
>     "John",90,       // 3rd element is absent
>     "",,90           // 2nd element is absent (1st element isn't)
> {noformat}
>  
> See also CSV-93 which I think never addressed the issue, probably because the 
> reporter was happy with having the issue fixed for CSV output, not for 
> parsing.
> A PR is coming...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
