[jira] [Commented] (CSV-293) Add support for multiple null String values

Paul Millar (Jira) Mon, 19 Aug 2024 01:43:04 -0700


    [ 
https://issues.apache.org/jira/browse/CSV-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874817#comment-17874817
 ]


Paul Millar commented on CSV-293:
---------------------------------

Many thanks to everyone for looking into this request.

Unfortunately, I somehow missed your much earlier reply [~ggregory].  A 
pull-request would have been a natural way forward (a concrete proposal on how 
to address this issue). I'm sorry I didn't reply with such a request.  My 
thanks go to [~sigee] for writing a patch and issuing a corresponding pull 
request.  Looking at the diff, this is very much the direction I would have 
tried.

Another unfortunate aspect is that my original report lacked a description of the 
actual use-case, but perhaps I can add this here, to help clarify the situation.

It's been a little while, but (from memory) my interest here is with working 
with data provided by the European Commission (EC) on their various portals.  
The EC provide data on their funded activity in CSV format, but some of the 
data simply records the information provided by people, with limited 
validation.  The result is that there can be a number of different textual 
representations within the CSV fields that represent "does not apply", "is 
unknown", or something similar.  For example, this could be represented by an 
empty string, by some white-space, by a single dash ({{-}}), etc.  Personally, 
I would map all such values to the {{null}} value.  (I believe this is the 
use-case that CSVW is addressing through supporting multiple null values, but 
I'm not so familiar with CSVW.)
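
As a minimal sketch of that mapping, assuming the field values have already been read as plain Strings, the application-side conversion could look like the following.  The class {{NullTokens}}, the method {{toNull}} and the token set are hypothetical names chosen for illustration; none of them are part of Commons CSV.

```java
import java.util.Set;

/** Hypothetical helper: maps several "no value" spellings to null. */
final class NullTokens {

    // Assumed token set, compared after stripping surrounding white-space,
    // so a white-space-only field matches the empty-string token.
    private static final Set<String> TOKENS = Set.of("", "-");

    private NullTokens() {
    }

    /** Returns null if the field is null or matches a token, else the field unchanged. */
    static String toNull(String field) {
        if (field == null) {
            return null;
        }
        return TOKENS.contains(field.strip()) ? null : field;
    }
}
```

With this sketch, an empty field, a field containing only white-space and a single dash all become {{null}}, while any other text is returned untouched.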

This use-case is somewhat different from simply handling the null values from 
some database or spreadsheet.  In that other use-case, a single, consistent 
string value is used to represent the software's internal null value.  In that 
case, a single null value (in {{CSVFormat}}) makes sense.

I must admit that, now, I cannot find the connection between the Commons CSV 
project and the CSVW standard.  I searched, but there doesn't seem to be a 
project that is using Commons CSV to support CSVW.  So, either something has 
changed over the years or I was speculating at the time (admittedly without 
saying so).

Thanks for your comments, [~ggregory].  I broadly agree with your analysis, 
although there are, perhaps, a few follow-on questions.

There is a natural asymmetry here.  With the patch from [~sigee]: when 
_writing_ a CSV file, only one String is used to represent a {{null}} value.  (I 
would imagine this is the String returned by {{CSVFormat#getNullString}}).  
However, with his patch, a {{CSVFormat}} object may be configured to accept 
multiple Strings as null values.  This would make it impossible to "round-trip" 
a file and have Commons CSV write the same content.  This isn't important for 
my use-case, but is (perhaps) a bad "smell".
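
The asymmetry can be shown with a small, library-independent sketch.  It assumes that reading accepts two null Strings (the empty string and a single dash) and that writing always emits the empty string; {{RoundTripDemo}}, {{read}} and {{write}} are hypothetical names and do not reflect the actual patch or the Commons CSV API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the read/write asymmetry; not Commons CSV code. */
final class RoundTripDemo {

    // Assumption: reading accepts several null Strings ...
    private static final Set<String> READ_NULLS = Set.of("", "-");
    // ... but writing emits exactly one of them.
    private static final String WRITE_NULL = "";

    private RoundTripDemo() {
    }

    /** Converts raw field text to values, mapping every accepted null String to null. */
    static List<String> read(List<String> rawFields) {
        List<String> values = new ArrayList<>(); // ArrayList permits null elements
        for (String raw : rawFields) {
            values.add(raw == null || READ_NULLS.contains(raw) ? null : raw);
        }
        return values;
    }

    /** Converts values back to raw field text, using the single write-side null String. */
    static List<String> write(List<String> values) {
        List<String> rawFields = new ArrayList<>();
        for (String value : values) {
            rawFields.add(value == null ? WRITE_NULL : value);
        }
        return rawFields;
    }
}
```

Reading the fields {{a}} and {{-}} and writing them back yields {{a}} and an empty field: the dash is lost, so the output differs from the input.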

AFAIK, the codebase currently assumes reading and writing are symmetric 
operations: both are handled by the same {{CSVFormat}} class.  There's 
certainly a large overlap between these two operations, but would it make sense 
to break this symmetry; for example, by allowing some aspects of {{CSVFormat}} 
to apply only when reading or only when writing?  Perhaps not for this use-case, 
but there might be other aspects of reading and writing where enforcing this 
symmetry is problematic.

Going back to my use-case, I'm also wondering whether this might be a specific 
example of a more general concept of data normalisation; for example, a 
particular field might be case insensitive, but a CSV file might contain a 
mixture of upper- and lower-case values.  A data normalisation step might 
convert all such values to their lower-case equivalent.  This could break the 
ability to "round-trip" a CSV file, but in this case there's a clear decision 
to involve a data-normalisation step.  There are also work-arounds (e.g., 
keeping the original value) if round-tripping is important.
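
One way to picture the "keep the original value" work-around is a small value holder, sketched here under the assumption that normalisation means lower-casing.  The class {{NormalisedField}} and its methods are hypothetical and exist only for this illustration.

```java
import java.util.Locale;

/** Hypothetical value holder: keeps the raw text next to its normalised form. */
final class NormalisedField {

    private final String original;
    private final String normalised;

    private NormalisedField(String original, String normalised) {
        this.original = original;
        this.normalised = normalised;
    }

    /** Lower-cases the value for comparison while remembering what the file contained. */
    static NormalisedField ofCaseInsensitive(String raw) {
        // Locale.ROOT avoids locale-specific case rules (e.g. the Turkish dotless i).
        return new NormalisedField(raw, raw.toLowerCase(Locale.ROOT));
    }

    /** The text exactly as it appeared in the CSV file; use this when writing back. */
    String original() {
        return original;
    }

    /** The lower-cased text; use this for comparisons and look-ups. */
    String normalised() {
        return normalised;
    }
}
```

An application would compare on {{normalised()}} and write {{original()}}, so the normalisation step does not prevent round-tripping.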

If I've understood [~ggregory]'s comments correctly, he's hinting more in this 
direction: pushing support for multiple null values to the application (either 
by making multiple parsing attempts with different {{CSVFormat}} objects, or by 
converting text to null in the application).  I think this is a reasonable choice.  
However, I wonder whether the Commons CSV library might play a role in any such 
data normalisation step.  (I think it currently doesn't support this, but 
I haven't checked.)

In any case, thanks again to [~sigee] and [~ggregory] for your time in 
investigating this issue.

> Add support for multiple null String values
> -------------------------------------------
>
>                 Key: CSV-293
>                 URL: https://issues.apache.org/jira/browse/CSV-293
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>            Reporter: Paul Millar
>            Priority: Minor
>
> The [CSVW namespace|https://www.w3.org/ns/csvw] provides metadata describing 
> a CSV file.  One element of this is the ability to associate certain 
> values with the {{null}} value, as recorded by the [csvw:null 
> property|https://www.w3.org/ns/csvw#property-definitions].
> This definition corresponds (broadly) to the "null String" concept (see 
> [org.apache.commons.csv.CSVFormat#setNullString|http://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.Builder.html#setNullString-java.lang.String-]),
>  with one noticeable difference: {{CSVFormat}} supports only a single "null 
> String" value while CSVW, through {{csvw:null}}, supports multiple Strings.
> In order to fully support CSVW, it would be helpful if {{CSVFormat}} were to 
> be updated to allow multiple null String values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
