[jira] [Commented] (CSV-293) Add support for multiple null String values

Gary D. Gregory (Jira) Mon, 19 Aug 2024 05:32:05 -0700


    [ 
https://issues.apache.org/jira/browse/CSV-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874870#comment-17874870
 ]


Gary D. Gregory commented on CSV-293:
-------------------------------------

Hello [~paulmillar]

Thank you for your comments and for providing insightful details.

{quote}
For example, this could be represented by an empty string, by some white-space, 
by a single dash \(-), etc. 
{quote}

In general, from file to file, or within a single file? If it's within a single 
file... yikes. If the setting differs from file to file, I would use a 
different CSVFormat specialized for each.

The breaking of roundtripping (for lack of a better term) is not great. I am 
more concerned with introducing multi-valued settings. Adding to my previous 
statement: I don't count "header comments" and "headers" since those occur once 
per file instead of possibly once per line or value, as a null definition 
would; furthermore, these two are not an issue in a roundtrip.

I think we (or at least I) have considered refactoring or creating a new 
hierarchy where you'd use a {{CSVReadFormat}} and {{CSVWriteFormat}} with an 
abstract superclass, but that does not seem workable in general because 
different formats might provide settings that conflict with each other (maybe 
setting Foo is a read-only setting in format A and a write-only setting in 
format B).

{quote}
Going back to my use-case, I'm also wondering whether this might be a specific 
example of a more general concept of data normalisation; for example, a 
particular field might be case insensitive, but a CSV file might contain a 
mixture of upper- and lower-case values. A data normalisation step might 
convert all such values to their lower-case equivalent.
{quote}

We've avoided data processing in the past and pointed to using JDBC, SQL, and 
whatever JDBC Driver might match specific data processing needs. IMO, this has 
been the right decision for this component, especially considering its small 
and simple footprint. That said, I could imagine (maybe) allowing a lambda to 
be plugged in here or there, to allow for some processing customization. Such a 
feature shouldn't degrade performance, though. 
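
To make the lambda idea concrete, here is a minimal sketch of what such a hook might look like if it were applied to values after parsing. It uses only the JDK; {{NullNormalizer}}, {{nullsFor}} and {{apply}} are hypothetical names for illustration and are not part of Commons CSV, and nothing here reflects a planned API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration only; not part of Commons CSV. */
final class NullNormalizer {

    private NullNormalizer() {
        // static helpers only
    }

    /** Builds a lambda that maps any of the given marker strings to null. */
    static UnaryOperator<String> nullsFor(final String... markers) {
        final Set<String> nullMarkers = new HashSet<>(Arrays.asList(markers));
        // Values that are already null stay null; everything else passes through unchanged.
        return value -> value == null || nullMarkers.contains(value) ? null : value;
    }

    /** Applies the lambda to every value of one already-parsed record. */
    static List<String> apply(final List<String> values, final UnaryOperator<String> normalizer) {
        // Collectors.toList() is backed by an ArrayList, so null elements are allowed.
        return values.stream().map(normalizer).collect(Collectors.toList());
    }
}
```

A caller would build the lambda once, for example {{NullNormalizer.nullsFor("", " ", "-")}}, and apply it to the values of each record. Note that this only covers reading: when writing, there is still no way to know which of the markers a given null originally came from, so the roundtrip concern above remains.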

If providing pluggable processing through lambdas is not enough, an alternative 
would be to allow for subclassing and perhaps refactor the code into more 
protected methods a custom subclass could use. 


> Add support for multiple null String values
> -------------------------------------------
>
>                 Key: CSV-293
>                 URL: https://issues.apache.org/jira/browse/CSV-293
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>            Reporter: Paul Millar
>            Priority: Minor
>
> The [CSVW namespace|https://www.w3.org/ns/csvw] provides metadata describing 
> a CSV file.  One element of this is the ability to associate certain 
> values with the {{null}} value, as recorded by the [csvw:null 
> property|https://www.w3.org/ns/csvw#property-definitions].
> This definition corresponds (broadly) to the "null String" concept (see 
> [org.apache.commons.csv.CSVFormat#setNullString|http://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.Builder.html#setNullString-java.lang.String-]),
>  with one noticeable difference: {{CSVFormat}} supports only a single "null 
> String" value while CSVW, through {{csvw:null}}, supports multiple Strings.
> In order to fully support CSVW, it would be helpful if {{CSVFormat}} were to 
> be updated to allow multiple null String values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
