[
https://issues.apache.org/jira/browse/CSV-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874870#comment-17874870
]
Gary D. Gregory commented on CSV-293:
-------------------------------------
Hello [~paulmillar]
Thank you for your comments and for providing insightful details.
{quote}
For example, this could be represented by an empty string, by some white-space,
by a single dash (-), etc.
{quote}
In general, from file to file, or within a single file? If it's within a single
file... yikes. If the setting differs from file to file, I would use a
different CSVFormat specialized for each.
The breaking of roundtripping (for lack of a better term) is not great. I am
more concerned with introducing multi-valued settings. Adding to my previous
statement: I don't count "header comments" and "headers" since those occur once
per file instead of possibly once per line or value, as a null definition
would; furthermore, these two are not an issue in a roundtrip.
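To make the roundtrip concern concrete, here is a minimal sketch in plain
Java. It is not Commons CSV code: the class {{NullRoundTrip}}, its methods, and
the chosen tokens are all hypothetical and only illustrate the problem. Several
tokens collapse to {{null}} on read, but a writer can emit only one token for
{{null}}, so the original spelling is lost.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical illustration only; none of this is Commons CSV API. */
final class NullRoundTrip {

    /** Assumed set of tokens that all mean "null" when reading. */
    static final Set<String> READ_NULL_TOKENS = Set.of("", "-", "N/A");

    /** The single token a writer has to pick when it emits a null. */
    static final String WRITE_NULL_TOKEN = "";

    /** Reading: any of the configured tokens collapses to null. */
    static List<String> read(List<String> rawValues) {
        return rawValues.stream()
                .map(v -> READ_NULL_TOKENS.contains(v) ? null : v)
                .collect(Collectors.toList());
    }

    /** Writing: every null becomes the one configured token. */
    static List<String> write(List<String> values) {
        return values.stream()
                .map(v -> v == null ? WRITE_NULL_TOKEN : v)
                .collect(Collectors.toList());
    }
}
```

With these assumptions, reading the raw values "a", "-", "N/A" and then
writing them back yields "a" followed by two empty strings, so the output no
longer matches the input.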
I think we (or at least I) have considered refactoring or creating a new
hierarchy where you'd use a {{CSVReadFormat}} and {{CSVWriteFormat}} with an
abstract superclass, but that does not seem workable in general because
different formats might provide settings that conflict with each other (maybe
setting Foo is a read-only setting in format A and a write-only setting in
format B).
{quote}
Going back to my use-case, I'm also wondering whether this might be a specific
example of a more general concept of data normalisation; for example, a
particular field might be case insensitive, but a CSV file might contain a
mixture of upper- and lower-case values. A data normalisation step might
convert all such values to their lower-case equivalent.
{quote}
We've avoided data processing in the past and pointed to using JDBC, SQL, and
whatever JDBC Driver might match specific data processing needs. IMO, this has
been the right decision for this component, especially considering its small
and simple footprint. That said, I could imagine (maybe) allowing a lambda to
be plugged in here or there, to allow for some processing customization. Such a
feature shouldn't degrade performance, though.
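As a rough sketch of what such a pluggable lambda could look like, the plain
Java below applies a list of {{UnaryOperator<String>}} hooks to each value of
a record. Everything here is an assumption made for illustration: the class
{{ValueHooks}} and its methods do not exist in Commons CSV, and where a real
hook would be invoked is left open.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of per-value hooks; not a Commons CSV type. */
final class ValueHooks {

    /** Returns a hook that maps any of the given tokens to null. */
    static UnaryOperator<String> nullTokens(String... tokens) {
        Set<String> set = new HashSet<>(Arrays.asList(tokens));
        return v -> v != null && set.contains(v) ? null : v;
    }

    /** Returns a hook that lower-cases non-null values. */
    static UnaryOperator<String> lowerCase() {
        return v -> v == null ? null : v.toLowerCase(Locale.ROOT);
    }

    /** Applies the hooks, in order, to a copy of one record's values. */
    static String[] apply(String[] record, List<UnaryOperator<String>> hooks) {
        String[] out = record.clone();
        for (int i = 0; i < out.length; i++) {
            for (UnaryOperator<String> hook : hooks) {
                out[i] = hook.apply(out[i]);
            }
        }
        return out;
    }
}
```

For example, applying {{nullTokens("", "-")}} followed by {{lowerCase()}} to
the values "ABC", "-", "" and "x-Y" gives "abc", null, null and "x-y". With an
empty hook list the values are copied unchanged, which is one way to keep the
cost low when the feature is not used.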
If providing pluggable processing through lambdas is not enough, an alternative
would be to allow for subclassing and perhaps refactor the code into more
protected methods a custom subclass could use.
> Add support for multiple null String values
> -------------------------------------------
>
> Key: CSV-293
> URL: https://issues.apache.org/jira/browse/CSV-293
> Project: Commons CSV
> Issue Type: Improvement
> Components: Parser
> Reporter: Paul Millar
> Priority: Minor
>
> The [CSVW namespace|https://www.w3.org/ns/csvw] provides metadata describing
> a CSV file. One element of this is the ability to associate certain
> values with the {{null}} value, as recorded by the [csvw:null
> property|https://www.w3.org/ns/csvw#property-definitions].
> This definition corresponds (broadly) to the "null String" concept (see
> [org.apache.commons.csv.CSVFormat#setNullString|http://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.Builder.html#setNullString-java.lang.String-]),
> with one noticeable difference: {{CSVFormat}} supports only a single "null
> String" value while CSVW, through {{csvw:null}}, supports multiple Strings.
> In order to fully support CSVW, it would be helpful if {{CSVFormat}} were to
> be updated to allow multiple null String values.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)