[
https://issues.apache.org/jira/browse/CSV-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874954#comment-17874954
]
Paul Millar commented on CSV-293:
---------------------------------
Hi [~ggregory],
I dug out an old laptop from this time and found some more context.
The trigger for this issue was from my working with [RML|http://rml.io], which
is a language for describing how to map different structured and
semi-structured source information (JSON, CSV, XML, ...) to an equivalent
semantic web description. I was interested in using RML to describe how to
convert data about EC funded projects (provided as CSV files) to an equivalent
RDF-based description.
The RML project includes some tools to support the language, such as [RML
mapper|https://github.com/RMLio/rmlmapper-java]. RML mapper is written in Java
and uses Commons CSV (or opencsv?) to support parsing CSV files as input data
for the mapping.
Back in 2021, RMLmapper supported some of CSVW, but lacked support for
{{csvw:null}}. I created a [GitHub pull
request|https://github.com/RMLio/rmlmapper-java/pull/138] that added partial
support for {{csvw:null}}. The commit message mentioned the lack of support
for multiple null values and I opened this issue as a way to report this
observation "upstream".
To be honest, I don't remember whether the EC-funding CSV files have multiple
null-like values. It's possible each CSV file contained only a single null
value. However, I've worked with other data sources that suffered from having
multiple null-like String values, so I think supporting multiple null values in
CSVW makes sense.
From the Commons CSV [Changes
Report|https://commons.apache.org/proper/commons-csv/changes-report.html],
support for {{CSVParser#stream}} was added on 2021-07-24 with v1.9.0 (by
yourself, no less!).
Unfortunately, at that time, I wasn't aware of this new feature.
However, given the ability for Commons CSV to provide a {{Stream}}, I'd suggest
that any data normalisation step (including support for multiple null values,
per CSVW) would most naturally be achieved in the application, by taking
advantage of {{CSVParser#stream}} along with Java 8's functional programming
support ({{Stream#map}} and possibly {{Stream#flatMap}}).
Therefore, given the support for Stream, I don't think there's a need to update
the Commons CSV API to support an application injecting some kind of lambda.
Also, I think it would be reasonable to close this issue with the
recommendation to use {{CSVParser#stream}} as an easy and flexible way to
post-process parsed data.
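To make the suggestion concrete, here is a minimal sketch of that kind of
post-processing. It is only an illustration, not code from RMLmapper or
Commons CSV: the class name {{NullNormaliser}}, the marker set and the sample
rows are all hypothetical, and rows are modelled as plain {{List<String>}} so
the sketch runs without the library on the classpath. With Commons CSV 1.9.0 or
later, I would expect the row stream to come from {{CSVParser#stream}} mapped
through {{CSVRecord#toList}} instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class NullNormaliser {
    // Hypothetical null-like markers, e.g. as listed by csvw:null in the metadata.
    static final Set<String> NULL_MARKERS =
            new HashSet<>(Arrays.asList("", "N/A", "NULL", "-"));

    // Replace every value that matches one of the markers with a real null.
    static List<String> normaliseRow(List<String> row, Set<String> nullMarkers) {
        List<String> out = new ArrayList<>(row.size());
        for (String value : row) {
            out.add(value == null || nullMarkers.contains(value) ? null : value);
        }
        return out;
    }

    // Apply the per-row normalisation with Stream#map and collect the result.
    static List<List<String>> normalise(Stream<List<String>> rows, Set<String> nullMarkers) {
        return rows.map(row -> normaliseRow(row, nullMarkers))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumption: with Commons CSV the rows would be obtained via
        // parser.stream().map(CSVRecord::toList) rather than built by hand.
        Stream<List<String>> rows = Stream.of(
                Arrays.asList("1", "Project A", "N/A"),
                Arrays.asList("2", "-", "2021"));
        System.out.println(normalise(rows, NULL_MARKERS));
    }
}
```

The set of markers stays entirely in the application, so it can be populated
from whatever the CSVW metadata declares without any change to {{CSVFormat}}.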
> Add support for multiple null String values
> -------------------------------------------
>
> Key: CSV-293
> URL: https://issues.apache.org/jira/browse/CSV-293
> Project: Commons CSV
> Issue Type: Improvement
> Components: Parser
> Reporter: Paul Millar
> Priority: Minor
>
> The [CSVW namespace|https://www.w3.org/ns/csvw] provides metadata describing
> a CSV file. One element of this is the ability to associate certain
> values with the {{null}} value, as recorded by the [csvw:null
> property|https://www.w3.org/ns/csvw#property-definitions].
> This definition corresponds (broadly) to the "null String" concept (see
> [org.apache.commons.csv.CSVFormat#setNullString|http://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.Builder.html#setNullString-java.lang.String-]),
> with one noticeable difference: {{CSVFormat}} supports only a single "null
> String" value while CSVW, through {{csvw:null}}, supports multiple Strings.
> In order to fully support CSVW, it would be helpful if {{CSVFormat}} were to
> be updated to allow multiple null String values.