[ 
https://issues.apache.org/jira/browse/CSV-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874954#comment-17874954
 ] 

Paul Millar commented on CSV-293:
---------------------------------

Hi [~ggregory],

I dug out an old laptop from this time and found some more context.

The trigger for this issue was my work with [RML|http://rml.io], which 
is a language for describing how to map different structured and 
semi-structured source information (JSON, CSV, XML, ...) to an equivalent 
semantic web description.  I was interested in using RML to describe how to 
convert data about EC funded projects (provided as CSV files) to an equivalent 
RDF-based description.

The RML project includes some tools to support the language, such as [RML 
mapper|https://github.com/RMLio/rmlmapper-java].  RML mapper is written in Java 
and uses Commons CSV (or OpenCSV?) to support parsing CSV files as input data 
for the mapping.

Back in 2021, RML mapper supported some of CSVW, but lacked support for 
{{csvw:null}}.  I created a [GitHub pull 
request|https://github.com/RMLio/rmlmapper-java/pull/138] that added partial 
support for {{csvw:null}}.  The commit message mentioned the lack of support 
for multiple null values, and I opened this issue as a way to report this 
observation "upstream".

To be honest, I don't remember whether the EC-funding CSV files have multiple 
null-like values.  It's possible each CSV file contained only a single null 
value.  However, I've worked with other data sources that suffered from having 
multiple null-like String values, so I think supporting multiple null values in 
CSVW makes sense.

From the Commons CSV [Changes 
Report|https://commons.apache.org/proper/commons-csv/changes-report.html], 
support for {{CSVParser#stream}} was added on 2021-07-24 in v1.9.0 (by 
yourself, no less!).

Unfortunately, at that time, I wasn't aware of this new feature.

However, given the ability for Commons CSV to provide a {{Stream}}, I'd suggest 
that any data normalisation step (including support for multiple null values, 
per CSVW) would most naturally be achieved in the application, by taking 
advantage of {{CSVParser#stream}} along with Java 8's functional programming 
support ({{Stream#map}} and possibly {{Stream#flatMap}}).
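
A minimal sketch of that approach, under stated assumptions: the class name {{NullNormaliser}}, the method {{normalise}} and the sample null-like strings are hypothetical and chosen only for illustration. To keep the example runnable without Commons CSV on the classpath, the rows are built by hand; in a real application the stream of rows would presumably come from something like {{parser.stream().map(CSVRecord::toList)}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: not part of Commons CSV.
final class NullNormaliser {

    /**
     * Returns a copy of the row in which every value contained in
     * nullStrings is replaced by null; all other values pass through.
     */
    static List<String> normalise(List<String> values, Set<String> nullStrings) {
        List<String> result = new ArrayList<>(values.size());
        for (String value : values) {
            result.add(nullStrings.contains(value) ? null : value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed set of null-like strings, e.g. taken from csvw:null metadata.
        Set<String> nullStrings = new HashSet<>(Arrays.asList("", "N/A", "-"));

        // Stand-in for the rows produced by a CSV parser's stream.
        Stream<List<String>> rows = Stream.of(
                Arrays.asList("1", "Alice", "N/A"),
                Arrays.asList("2", "-", "42"));

        // The normalisation is an ordinary Stream#map step in the application.
        List<List<String>> cleaned = rows
                .map(row -> normalise(row, nullStrings))
                .collect(Collectors.toList());

        cleaned.forEach(System.out::println);
    }
}
```

Because the set of null-like strings is just application data, it can be read from CSVW metadata or configuration without any change to the parser.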

Therefore, given the support for Stream, I don't think there's a need to update 
the Commons CSV API to support an application injecting some kind of lambda.  
Also, I think it would be reasonable to close this issue with the 
recommendation to use {{CSVParser#stream}} as an easy and flexible way to 
post-process parsed data.

> Add support for multiple null String values
> -------------------------------------------
>
>                 Key: CSV-293
>                 URL: https://issues.apache.org/jira/browse/CSV-293
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>            Reporter: Paul Millar
>            Priority: Minor
>
> The [CSVW namespace|https://www.w3.org/ns/csvw] provides metadata describing 
> a CSV file.  One element of this is the ability to associate certain 
> values with the {{null}} value, as recorded by the [csvw:null 
> property|https://www.w3.org/ns/csvw#property-definitions].
> This definition corresponds (broadly) to the "null String" concept (see 
> [org.apache.commons.csv.CSVFormat#setNullString|http://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.Builder.html#setNullString-java.lang.String-]),
>  with one noticeable difference: {{CSVFormat}} supports only a single "null 
> String" value while CSVW, through {{csvw:null}}, supports multiple Strings.
> In order to fully support CSVW, it would be helpful if {{CSVFormat}} were to 
> be updated to allow multiple null String values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
