Nicholas Chammas created SPARK-47180:
----------------------------------------
Summary: Migrate CSV parsing off of Univocity
Key: SPARK-47180
URL: https://issues.apache.org/jira/browse/SPARK-47180
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Nicholas Chammas
Univocity appears to be unmaintained.
As of February 2024:
* The last release was [more than 3 years
ago|https://github.com/uniVocity/univocity-parsers/releases].
* The last commit to {{master}} was [almost 3 years
ago|https://github.com/uniVocity/univocity-parsers/commits/master/].
* The website is
[down|https://github.com/uniVocity/univocity-parsers/issues/506].
* There are
[multiple|https://github.com/uniVocity/univocity-parsers/issues/494]
[open|https://github.com/uniVocity/univocity-parsers/issues/495]
[bugs|https://github.com/uniVocity/univocity-parsers/issues/499] on the tracker
with no indication that anyone cares.
It's not urgent, but we should consider migrating to an actively maintained CSV
library in the JVM ecosystem.
There are a bunch of candidate libraries [listed on Maven
Repository|https://mvnrepository.com/open-source/csv-libraries].
[jackson-dataformats-text|https://github.com/FasterXML/jackson-dataformats-text]
looks interesting. I know we already use FasterXML's Jackson to parse JSON.
Perhaps we should use it to parse CSV as well.
I'm guessing we chose Univocity back in the day because it was the fastest CSV
library on the JVM. However, the last performance benchmark comparing it to
others was [from February
2018|https://github.com/uniVocity/csv-parsers-comparison/blob/5548b52f2cc27eb19c11464e9a331491e8ad4ba6/README.md#statistics-updated-28th-of-february-2018],
so this may no longer be true.