Hans Brende created ANY23-413:
---------------------------------
Summary: CSV Extractor attempts to be too smart
Key: ANY23-413
URL: https://issues.apache.org/jira/browse/ANY23-413
Project: Apache Any23
Issue Type: Bug
Components: extractors
Affects Versions: 2.3
Reporter: Hans Brende
Fix For: 2.3
Currently, our CSV extractor guesses the datatype of each cell simply by
attempting to parse a float or integer from the cell, falling back on
"string" if neither parse succeeds.
This is problematic because cells that look like numbers may not, in fact, be
numbers.
Consider a column of version numbers, such as:
4
4.1
4.1.1
etc.
Currently, our CSV extractor will assign the following datatypes to this column:
4 -> integer
4.1 -> float
4.1.1 -> string
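To make the behavior concrete, here is a minimal sketch of per-cell guessing.
It is not the actual extractor code: the class name CellTypeGuesser, the method
guess, and the use of Integer.parseInt / Float.parseFloat are assumptions made
for illustration only.

```java
// Hypothetical sketch of per-cell datatype guessing; NOT the Any23 source.
final class CellTypeGuesser {

    // Returns "integer", "float", or "string" for a single cell value.
    static String guess(String cell) {
        String value = cell.trim();
        try {
            Integer.parseInt(value);
            return "integer";
        } catch (NumberFormatException notAnInteger) {
            // Not an integer; try a float next.
        }
        try {
            Float.parseFloat(value);
            return "float";
        } catch (NumberFormatException notAFloat) {
            // Neither parse succeeded, so fall back on "string".
            return "string";
        }
    }
}
```

Each cell is judged in isolation, so the three version numbers above end up
with three different datatypes within the same column.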
We could improve this guessing ability by parsing the entire column before
assigning a datatype, and then using the least-specific datatype encountered.
However, this solution would also be problematic because then we'd have to hold
the entire table in memory before generating any triples. And it still wouldn't
guarantee correctness.
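A rough sketch of that column-wide approach, to show where the cost comes from.
The names ColumnTypeGuesser, Type, guessCell, and guessColumn are hypothetical,
and the ordering INTEGER < FLOAT < STRING (most to least specific) is an
assumption.

```java
import java.util.List;

// Hypothetical sketch of column-wide datatype guessing; NOT the Any23 source.
final class ColumnTypeGuesser {

    // Declared from most specific to least specific.
    enum Type { INTEGER, FLOAT, STRING }

    static Type guessCell(String cell) {
        String value = cell.trim();
        try {
            Integer.parseInt(value);
            return Type.INTEGER;
        } catch (NumberFormatException notAnInteger) {
            // Not an integer; try a float next.
        }
        try {
            Float.parseFloat(value);
            return Type.FLOAT;
        } catch (NumberFormatException notAFloat) {
            return Type.STRING;
        }
    }

    // Needs every cell of the column up front, which is the memory cost.
    // An empty column yields INTEGER in this sketch.
    static Type guessColumn(List<String> column) {
        Type leastSpecific = Type.INTEGER;
        for (String cell : column) {
            Type guessed = guessCell(cell);
            if (guessed.compareTo(leastSpecific) > 0) {
                leastSpecific = guessed;
            }
        }
        return leastSpecific;
    }
}
```

With this sketch, the column 4, 4.1, 4.1.1 comes out as STRING, but a column
that happens to contain only 4 and 4.1 still comes out as FLOAT, even if the
values are version numbers.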
Without structured data telling us what the original datatype was, I don't
think assigning any datatypes other than "string" to string values is
worthwhile.
Another problem is that the extractor strips leading and trailing whitespace
from all values, including string values. While this behavior probably wouldn't
present a problem for most use cases, it does mean that the algorithm is lossy.
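A small sketch of what "lossy" means here, assuming the stripping behaves like
String.trim(). The class TrimLossDemo and its method survivesTrim are
hypothetical names used only for illustration.

```java
// Hypothetical illustration of lossy whitespace stripping; NOT the Any23 source.
final class TrimLossDemo {

    // True only if trimming leaves the value unchanged, i.e. the original
    // cell can still be recovered from the extracted value.
    static boolean survivesTrim(String value) {
        return value.trim().equals(value);
    }
}
```

Under that assumption, the cells "4.1" and " 4.1 " produce the same extracted
value, so the original padding cannot be reconstructed from the output.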
Cf. ANY23-218