[jira] [Created] (ANY23-413) CSV Extractor attempts to be too smart

Hans Brende (JIRA) Mon, 29 Oct 2018 10:25:27 -0700

Hans Brende created ANY23-413:
---------------------------------

             Summary: CSV Extractor attempts to be too smart
                 Key: ANY23-413
                 URL: https://issues.apache.org/jira/browse/ANY23-413
             Project: Apache Any23
          Issue Type: Bug
          Components: extractors
    Affects Versions: 2.3
            Reporter: Hans Brende
             Fix For: 2.3



Currently, our CSV extractor guesses the datatype of each cell by attempting 
to parse an integer or a float from the cell, and falls back on "string" when 
neither parse succeeds.

This is problematic because cells that look like numbers may not, in fact, be 
numbers.

Consider a column of version numbers, such as:
4
4.1
4.1.1
etc.

Currently our CSV extractor will assign the following datatypes to this column:
4 -> integer
4.1 -> float
4.1.1 -> string
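
The per-cell guessing described above can be sketched roughly as follows. This 
is an illustration in Python, not the extractor's actual Java code; the 
function name guess_datatype is hypothetical, and Python's int() and float() 
accept slightly different inputs than the parsers the extractor uses.

```python
def guess_datatype(cell: str) -> str:
    """Guess a datatype for one cell in isolation (hypothetical helper)."""
    value = cell.strip()
    try:
        int(value)          # try the most specific type first
        return "integer"
    except ValueError:
        pass
    try:
        float(value)        # then the next most specific type
        return "float"
    except ValueError:
        return "string"     # fall back when neither parse succeeds


column = ["4", "4.1", "4.1.1"]
guesses = [guess_datatype(cell) for cell in column]
print(guesses)
```

Because each cell is judged on its own, a single column of version numbers 
ends up with three different datatypes.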

We could improve this guessing ability by parsing the entire column before 
assigning a datatype, and then using the least-specific datatype encountered. 
However, this solution would also be problematic because then we'd have to hold 
the entire table in memory before generating any triples. And it still wouldn't 
guarantee correctness.
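
A minimal sketch of that column-wise alternative, assuming the ordering 
integer < float < string from most to least specific. The names RANK, 
guess_datatype and guess_column_datatype are hypothetical, and the code is 
Python for illustration only, not the extractor's own code.

```python
# Higher rank means a less specific datatype (assumed ordering).
RANK = {"integer": 0, "float": 1, "string": 2}


def guess_datatype(cell):
    """Per-cell guess: integer, then float, else string."""
    for parse, name in ((int, "integer"), (float, "float")):
        try:
            parse(cell.strip())
            return name
        except ValueError:
            continue
    return "string"


def guess_column_datatype(cells):
    """Return the least-specific datatype guessed for any cell in the column.

    The whole column must be available before a result can be returned,
    which is the memory cost mentioned above.
    """
    if not cells:
        return "string"
    return max((guess_datatype(c) for c in cells), key=RANK.__getitem__)


print(guess_column_datatype(["4", "4.1", "4.1.1"]))
print(guess_column_datatype(["4", "4.1"]))
```

The first call yields "string" for the version-number column, but the second 
yields "float" for a column that happens to contain only the versions 4 and 
4.1, which shows why this approach still cannot guarantee correctness.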

Without structured data telling us what the original datatype was, I don't 
think assigning any datatypes other than "string" to string values is 
worthwhile.

Another problem is that the extractor strips leading and trailing whitespace 
from all values, including string values. While this behavior probably wouldn't 
present a problem for most use cases, it does mean that the algorithm is lossy.
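
As a small illustration of that lossiness (Python, with made-up cell values, 
not taken from the extractor's code): three distinct cells become 
indistinguishable once trimmed, so the original input cannot be reconstructed 
from the output.

```python
cells = ["  padded", "padded", "padded  "]    # three different original values
trimmed = [cell.strip() for cell in cells]    # what trimming would leave behind

distinct_before = len(set(cells))
distinct_after = len(set(trimmed))
print(distinct_before, distinct_after)
```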

Cf. ANY23-218





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
