[ https://issues.apache.org/jira/browse/ANY23-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669508#comment-16669508 ]
Hans Brende commented on ANY23-413:
-----------------------------------
Also see:
https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
> CSV Extractor attempts to be too smart
> --------------------------------------
>
> Key: ANY23-413
> URL: https://issues.apache.org/jira/browse/ANY23-413
> Project: Apache Any23
> Issue Type: Bug
> Components: extractors
> Affects Versions: 2.3
> Reporter: Hans Brende
> Priority: Minor
> Fix For: 2.3
>
>
> Currently, our CSV extractor tries to figure out what the datatype of each
> cell is simply by attempting to parse a float or integer from the cell and
> falling back on "string".
> This is problematic because cells that look like numbers may not, in fact, be
> numbers.
> Consider a column of version numbers, such as:
> 4
> 4.1
> 4.1.1
> etc.
> Currently, our CSV extractor will assign the following datatypes to this
> column:
> 4 -> integer
> 4.1 -> float
> 4.1.1 -> string
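
The per-cell guessing described above can be sketched roughly as follows. This is a minimal Python illustration, not Any23's actual Java code; the name guess_cell_type and the datatype labels are assumptions made for the example.

```python
def guess_cell_type(cell: str) -> str:
    """Guess a datatype for a single cell: integer, then float, else string.

    Hypothetical helper for illustration only; not part of Any23.
    """
    value = cell.strip()
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


# The version-number column from the issue description.
versions = ["4", "4.1", "4.1.1"]
guesses = {cell: guess_cell_type(cell) for cell in versions}
print(guesses)
```

Each cell is typed in isolation, so one column of version numbers ends up with three different datatypes.
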
> We could improve the guess by parsing the entire column before assigning a
> datatype, and then using the least-specific datatype encountered. However,
> this would also be problematic, because we'd have to hold the entire table
> in memory before generating any triples. And it still wouldn't guarantee
> correctness: a column containing only 4 and 4.1 would be typed as float even
> if its values are version numbers.
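
For comparison, a column-wide variant might look like the sketch below. It is again illustrative Python under assumed names (guess_cell_type, guess_column_type, RANK), not an existing Any23 API. It picks the least-specific datatype seen in the column, which means every cell has to be read before any cell can be emitted.

```python
# Rank of each datatype label, from most specific to least specific.
RANK = {"integer": 0, "float": 1, "string": 2}


def guess_cell_type(cell: str) -> str:
    """Hypothetical per-cell guess: integer, then float, else string."""
    value = cell.strip()
    for label, parse in (("integer", int), ("float", float)):
        try:
            parse(value)
            return label
        except ValueError:
            continue
    return "string"


def guess_column_type(cells: list[str]) -> str:
    """Return the least-specific datatype found anywhere in the column."""
    if not cells:
        return "string"  # assumption: an empty column defaults to string
    return max((guess_cell_type(c) for c in cells), key=RANK.__getitem__)


full_column = guess_column_type(["4", "4.1", "4.1.1"])
short_column = guess_column_type(["4", "4.1"])
print(full_column, short_column)
```

The second call shows the remaining problem: without the 4.1.1 row, the same version-number column is still guessed to be numeric.
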
> Without structured data telling us what the original datatype was, I don't
> think assigning any datatypes other than "string" to string values is
> worthwhile.
> Another problem is that the extractor strips leading and trailing whitespace
> from all values, including string values. While this behavior probably
> wouldn't present a problem for most use cases, it does mean that the
> algorithm is lossy.
> Cf. ANY23-218
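
To illustrate why stripping whitespace is lossy, here is a small Python sketch using the standard csv module. The sample row is made up for the example. Two cells that differ in the source become indistinguishable once stripped, so the original values cannot be recovered from the output.

```python
import csv
import io

# Hypothetical input: the first field is quoted and padded with spaces.
raw = '" 4.1 ",4.1\n'

row = next(csv.reader(io.StringIO(raw)))
stripped = [cell.strip() for cell in row]

# Count distinct values before and after stripping.
distinct_before = len(set(row))
distinct_after = len(set(stripped))
print(row, stripped, distinct_before, distinct_after)
```
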