[jira] [Commented] (ANY23-413) CSV Extractor attempts to be too smart

Hans Brende (JIRA) Tue, 30 Oct 2018 19:22:01 -0700


    [ https://issues.apache.org/jira/browse/ANY23-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669508#comment-16669508 ]


Hans Brende commented on ANY23-413:
-----------------------------------

Also see: 
https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164

> CSV Extractor attempts to be too smart
> --------------------------------------
>
>                 Key: ANY23-413
>                 URL: https://issues.apache.org/jira/browse/ANY23-413
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: extractors
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Priority: Minor
>             Fix For: 2.3
>
>
> Currently, our CSV extractor tries to figure out what the datatype of each 
> cell is simply by attempting to parse a float or integer from the cell and 
> falling back on "string".
> This is problematic because cells that look like numbers may not, in fact, be 
> numbers.
> Consider a column of version numbers, such as:
> 4
> 4.1
> 4.1.1
> etc.
> Currently, our CSV extractor will assign the following datatypes to this 
> column:
> 4 -> integer
> 4.1 -> float
> 4.1.1 -> string
> We could improve this guessing ability by parsing the entire column before 
> assigning a datatype, and then using the least-specific datatype encountered. 
> However, this solution would also be problematic because then we'd have to 
> hold the entire table in memory before generating any triples. And it still 
> wouldn't guarantee correctness: a version column containing only 4 and 4.1 
> would still be typed as float.
> Without structured data telling us what the original datatype was, I don't 
> think assigning any datatypes other than "string" to string values is 
> worthwhile.
> Another problem is that the extractor strips leading and trailing whitespace 
> from all values, including string values. While this behavior probably 
> wouldn't present a problem for most use-cases, it does mean that the 
> algorithm is lossy.
> Cf. ANY23-218
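
The guessing behavior described in the issue can be reproduced with a short sketch. This is illustrative Python, not Any23's actual Java code. The names SPECIFICITY, guess_cell_type and guess_column_type are hypothetical, and Python's int() and float() accept some inputs (such as "inf" or "1_000") that a Java parser may reject, so treat it only as an approximation of the per-cell approach and of the proposed "least-specific datatype" alternative.

```python
from typing import Iterable, List

# Hypothetical datatype names, ordered from most to least specific.
SPECIFICITY: List[str] = ["integer", "float", "string"]


def guess_cell_type(cell: str) -> str:
    """Guess a datatype for a single cell by attempting to parse it."""
    text = cell.strip()  # mirrors the (lossy) whitespace stripping in the issue
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "string"


def guess_column_type(cells: Iterable[str]) -> str:
    """Return the least-specific datatype found anywhere in the column.

    Every cell has to be inspected before an answer is available, which is
    the memory cost the issue points out. An empty column falls back to
    "string" (an assumption made for this sketch).
    """
    ranks = [SPECIFICITY.index(guess_cell_type(cell)) for cell in cells]
    return SPECIFICITY[max(ranks)] if ranks else "string"


versions = ["4", "4.1", "4.1.1"]
per_cell = [guess_cell_type(v) for v in versions]
whole_column = guess_column_type(versions)
```

Here per_cell reproduces the integer / float / string split from the issue, while whole_column is "string". Dropping "4.1.1" from the list makes the column-wide guess "float", so a column of version numbers can still be mistyped even after reading every cell.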



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
