[jira] [Commented] (ANY23-413) CSV Extractor attempts to be too smart

Hans Brende (JIRA) Tue, 30 Oct 2018 18:56:36 -0700


    [ 
https://issues.apache.org/jira/browse/ANY23-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669492#comment-16669492
 ]


Hans Brende commented on ANY23-413:
-----------------------------------

One possible fix for this would be to:

(1) Buffer a few kilobytes of input before writing out the first triple (which 
we already do to detect the column-separator character).
(2) Use the *least* specific possible datatype found in the buffered input for 
a column as the *most* specific datatype we will assign to items in that column.
(3) If we don't get enough representative samples for a column in the few 
kilobytes that we buffer to be reasonably confident in our choice of datatype, 
fall back to string.
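
As a rough illustration of steps (2) and (3), here is a minimal sketch. It is 
not the Any23 implementation: the class name, the Kind enum, inferColumn(), and 
the MIN_SAMPLES threshold are all hypothetical, and the per-cell check only 
approximates the "try integer, then float, else string" behavior described in 
the issue.

```java
import java.util.Arrays;
import java.util.List;

class ColumnTypeSketch {

    // Hypothetical datatype lattice, ordered from most to least specific.
    enum Kind { INTEGER, FLOAT, STRING }

    // Assumed threshold: fewer non-empty samples than this means "not confident".
    static final int MIN_SAMPLES = 3;

    // Per-cell guess: try integer, then float, else string.
    // Note: Double.parseDouble is lenient (it also accepts "NaN" or "1e3"),
    // so a real extractor would likely want a stricter check.
    static Kind classify(String cell) {
        String s = cell.trim();
        try {
            Long.parseLong(s);
            return Kind.INTEGER;
        } catch (NumberFormatException ignored) {
            // not an integer, fall through
        }
        try {
            Double.parseDouble(s);
            return Kind.FLOAT;
        } catch (NumberFormatException ignored) {
            // not a float either
        }
        return Kind.STRING;
    }

    // Step (2): the least specific kind seen in the buffered cells becomes the
    // most specific kind assigned to the column.
    // Step (3): with too few samples, fall back to string.
    static Kind inferColumn(List<String> bufferedCells) {
        Kind leastSpecific = Kind.INTEGER;
        int samples = 0;
        for (String cell : bufferedCells) {
            if (cell == null || cell.trim().isEmpty()) {
                continue; // empty cells carry no evidence
            }
            samples++;
            Kind kind = classify(cell);
            if (kind.compareTo(leastSpecific) > 0) {
                leastSpecific = kind;
            }
        }
        return samples < MIN_SAMPLES ? Kind.STRING : leastSpecific;
    }

    public static void main(String[] args) {
        // The version-number column from the issue description.
        System.out.println(inferColumn(Arrays.asList("4", "4.1", "4.1.1"))); // prints STRING
    }
}
```

With this approach the whole version-number column from the description would 
be typed as string, rather than getting a different datatype per cell. It still 
cannot guarantee correctness for rows beyond the buffer.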

> CSV Extractor attempts to be too smart
> --------------------------------------
>
>                 Key: ANY23-413
>                 URL: https://issues.apache.org/jira/browse/ANY23-413
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: extractors
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Priority: Minor
>             Fix For: 2.3
>
>
> Currently, our CSV extractor tries to figure out what the datatype of each 
> cell is simply by attempting to parse a float or integer from the cell and 
> falling back on "string".
> This is problematic because cells that look like numbers may not, in fact, be 
> numbers.
> Consider a column of version numbers, such as:
> 4
> 4.1
> 4.1.1
> etc.
> Currently our CSV extractor will assign the following datatypes to this 
> column:
> 4 -> integer
> 4.1 -> float
> 4.1.1 -> string
> We could improve this guessing ability by parsing the entire column before 
> assigning a datatype, and then using the least-specific datatype encountered. 
> However, this solution would also be problematic because then we'd have to 
> hold the entire table in memory before generating any triples. And it still 
> wouldn't guarantee correctness.
> Without structured data telling us what the original datatype was, I don't 
> think assigning any datatypes other than "string" to string values is 
> worthwhile.
> Another problem is that the extractor strips leading and trailing whitespace 
> from all values, including string values. While this behavior probably 
> wouldn't present a problem for most use-cases, it does mean that the 
> algorithm is lossy.
> Cf. ANY23-218



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)