[ https://issues.apache.org/jira/browse/ANY23-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669492#comment-16669492 ]
Hans Brende commented on ANY23-413:
-----------------------------------
One possible fix for this would be to:
(1) Buffer a few kilobytes of input before writing out the first triple
(which we are already doing to detect the column-separator character).
(2) Treat the *least* specific datatype found in the buffered input for a
column as the *most* specific datatype we will assign to any item in that
column.
(3) If the few kilobytes we buffer do not contain enough representative
samples for a column to be reasonably confident in our choice of datatype,
fall back to string for that column.
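
The three steps above can be sketched roughly as follows. This is an
illustrative Python sketch, not Any23's actual (Java) implementation: the
names read_sample, cell_type, column_types and min_samples are hypothetical,
the 4096-character buffer and the threshold of three samples are arbitrary
assumptions, and the sketch assumes the first row is a header and that no
quoted field contains a newline.

```python
import csv
import io
import re

# Specificity ranking (assumed for this sketch): a higher rank is less specific.
RANK = {"integer": 0, "float": 1, "string": 2}

INT_RE = re.compile(r"[+-]?\d+")
FLOAT_RE = re.compile(r"[+-]?\d+\.\d+")


def read_sample(stream, size=4096):
    """Step (1): buffer up to `size` characters of input.

    If the buffer was filled completely, the last line may be cut off
    (e.g. "4.1.1" truncated to "4.1"), so it is dropped from the sample.
    """
    chunk = stream.read(size)
    if len(chunk) == size and "\n" in chunk:
        chunk = chunk[: chunk.rfind("\n") + 1]
    return chunk


def cell_type(value):
    """Guess the datatype of a single cell, as a per-cell guesser would."""
    text = value.strip()
    if INT_RE.fullmatch(text):
        return "integer"
    if FLOAT_RE.fullmatch(text):
        return "float"
    return "string"


def column_types(sample, min_samples=3, delimiter=","):
    """Steps (2) and (3): choose one datatype per column from the sample."""
    rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    types = []
    for i in range(len(header)):
        # Only non-empty cells count as samples for this column.
        cells = [row[i] for row in body if i < len(row) and row[i].strip()]
        if len(cells) < min_samples:
            # Step (3): too few samples to be confident, fall back to string.
            types.append("string")
            continue
        # Step (2): the least specific type seen caps the whole column.
        types.append(max((cell_type(c) for c in cells), key=RANK.__getitem__))
    return types
```

With the version-number column from the issue description, the sample
"name,version,count\na,4,1\nb,4.1,2\nc,4.1.1,3\n" gives
["string", "string", "integer"]: the single "4.1.1" cell caps the whole
version column at string, so "4" and "4.1" are no longer typed differently
from it. Rows read after the buffered sample would still have to be checked
against the chosen type, since the sample cannot guarantee correctness for the
rest of the file.
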
> CSV Extractor attempts to be too smart
> --------------------------------------
>
> Key: ANY23-413
> URL: https://issues.apache.org/jira/browse/ANY23-413
> Project: Apache Any23
> Issue Type: Bug
> Components: extractors
> Affects Versions: 2.3
> Reporter: Hans Brende
> Priority: Minor
> Fix For: 2.3
>
>
> Currently, our CSV extractor tries to figure out what the datatype of each
> cell is simply by attempting to parse a float or integer from the cell and
> falling back on "string".
> This is problematic because cells that look like numbers may not, in fact, be
> numbers.
> Consider a column of version numbers, such as:
> 4
> 4.1
> 4.1.1
> etc.
> Currently, our CSV extractor will assign the following datatypes to this
> column:
> 4 -> integer
> 4.1 -> float
> 4.1.1 -> string
> We could improve this guessing ability by parsing the entire column before
> assigning a datatype, and then using the least-specific datatype encountered.
> However, this solution would also be problematic because then we'd have to
> hold the entire table in memory before generating any triples. And it still
> wouldn't guarantee correctness.
> Without structured data telling us what the original datatype was, I don't
> think assigning any datatypes other than "string" to string values is
> worthwhile.
> Another problem is that the extractor strips leading and trailing whitespace
> from all values, including string values. While this behavior probably
> wouldn't present a problem for most use cases, it does mean that the
> algorithm is lossy.
> Cf. ANY23-218