[ https://issues.apache.org/jira/browse/ARROW-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou reassigned ARROW-25:
-----------------------------------
Assignee: Antoine Pitrou
> [C++] Implement delimited file scanner / CSV reader
> ---------------------------------------------------
>
> Key: ARROW-25
> URL: https://issues.apache.org/jira/browse/ARROW-25
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Antoine Pitrou
> Priority: Major
>
> Like Parquet and binary file formats, text files will be an important data
> medium for converting to and from in-memory Arrow data.
> pandas, one of the gold-standard CSV readers in production use, has some
> (Apache-compatible) business logic we can learn from here:
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.h
> https://github.com/pydata/pandas/blob/master/pandas/parser.pyx
> While the pandas parser is very fast, the Arrow reader should be largely
> written from scratch to target the Arrow memory layout, but we can reuse
> certain aspects like the tokenizer DFA (which originally came from the
> Python interpreter's csv module implementation):
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L713
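
For illustration only, here is a minimal sketch of what a tokenizer DFA writing
into an Arrow-like layout might look like. It is not the pandas tokenizer and
not an Arrow API: State, TokenizedCsv and Tokenize are hypothetical names, and
the rules it assumes (double quotes, "" as an escaped quote, '\r' dropped
outside quotes, blank lines skipped) are assumptions for the example. Cells are
stored as one contiguous byte buffer plus an offsets vector, similar in spirit
to Arrow's variable-length binary layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical names throughout; a sketch, not the pandas or Arrow code.
enum class State { StartField, InField, InQuotedField, QuoteInQuotedField };

// Cell i occupies data[offsets[i], offsets[i + 1]).
struct TokenizedCsv {
  std::string data;                 // all cell bytes, back to back
  std::vector<int32_t> offsets{0};  // one more entry than there are cells
  std::vector<int32_t> row_ends;    // cell count at the end of each row

  std::string cell(std::size_t i) const {
    return data.substr(offsets[i], offsets[i + 1] - offsets[i]);
  }
};

TokenizedCsv Tokenize(const std::string& input, char delim = ',') {
  TokenizedCsv out;
  State state = State::StartField;
  bool row_open = false;  // true once the current row has any content

  auto end_field = [&] {
    out.offsets.push_back(static_cast<int32_t>(out.data.size()));
  };
  auto end_row = [&] {
    end_field();
    out.row_ends.push_back(static_cast<int32_t>(out.offsets.size() - 1));
    row_open = false;
  };

  for (char c : input) {
    switch (state) {
      case State::StartField:
        if (c == '"') {
          state = State::InQuotedField;
          row_open = true;
        } else if (c == delim) {
          end_field();  // empty cell
          row_open = true;
        } else if (c == '\n') {
          if (row_open) end_row();  // blank lines are skipped
        } else if (c != '\r') {
          out.data.push_back(c);
          state = State::InField;
          row_open = true;
        }
        break;
      case State::InField:
        if (c == delim) {
          end_field();
          state = State::StartField;
        } else if (c == '\n') {
          end_row();
          state = State::StartField;
        } else if (c != '\r') {
          out.data.push_back(c);
        }
        break;
      case State::InQuotedField:
        if (c == '"') {
          state = State::QuoteInQuotedField;
        } else {
          out.data.push_back(c);  // delimiters and newlines are literal here
        }
        break;
      case State::QuoteInQuotedField:
        if (c == '"') {
          out.data.push_back('"');  // "" inside quotes is an escaped quote
          state = State::InQuotedField;
        } else if (c == delim) {
          end_field();
          state = State::StartField;
        } else if (c == '\n') {
          end_row();
          state = State::StartField;
        } else if (c != '\r') {
          out.data.push_back(c);  // lenient: text after a closing quote
          state = State::InField;
        }
        break;
    }
  }
  if (row_open) end_row();  // input without a trailing newline
  return out;
}
```

Calling Tokenize on a buffer and reading cells back with cell(i) copies bytes
out only on demand; a real reader would presumably hand the data and offsets
buffers to column builders instead of materializing std::string values.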