Wes McKinney created ARROW-25:
---------------------------------

             Summary: C++: Start a delimited file tokenizer-converter
                 Key: ARROW-25
                 URL: https://issues.apache.org/jira/browse/ARROW-25
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
            Reporter: Wes McKinney


Like Parquet and binary file formats, text files will be an important data 
medium for converting to and from in-memory Arrow data. 

pandas, one of the gold-standard CSV readers in production use, has some 
(Apache-compatible) business logic we can learn from here:

https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.h
https://github.com/pydata/pandas/blob/master/pandas/parser.pyx

While the pandas parser is very fast, this should be largely written from 
scratch to target the Arrow memory layout. We can still reuse certain aspects, 
like the tokenizer DFA (which originally came from the Python interpreter's 
csv module implementation):

https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L713
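
To make the idea concrete, here is a rough sketch of what a DFA-style 
tokenizer writing into an Arrow-like layout (one contiguous byte buffer plus 
an offsets array) might look like. It is not the pandas tokenizer and not an 
Arrow API: State, TokenizedFields and Tokenize are hypothetical names, the 
state set is a reduced one, and the handling of '\r', blank lines and 
malformed quotes is a simplifying assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: none of these names come from Arrow or pandas.

// Reduced set of tokenizer states for a delimited-text DFA.
enum class State {
  StartField,     // at the beginning of a field
  InField,        // inside an unquoted field
  InQuotedField,  // inside a quoted field
  QuoteInQuoted   // saw a quote inside a quoted field (closing or escaped)
};

// Output in an Arrow-like variable-length layout:
// field i occupies the bytes data[offsets[i], offsets[i + 1]).
struct TokenizedFields {
  std::string data;                 // all field bytes, back to back
  std::vector<int32_t> offsets{0};  // num_fields() + 1 entries
  std::vector<int32_t> row_ends;    // cumulative field count at each row end

  int64_t num_fields() const {
    return static_cast<int64_t>(offsets.size()) - 1;
  }

  std::string field(int64_t i) const {
    assert(i >= 0 && i < num_fields());
    const auto begin = static_cast<size_t>(offsets[i]);
    const auto end = static_cast<size_t>(offsets[i + 1]);
    return data.substr(begin, end - begin);
  }
};

// Tokenize the whole input in one pass. Simplifications (assumptions):
// '\r' outside quotes is dropped, blank lines are skipped, and stray
// characters after a closing quote are appended to the field.
inline TokenizedFields Tokenize(const std::string& input, char delim = ',',
                                char quote = '"') {
  TokenizedFields out;
  State state = State::StartField;
  bool at_row_start = true;  // nothing consumed yet for the current row

  auto end_field = [&] {
    out.offsets.push_back(static_cast<int32_t>(out.data.size()));
  };
  auto end_row = [&] {
    end_field();
    out.row_ends.push_back(static_cast<int32_t>(out.num_fields()));
    at_row_start = true;
  };

  for (char c : input) {
    if (c == '\r' && state != State::InQuotedField) continue;
    switch (state) {
      case State::StartField:
        if (c == '\n') {
          if (!at_row_start) end_row();  // closes a trailing empty field
          break;
        }
        at_row_start = false;
        if (c == quote) {
          state = State::InQuotedField;
        } else if (c == delim) {
          end_field();  // empty field
        } else {
          out.data.push_back(c);
          state = State::InField;
        }
        break;
      case State::InField:
        if (c == delim) {
          end_field();
          state = State::StartField;
        } else if (c == '\n') {
          end_row();
          state = State::StartField;
        } else {
          out.data.push_back(c);
        }
        break;
      case State::InQuotedField:
        if (c == quote) {
          state = State::QuoteInQuoted;
        } else {
          out.data.push_back(c);  // delimiters and newlines are literal here
        }
        break;
      case State::QuoteInQuoted:
        if (c == quote) {  // doubled quote means one literal quote
          out.data.push_back(quote);
          state = State::InQuotedField;
        } else if (c == delim) {
          end_field();
          state = State::StartField;
        } else if (c == '\n') {
          end_row();
          state = State::StartField;
        } else {
          out.data.push_back(c);
          state = State::InField;
        }
        break;
    }
  }
  if (!at_row_start) end_row();  // flush a final row without trailing newline
  return out;
}
```

Calling Tokenize("a,b\n1,2\n") under these assumptions yields four fields in 
one buffer with row_ends marking two rows of two fields each. Keeping the 
tokens in a single buffer with offsets means a later conversion step can scan 
a column's fields without per-field allocations.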



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
