[ https://issues.apache.org/jira/browse/ARROW-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238025#comment-16238025 ]

Arthur Maciejewicz commented on ARROW-25:
-----------------------------------------

The files have moved. These links are current on master as of Friday, Nov. 3, 2017:

* https://github.com/pandas-dev/pandas/blob/27bbea7ee125f4dc19dca2a7703c9a13ca754f9b/pandas/_libs/parsers.pyx
* https://github.com/pandas-dev/pandas/blob/27bbea7ee125f4dc19dca2a7703c9a13ca754f9b/pandas/_libs/src/parser/tokenizer.h
* https://github.com/pandas-dev/pandas/blob/27bbea7ee125f4dc19dca2a7703c9a13ca754f9b/pandas/_libs/src/parser/tokenizer.c

The tokenizer DFA mentioned in the issue description is now located at:

https://github.com/pandas-dev/pandas/blob/27bbea7ee125f4dc19dca2a7703c9a13ca754f9b/pandas/_libs/src/parser/tokenizer.c#L757-L1112

> C++: Implement delimited file scanner / CSV reader
> --------------------------------------------------
>
>                 Key: ARROW-25
>                 URL: https://issues.apache.org/jira/browse/ARROW-25
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>
> Like Parquet and binary file formats, text files will be an important data 
> medium for converting to and from in-memory Arrow data. 
> pandas has some (Apache-compatible) business logic we can learn from here (as 
> one of the gold-standard CSV readers in production use)
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.h
> https://github.com/pydata/pandas/blob/master/pandas/parser.pyx
> While very fast, this should be largely written from scratch to target 
> the Arrow memory layout, but we can reuse certain aspects like the tokenizer 
> DFA (which originally came from the Python interpreter csv module 
> implementation)
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L713



