Hello Antoine and Wes,

really excited to see this happen. CSVs and co. are the file formats you never 
get rid of, so it is really important to have an Arrow reader. Concerning the 
custom implementation, I can further back this: while working on the 
parquet-arrow reader, I spent quite some time building custom, optimized paths 
that produce Arrow columns directly instead of going through a more 
Parquet-native intermediate. For example, the methods suffixed with *spaced in 
parquet-cpp brought a 2-4x improvement in read performance compared to the more 
general implementations.
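To illustrate the idea (this is a hand-rolled sketch, not the actual parquet-cpp API): a "spaced" decode path takes the densely packed non-null values coming out of the decoder and scatters them straight into their final output slots, guided by a validity mask, instead of first materializing a row-wise intermediate of optional values:

```python
def decode_spaced(dense_values, validity):
    """Scatter densely decoded values into a 'spaced' output buffer.

    dense_values: the non-null values, in order of appearance.
    validity: one boolean per output slot (True = value present).
    Returns the output buffer with None in the null slots, mirroring how
    an Arrow array pairs a values buffer with a validity bitmap.
    """
    out = [None] * len(validity)
    it = iter(dense_values)
    for i, valid in enumerate(validity):
        if valid:
            out[i] = next(it)
    return out

# Three decoded values spread over five slots:
print(decode_spaced([10, 20, 30], [True, False, True, True, False]))
# -> [10, None, 20, 30, None]
```

The single scatter pass writes each value exactly once into its final position, which is where the win over a generic "decode, then convert" pipeline comes from.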

Uwe

On Fri, Aug 17, 2018, at 10:33 PM, Wes McKinney wrote:
> hi all,
> 
> Early in the project I created the issue
> 
> https://issues.apache.org/jira/browse/ARROW-25
> 
> about creating a high performance CSV file reader that returns Arrow
> record batches. Many data systems have invested significant energies
> in solving this problem, so why would we build Yet Another CSV Reader
> in Apache Arrow? I originally wrote pandas.read_csv, for example.
> 
> Well, there are in fact some really good reasons.
> 
> 1) There have been a number of advances in designs for CSV readers to
> leverage multiple cores for better performance, for example:
> 
> * https://github.com/wiseio/paratext
> * the data.table::fread function in R
> 
> and others. Many existing CSV readers can and should be rearchitected
> to take advantage of these designs.
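One common design in this family (an illustration of the general approach, not necessarily what Arrow will adopt) is to split the raw bytes into roughly equal chunks, advance each split point to the next newline so that no record straddles a chunk, and then parse the chunks in parallel:

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def chunk_boundaries(data: bytes, n_chunks: int):
    """Return (start, end) byte ranges aligned to newline boundaries."""
    approx = max(1, len(data) // n_chunks)
    bounds, start = [], 0
    while start < len(data):
        end = min(start + approx, len(data))
        nl = data.find(b"\n", end)          # advance to the next record boundary
        end = len(data) if nl == -1 else nl + 1
        bounds.append((start, end))
        start = end
    return bounds

def parse_chunk(data: bytes, start: int, end: int):
    return list(csv.reader(io.StringIO(data[start:end].decode())))

def parallel_read_csv(data: bytes, n_chunks: int = 4):
    bounds = chunk_boundaries(data, n_chunks)
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda b: parse_chunk(data, *b), bounds)
    return [row for part in parts for row in part]

data = b"a,b\n1,2\n3,4\n5,6\n"
print(parallel_read_csv(data, n_chunks=3))
# -> [['a', 'b'], ['1', '2'], ['3', '4'], ['5', '6']]
```

Note the simplification: treating every newline as a record boundary breaks on quoted fields that contain embedded newlines, and handling that case robustly is one of the genuinely hard parts of a production parallel CSV reader.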
> 
> 2) The hot paths in CSV parsing tend to be highly particular to the
> target data structures. Utilizing intermediate data structures hurts
> performance in a meaningful way. Also, the orientation (columnar vs.
> non-columnar) impacts the general design of the computational hot
> paths.
> 
> Other computational choices, such as how to handle erroneous values or
> nulls, or whether to dictionary-encode string columns (e.g. using
> Arrow's dictionary encoding), have an impact on design as well.
> 
> Thus, the highest performance CSV reader must be specialized to the
> Arrow columnar layout in its hot paths.
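As a toy illustration of the dictionary-encoding choice mentioned above: instead of materializing every string, a parser can emit integer indices into a dictionary of unique values, mirroring the shape of an Arrow DictionaryArray (indices plus dictionary). For low-cardinality columns this shrinks memory and speeds up later comparisons. A minimal pure-Python sketch:

```python
def dictionary_encode(values):
    """Return (indices, dictionary) with dictionary[indices[i]] == values[i]."""
    dictionary = []       # unique values, in order of first appearance
    index_of = {}         # value -> position in the dictionary
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return indices, dictionary

print(dictionary_encode(["NY", "SF", "NY", "NY", "SF"]))
# -> ([0, 1, 0, 0, 1], ['NY', 'SF'])
```

Whether to do this during the parse or as a post-pass is exactly the kind of decision that shapes the hot paths, since the lookup happens once per cell.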
> 
> 3) Many applications spend a lot of their time converting text files
> into tables. So solving the problem well pays long-term dividends.
> 
> 4) As a development platform, solving the problem well in Apache Arrow
> will enable many downstream consumers to profit from performance and
> IO gains, and having this critical piece of shared infrastructure in a
> community project will drive contributions back upstream into Arrow.
> For example, we could use this easily in Python, R, and Ruby.
> 
> 5) By building inside Arrow we can utilize common interfaces for IO
> and concurrency: file system APIs, memory management (and taking
> advantage of our jemalloc infrastructure [1]), on-the-fly
> decompression, asynchronous / buffering input streams, thread
> management, and others.
> 
> There are probably some other reasons, but these are the main ones I think 
> about.
> 
> I spoke briefly about the project with Antoine, and he has begun putting
> together the start of a reader in the C++ codebase:
> 
> https://github.com/pitrou/arrow/tree/csv_reader
> 
> I'm excited for this project to get off the ground as it will have a
> lot of user-visible impact and pay dividends for many years. It would
> be great for those who have worked on fast CSV parsing to share their
> experiences and get involved to help make good design choices and take
> advantage of lessons learned in other projects.
> 
> - Wes
> 
> [1]: http://arrow.apache.org/blog/2018/07/20/jemalloc/