taotao li created ARROW-7043:
--------------------------------

             Summary: pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv
                 Key: ARROW-7043
                 URL: https://issues.apache.org/jira/browse/ARROW-7043
             Project: Apache Arrow
          Issue Type: Test
          Components: Python
    Affects Versions: 0.15.0
            Reporter: taotao li
Hi, first of all thanks a lot for building Arrow. I found this project through Wes's post: [https://wesmckinney.com/blog/apache-arrow-pandas-internals/]. His ambition to build Arrow to fix pandas's internal problems really caught my eye. Below is my problem.

Background:
 * Our team's analytic work relies heavily on pandas; we often read large CSV files into memory and run various kinds of analysis on them.
 * We have run into the problems mentioned in Wes's post, especially the `pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset`.
 * We are looking for techniques that let us load our CSVs (or other formats, like msgpack, parquet, or something else) using as little memory as possible.

Experiment:
 * Luckily I found Arrow, and I ran a simple test.
 * Input file: a 1.5 GB CSV file, around 6 million records, 15 columns.
 * Using pandas as below:

{code:java}
import pandas as pd{code}
 * 
{code:java}
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)