Steven Anton created ARROW-1374:
-----------------------------------
Summary: Compatibility with xgboost
Key: ARROW-1374
URL: https://issues.apache.org/jira/browse/ARROW-1374
Project: Apache Arrow
Issue Type: Wish
Reporter: Steven Anton
Priority: Minor
Traditionally I work with CSVs and really suffer from slow read/write times.
Parquet and the Arrow project obviously give us huge speedups.
One thing I've noticed, however, is that there is a serious bottleneck when
converting a DataFrame read in through pyarrow to a DMatrix used by xgboost.
For example, I'm building a model with about 180k rows and 6k float64 columns.
Reading into a pandas DataFrame takes about 20 seconds on my machine. However,
converting that DataFrame to a DMatrix takes well over 10 minutes.
Interestingly, it takes about 10 minutes to read that same data from a CSV into
a pandas DataFrame. Then, it takes less than a minute to convert to a DMatrix.
I'm sure there's a good technical explanation for why this happens (e.g. row vs.
column storage). Still, I imagine this use case is common enough that it would
be great to improve these times, if possible.
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import xgboost as xgb

# Reading from Parquet:
table = pq.read_table('/path/to/parquet/files')        # ~20 seconds
variables = table.to_pandas()                           # ~1 second
dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1),
                     label=variables['tag'])            # 10-15 minutes

# Reading from CSV:
variables = pd.read_csv('/path/to/file.csv', ...)       # ~10 minutes
dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1),
                     label=variables['tag'])            # less than 1 minute
{code}
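For what it's worth, if the slow path really is the layout mismatch (to_pandas()
hands back column-major float64 data while xgboost wants a row-major float32
matrix), a workaround along these lines might shift the cost to a single explicit
copy. This is an untested sketch, not a measured result; it reuses the 'tag'
column and path from the example above:
{code:python}
import numpy as np
import pyarrow.parquet as pq
import xgboost as xgb

# Untested sketch of a possible workaround.
table = pq.read_table('/path/to/parquet/files')
variables = table.to_pandas()

label = variables['tag'].values
features = variables.drop(['tag'], axis=1)

# Pay for the dtype/layout conversion once, up front: hand DMatrix a single
# C-contiguous float32 array instead of a column-major float64 DataFrame.
feature_array = np.ascontiguousarray(features.values.astype(np.float32))

dtrain = xgb.DMatrix(feature_array,
                     label=label,
                     feature_names=list(features.columns))
{code}
The idea is simply to do the conversion explicitly before construction rather
than letting the DMatrix constructor discover a non-contiguous float64 frame.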