[
https://issues.apache.org/jira/browse/ARROW-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133860#comment-16133860
]
Steven Anton commented on ARROW-1374:
-------------------------------------
Thanks! OK, I got the conversion time down to 52 seconds by following your
suggestions. It sounds like no changes are really needed; I had just
misunderstood what was going on under the hood. Maybe this is worth a note in
the documentation or a "gotchas" section?
Here's what I did differently:
{code:python}
import cProfile

import pyarrow.parquet as pq
import xgboost as xgb

# Read only the feature columns, then the label column separately.
schema = pq.read_schema('/path/to/file.parq')
columns = [x for x in schema.names if x not in ['tag', ...]]
variables = pq.read_table(str(parquet_path), columns=columns).to_pandas()
# It looks like I could have used the .remove_column method instead of the above
tag = pq.read_table(str(parquet_path), columns=['is_fraud']).to_pandas()
cProfile.run('xgb.DMatrix(variables, label=tag)')
{code}
And the output of cProfile is below, just for reference.
{noformat}
145353 function calls (145344 primitive calls) in 51.970 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
7 0.000 0.000 0.000 0.000 &lt;frozen importlib._bootstrap&gt;:996(_handle_fromlist)
1 0.030 0.030 51.970 51.970 <string>:1(<module>)
4 0.000 0.000 0.002 0.000 __init__.py:357(__getattr__)
4 0.002 0.000 0.002 0.000 __init__.py:364(__getitem__)
1 0.000 0.000 0.000 0.000 __init__.py:483(cast)
1 0.000 0.000 0.000 0.000 _internal.py:225(__init__)
1 0.000 0.000 0.000 0.000 _internal.py:237(data_as)
1 0.000 0.000 0.000 0.000 _methods.py:37(_any)
2 0.000 0.000 0.000 0.000 algorithms.py:1342(_get_take_nd_function)
2 0.000 0.000 0.001 0.000 algorithms.py:1375(take_nd)
2 0.000 0.000 0.000 0.000 base.py:1578(is_all_dates)
1 0.000 0.000 0.114 0.114 base.py:1884(format)
1 0.000 0.000 0.114 0.114 base.py:1899(_format_with_header)
1 0.010 0.010 0.113 0.113 base.py:1910(<listcomp>)
2 0.000 0.000 0.000 0.000 base.py:3999(_ensure_index)
7 0.000 0.000 0.000 0.000 base.py:528(__len__)
3 0.000 0.000 0.000 0.000 base.py:559(values)
2 0.000 0.000 0.000 0.000 cast.py:759(maybe_castable)
2 0.000 0.000 0.000 0.000 cast.py:868(maybe_cast_to_datetime)
2 0.000 0.000 0.000 0.000 common.py:117(is_sparse)
1 0.000 0.000 0.000 0.000 common.py:1419(is_string_like_dtype)
2 0.000 0.000 0.000 0.000 common.py:1456(is_float_dtype)
2 0.000 0.000 0.000 0.000 common.py:1549(is_extension_type)
2 0.000 0.000 0.000 0.000 common.py:1673(_get_dtype)
16 0.000 0.000 0.000 0.000 common.py:1722(_get_dtype_type)
4 0.000 0.000 0.000 0.000 common.py:1852(pandas_dtype)
6 0.000 0.000 0.000 0.000 common.py:190(is_categorical)
9 0.000 0.000 0.000 0.000 common.py:222(is_datetimetz)
7 0.000 0.000 0.000 0.000 common.py:296(is_datetime64_dtype)
14 0.000 0.000 0.000 0.000 common.py:333(is_datetime64tz_dtype)
5 0.000 0.000 0.000 0.000 common.py:371(is_timedelta64_dtype)
1 0.000 0.000 0.000 0.000 common.py:406(is_period_dtype)
3 0.000 0.000 0.000 0.000 common.py:439(is_interval_dtype)
8 0.000 0.000 0.000 0.000 common.py:475(is_categorical_dtype)
1 0.000 0.000 0.000 0.000 common.py:508(is_string_dtype)
3 0.000 0.000 0.000 0.000 common.py:609(is_datetimelike)
4 0.000 0.000 0.000 0.000 common.py:84(is_object_dtype)
5 0.000 0.000 0.000 0.000 core.py:115(_check_call)
1 0.000 0.000 0.000 0.000 core.py:152(c_str)
1 0.150 0.150 0.150 0.150 core.py:157(c_array)
1 0.000 0.000 6.442 6.442 core.py:168(_maybe_pandas_data)
6026 0.008 0.000 0.008 0.000 core.py:175(<genexpr>)
1 0.003 0.003 0.003 0.003 core.py:187(<listcomp>)
1 0.000 0.000 0.001 0.001 core.py:194(_maybe_pandas_label)
2 0.000 0.000 0.000 0.000 core.py:202(<genexpr>)
1 0.000 0.000 51.890 51.890 core.py:222(__init__)
1 27.752 27.752 45.272 45.272 core.py:309(_init_from_npy2d)
1 0.049 0.049 0.050 0.050 core.py:323(__del__)
1 0.001 0.001 0.152 0.152 core.py:368(set_float_info)
1 0.000 0.000 0.152 0.152 core.py:414(set_label)
2 0.000 0.000 0.001 0.000 core.py:501(num_col)
1 0.004 0.004 0.021 0.021 core.py:557(feature_names)
6026 0.007 0.000 0.015 0.000 core.py:576(<genexpr>)
24100 0.004 0.000 0.004 0.000 core.py:577(<genexpr>)
1 0.000 0.000 0.003 0.003 core.py:585(feature_types)
6026 0.002 0.000 0.002 0.000 core.py:614(<genexpr>)
1 0.000 0.000 0.000 0.000 dtypes.py:367(is_dtype)
3 0.000 0.000 0.000 0.000 dtypes.py:489(is_dtype)
26 0.000 0.000 0.000 0.000 dtypes.py:84(is_dtype)
2 0.000 0.000 0.000 0.000 generic.py:117(__init__)
2 0.000 0.000 0.000 0.000 generic.py:145(_validate_dtype)
2 0.000 0.000 0.000 0.000 generic.py:3067(__getattr__)
4 0.000 0.000 0.000 0.000 generic.py:3083(__setattr__)
2 0.000 0.000 0.000 0.000 generic.py:3122(_protect_consolidate)
2 0.000 0.000 0.000 0.000 generic.py:3132(_consolidate_inplace)
2 0.000 0.000 0.000 0.000 generic.py:3135(f)
2 0.000 0.000 0.000 0.000 generic.py:3214(as_matrix)
2 0.000 0.000 0.000 0.000 generic.py:3256(values)
2 0.000 0.000 0.002 0.001 generic.py:3298(dtypes)
2 0.000 0.000 0.000 0.000 generic.py:416(_info_axis)
25 0.000 0.000 0.000 0.000 generic.py:7(_check)
6025 0.009 0.000 0.014 0.000 inference.py:396(is_sequence)
2 0.000 0.000 0.000 0.000 internals.py:102(__init__)
3 0.000 0.000 0.000 0.000 internals.py:154(internal_values)
2 0.000 0.000 0.000 0.000 internals.py:160(get_values)
2 0.000 0.000 0.000 0.000 internals.py:1838(__init__)
6 0.000 0.000 0.000 0.000 internals.py:185(mgr_locs)
2 0.000 0.000 0.000 0.000 internals.py:222(mgr_locs)
2 0.000 0.000 0.000 0.000 internals.py:2683(make_block)
2 0.000 0.000 0.000 0.000 internals.py:2824(ndim)
2 0.000 0.000 0.000 0.000 internals.py:2864(_is_single_block)
2 0.000 0.000 0.000 0.000 internals.py:2897(_get_items)
2 0.000 0.000 0.001 0.000 internals.py:2917(get_dtypes)
2 0.000 0.000 0.000 0.000 internals.py:2918(<listcomp>)
2 0.000 0.000 0.000 0.000 internals.py:2986(__len__)
20 0.000 0.000 0.000 0.000 internals.py:303(dtype)
2 0.000 0.000 0.000 0.000 internals.py:3296(is_consolidated)
2 0.000 0.000 0.000 0.000 internals.py:3438(as_matrix)
2 0.000 0.000 0.000 0.000 internals.py:3560(consolidate)
2 0.000 0.000 0.000 0.000 internals.py:4078(__init__)
21 0.000 0.000 0.000 0.000 internals.py:4124(_block)
18 0.000 0.000 0.000 0.000 internals.py:4194(dtype)
3 0.000 0.000 0.000 0.000 internals.py:4221(internal_values)
1 0.000 0.000 0.000 0.000 missing.py:119(_isnull_ndarraylike)
1 0.000 0.000 0.000 0.000 missing.py:26(isnull)
1 0.000 0.000 0.000 0.000 missing.py:47(_isnull_new)
6025 0.029 0.000 0.104 0.000 printing.py:157(pprint_thing)
6025 0.027 0.000 0.038 0.000 printing.py:186(as_escaped_unicode)
3 0.000 0.000 0.000 0.000 series.py:1049(__iter__)
2 0.000 0.000 0.001 0.000 series.py:139(__init__)
2 0.000 0.000 0.000 0.000 series.py:284(_set_axis)
2 0.000 0.000 0.000 0.000 series.py:2894(_sanitize_array)
2 0.000 0.000 0.000 0.000 series.py:2911(_try_cast)
2 0.000 0.000 0.000 0.000 series.py:310(_set_subtyp)
2 0.000 0.000 0.000 0.000 series.py:320(name)
2 0.000 0.000 0.000 0.000 series.py:324(name)
18 0.000 0.000 0.000 0.000 series.py:331(dtype)
3 0.000 0.000 0.000 0.000 series.py:384(_values)
1 0.000 0.000 0.000 0.000 {built-in method _ctypes.POINTER}
3 0.000 0.000 0.000 0.000 {built-in method _ctypes.byref}
4 0.003 0.001 0.028 0.007 {built-in method builtins.all}
6025 0.003 0.000 0.008 0.000 {built-in method builtins.any}
1 0.000 0.000 51.970 51.970 {built-in method builtins.exec}
30 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
12085 0.018 0.000 0.018 0.000 {built-in method builtins.hasattr}
36332 0.010 0.000 0.010 0.000 {built-in method builtins.isinstance}
30 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
6028 0.002 0.000 0.002 0.000 {built-in method builtins.iter}
6066/6057 0.001 0.000 0.001 0.000 {built-in method builtins.len}
4 0.000 0.000 0.000 0.000 {built-in method builtins.setattr}
7 2.229 0.318 2.229 0.318 {built-in method numpy.core.multiarray.array}
3 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.empty}
2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_int64}
2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_object}
2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_datetime_array}
1 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.isscalar}
1 0.000 0.000 0.000 0.000 {method 'any' of 'numpy.ndarray' objects}
2 6.314 3.157 6.314 3.157 {method 'astype' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'encode' of 'str' objects}
2 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
18075 0.009 0.000 0.009 0.000 {method 'replace' of 'str' objects}
2 15.291 7.645 15.291 7.645 {method 'reshape' of 'numpy.ndarray' objects}
4 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
3 0.000 0.000 0.000 0.000 {method 'view' of 'numpy.ndarray' objects}
2 0.000 0.000 0.000 0.000 {pandas._libs.algos.take_1d_object_object}
1 0.000 0.000 0.000 0.000 {pandas._libs.lib.isnullobj}
1 0.000 0.000 0.000 0.000 {pandas._libs.lib.maybe_convert_objects}
{noformat}
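Reading that profile, nearly all the time is in {{_init_from_npy2d}}, {{reshape}}, and {{astype}}, i.e. copying the column-major float64 block from pandas into the row-major float32 layout DMatrix needs. A numpy-only sketch of doing that conversion explicitly, once, before handing the array to {{xgb.DMatrix}} (array names are illustrative):

```python
import numpy as np

# Column-major (Fortran-order) float64 block, the layout a pandas
# DataFrame of homogeneous columns typically hands to xgboost.
features = np.asfortranarray(np.random.rand(1000, 50))

# One explicit copy into the row-major float32 layout DMatrix wants;
# passing `dense` to xgb.DMatrix avoids the implicit reshape/astype.
dense = np.ascontiguousarray(features, dtype=np.float32)
```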
> Compatibility with xgboost
> --------------------------
>
> Key: ARROW-1374
> URL: https://issues.apache.org/jira/browse/ARROW-1374
> Project: Apache Arrow
> Issue Type: Wish
> Reporter: Steven Anton
> Priority: Minor
>
> Traditionally I work with CSVs and really suffer from slow read/write times.
> Parquet and the Arrow project obviously give us huge speedups.
> One thing I've noticed, however, is that there is a serious bottleneck when
> converting a DataFrame read in through pyarrow to a DMatrix used by xgboost.
> For example, I'm building a model with about 180k rows and 6k float64
> columns. Reading into a pandas DataFrame takes about 20 seconds on my
> machine. However, converting that DataFrame to a DMatrix takes well over 10
> minutes.
> Interestingly, it takes about 10 minutes to read that same data from a CSV
> into a pandas DataFrame. Then, it takes less than a minute to convert to a
> DMatrix.
> I'm sure there's a good technical explanation for why this happens (e.g. row
> vs. column storage). Still, I imagine this use case applies to many users, and
> it would be great to improve these times, if possible.
> {code:none}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> import xgboost as xgb
> # Reading from parquet:
> table = pq.read_table('/path/to/parquet/files') # 20 seconds
> variables = table.to_pandas() # 1 second
> dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])
> # takes 10-15 minutes
> # Reading from CSV:
> variables = pd.read_csv('/path/to/file.csv', ...) # takes about 10 minutes
> dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])
> # less than 1 minute
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)