[ 
https://issues.apache.org/jira/browse/ARROW-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133860#comment-16133860
 ] 

Steven Anton commented on ARROW-1374:
-------------------------------------

Thanks! Ok, I got the conversion time down to 52 seconds by following your 
suggestions. It sounds like there aren't really any changes needed. I had just 
misunderstood what was going on under the hood. Maybe some documentation or a 
"gotchas" section?

Here's what I did differently:

{code:none}
schema = pq.read_schema('/path/to/file.parq')
columns = [x for x in schema.names if x not in ['tag', ...]]
variables = pq.read_table(str(parquet_path), columns=columns).to_pandas()
# It looks like I could have used the .remove_column method instead of the above
tag = pq.read_table(str(parquet_path), columns=['is_fraud']).to_pandas()
cProfile.run('xgb.DMatrix(variables, label=tag)')
{code}

And the output of cProfile is below, just for reference.

{noformat}
         145353 function calls (145344 primitive calls) in 51.970 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        7    0.000    0.000    0.000    0.000 <frozen 
importlib._bootstrap>:996(_handle_fromlist)
        1    0.030    0.030   51.970   51.970 <string>:1(<module>)
        4    0.000    0.000    0.002    0.000 __init__.py:357(__getattr__)
        4    0.002    0.000    0.002    0.000 __init__.py:364(__getitem__)
        1    0.000    0.000    0.000    0.000 __init__.py:483(cast)
        1    0.000    0.000    0.000    0.000 _internal.py:225(__init__)
        1    0.000    0.000    0.000    0.000 _internal.py:237(data_as)
        1    0.000    0.000    0.000    0.000 _methods.py:37(_any)
        2    0.000    0.000    0.000    0.000 
algorithms.py:1342(_get_take_nd_function)
        2    0.000    0.000    0.001    0.000 algorithms.py:1375(take_nd)
        2    0.000    0.000    0.000    0.000 base.py:1578(is_all_dates)
        1    0.000    0.000    0.114    0.114 base.py:1884(format)
        1    0.000    0.000    0.114    0.114 base.py:1899(_format_with_header)
        1    0.010    0.010    0.113    0.113 base.py:1910(<listcomp>)
        2    0.000    0.000    0.000    0.000 base.py:3999(_ensure_index)
        7    0.000    0.000    0.000    0.000 base.py:528(__len__)
        3    0.000    0.000    0.000    0.000 base.py:559(values)
        2    0.000    0.000    0.000    0.000 cast.py:759(maybe_castable)
        2    0.000    0.000    0.000    0.000 
cast.py:868(maybe_cast_to_datetime)
        2    0.000    0.000    0.000    0.000 common.py:117(is_sparse)
        1    0.000    0.000    0.000    0.000 
common.py:1419(is_string_like_dtype)
        2    0.000    0.000    0.000    0.000 common.py:1456(is_float_dtype)
        2    0.000    0.000    0.000    0.000 common.py:1549(is_extension_type)
        2    0.000    0.000    0.000    0.000 common.py:1673(_get_dtype)
       16    0.000    0.000    0.000    0.000 common.py:1722(_get_dtype_type)
        4    0.000    0.000    0.000    0.000 common.py:1852(pandas_dtype)
        6    0.000    0.000    0.000    0.000 common.py:190(is_categorical)
        9    0.000    0.000    0.000    0.000 common.py:222(is_datetimetz)
        7    0.000    0.000    0.000    0.000 common.py:296(is_datetime64_dtype)
       14    0.000    0.000    0.000    0.000 
common.py:333(is_datetime64tz_dtype)
        5    0.000    0.000    0.000    0.000 
common.py:371(is_timedelta64_dtype)
        1    0.000    0.000    0.000    0.000 common.py:406(is_period_dtype)
        3    0.000    0.000    0.000    0.000 common.py:439(is_interval_dtype)
        8    0.000    0.000    0.000    0.000 
common.py:475(is_categorical_dtype)
        1    0.000    0.000    0.000    0.000 common.py:508(is_string_dtype)
        3    0.000    0.000    0.000    0.000 common.py:609(is_datetimelike)
        4    0.000    0.000    0.000    0.000 common.py:84(is_object_dtype)
        5    0.000    0.000    0.000    0.000 core.py:115(_check_call)
        1    0.000    0.000    0.000    0.000 core.py:152(c_str)
        1    0.150    0.150    0.150    0.150 core.py:157(c_array)
        1    0.000    0.000    6.442    6.442 core.py:168(_maybe_pandas_data)
     6026    0.008    0.000    0.008    0.000 core.py:175(<genexpr>)
        1    0.003    0.003    0.003    0.003 core.py:187(<listcomp>)
        1    0.000    0.000    0.001    0.001 core.py:194(_maybe_pandas_label)
        2    0.000    0.000    0.000    0.000 core.py:202(<genexpr>)
        1    0.000    0.000   51.890   51.890 core.py:222(__init__)
        1   27.752   27.752   45.272   45.272 core.py:309(_init_from_npy2d)
        1    0.049    0.049    0.050    0.050 core.py:323(__del__)
        1    0.001    0.001    0.152    0.152 core.py:368(set_float_info)
        1    0.000    0.000    0.152    0.152 core.py:414(set_label)
        2    0.000    0.000    0.001    0.000 core.py:501(num_col)
        1    0.004    0.004    0.021    0.021 core.py:557(feature_names)
     6026    0.007    0.000    0.015    0.000 core.py:576(<genexpr>)
    24100    0.004    0.000    0.004    0.000 core.py:577(<genexpr>)
        1    0.000    0.000    0.003    0.003 core.py:585(feature_types)
     6026    0.002    0.000    0.002    0.000 core.py:614(<genexpr>)
        1    0.000    0.000    0.000    0.000 dtypes.py:367(is_dtype)
        3    0.000    0.000    0.000    0.000 dtypes.py:489(is_dtype)
       26    0.000    0.000    0.000    0.000 dtypes.py:84(is_dtype)
        2    0.000    0.000    0.000    0.000 generic.py:117(__init__)
        2    0.000    0.000    0.000    0.000 generic.py:145(_validate_dtype)
        2    0.000    0.000    0.000    0.000 generic.py:3067(__getattr__)
        4    0.000    0.000    0.000    0.000 generic.py:3083(__setattr__)
        2    0.000    0.000    0.000    0.000 
generic.py:3122(_protect_consolidate)
        2    0.000    0.000    0.000    0.000 
generic.py:3132(_consolidate_inplace)
        2    0.000    0.000    0.000    0.000 generic.py:3135(f)
        2    0.000    0.000    0.000    0.000 generic.py:3214(as_matrix)
        2    0.000    0.000    0.000    0.000 generic.py:3256(values)
        2    0.000    0.000    0.002    0.001 generic.py:3298(dtypes)
        2    0.000    0.000    0.000    0.000 generic.py:416(_info_axis)
       25    0.000    0.000    0.000    0.000 generic.py:7(_check)
     6025    0.009    0.000    0.014    0.000 inference.py:396(is_sequence)
        2    0.000    0.000    0.000    0.000 internals.py:102(__init__)
        3    0.000    0.000    0.000    0.000 internals.py:154(internal_values)
        2    0.000    0.000    0.000    0.000 internals.py:160(get_values)
        2    0.000    0.000    0.000    0.000 internals.py:1838(__init__)
        6    0.000    0.000    0.000    0.000 internals.py:185(mgr_locs)
        2    0.000    0.000    0.000    0.000 internals.py:222(mgr_locs)
        2    0.000    0.000    0.000    0.000 internals.py:2683(make_block)
        2    0.000    0.000    0.000    0.000 internals.py:2824(ndim)
        2    0.000    0.000    0.000    0.000 
internals.py:2864(_is_single_block)
        2    0.000    0.000    0.000    0.000 internals.py:2897(_get_items)
        2    0.000    0.000    0.001    0.000 internals.py:2917(get_dtypes)
        2    0.000    0.000    0.000    0.000 internals.py:2918(<listcomp>)
        2    0.000    0.000    0.000    0.000 internals.py:2986(__len__)
       20    0.000    0.000    0.000    0.000 internals.py:303(dtype)
        2    0.000    0.000    0.000    0.000 internals.py:3296(is_consolidated)
        2    0.000    0.000    0.000    0.000 internals.py:3438(as_matrix)
        2    0.000    0.000    0.000    0.000 internals.py:3560(consolidate)
        2    0.000    0.000    0.000    0.000 internals.py:4078(__init__)
       21    0.000    0.000    0.000    0.000 internals.py:4124(_block)
       18    0.000    0.000    0.000    0.000 internals.py:4194(dtype)
        3    0.000    0.000    0.000    0.000 internals.py:4221(internal_values)
        1    0.000    0.000    0.000    0.000 
missing.py:119(_isnull_ndarraylike)
        1    0.000    0.000    0.000    0.000 missing.py:26(isnull)
        1    0.000    0.000    0.000    0.000 missing.py:47(_isnull_new)
     6025    0.029    0.000    0.104    0.000 printing.py:157(pprint_thing)
     6025    0.027    0.000    0.038    0.000 
printing.py:186(as_escaped_unicode)
        3    0.000    0.000    0.000    0.000 series.py:1049(__iter__)
        2    0.000    0.000    0.001    0.000 series.py:139(__init__)
        2    0.000    0.000    0.000    0.000 series.py:284(_set_axis)
        2    0.000    0.000    0.000    0.000 series.py:2894(_sanitize_array)
        2    0.000    0.000    0.000    0.000 series.py:2911(_try_cast)
        2    0.000    0.000    0.000    0.000 series.py:310(_set_subtyp)
        2    0.000    0.000    0.000    0.000 series.py:320(name)
        2    0.000    0.000    0.000    0.000 series.py:324(name)
       18    0.000    0.000    0.000    0.000 series.py:331(dtype)
        3    0.000    0.000    0.000    0.000 series.py:384(_values)
        1    0.000    0.000    0.000    0.000 {built-in method _ctypes.POINTER}
        3    0.000    0.000    0.000    0.000 {built-in method _ctypes.byref}
        4    0.003    0.001    0.028    0.007 {built-in method builtins.all}
     6025    0.003    0.000    0.008    0.000 {built-in method builtins.any}
        1    0.000    0.000   51.970   51.970 {built-in method builtins.exec}
       30    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
    12085    0.018    0.000    0.018    0.000 {built-in method builtins.hasattr}
    36332    0.010    0.000    0.010    0.000 {built-in method 
builtins.isinstance}
       30    0.000    0.000    0.000    0.000 {built-in method 
builtins.issubclass}
     6028    0.002    0.000    0.002    0.000 {built-in method builtins.iter}
6066/6057    0.001    0.000    0.001    0.000 {built-in method builtins.len}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.setattr}
        7    2.229    0.318    2.229    0.318 {built-in method 
numpy.core.multiarray.array}
        3    0.000    0.000    0.000    0.000 {built-in method 
numpy.core.multiarray.empty}
        2    0.000    0.000    0.000    0.000 {built-in method 
pandas._libs.algos.ensure_int64}
        2    0.000    0.000    0.000    0.000 {built-in method 
pandas._libs.algos.ensure_object}
        2    0.000    0.000    0.000    0.000 {built-in method 
pandas._libs.lib.is_datetime_array}
        1    0.000    0.000    0.000    0.000 {built-in method 
pandas._libs.lib.isscalar}
        1    0.000    0.000    0.000    0.000 {method 'any' of 'numpy.ndarray' 
objects}
        2    6.314    3.157    6.314    3.157 {method 'astype' of 
'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of 
'_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'encode' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'ravel' of 
'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' 
objects}
    18075    0.009    0.000    0.009    0.000 {method 'replace' of 'str' 
objects}
        2   15.291    7.645   15.291    7.645 {method 'reshape' of 
'numpy.ndarray' objects}
        4    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' 
objects}
        3    0.000    0.000    0.000    0.000 {method 'view' of 'numpy.ndarray' 
objects}
        2    0.000    0.000    0.000    0.000 
{pandas._libs.algos.take_1d_object_object}
        1    0.000    0.000    0.000    0.000 {pandas._libs.lib.isnullobj}
        1    0.000    0.000    0.000    0.000 
{pandas._libs.lib.maybe_convert_objects}
{noformat}



> Compatibility with xgboost
> --------------------------
>
>                 Key: ARROW-1374
>                 URL: https://issues.apache.org/jira/browse/ARROW-1374
>             Project: Apache Arrow
>          Issue Type: Wish
>            Reporter: Steven Anton
>            Priority: Minor
>
> Traditionally I work with CSV's and really suffer with slow read/write times. 
> Parquet and the Arrow project obviously give us huge speedups.
> One thing I've noticed, however, is that there is a serious bottleneck when 
> converting a DataFrame read in through pyarrow to a DMatrix used by xgboost. 
> For example, I'm building a model with about 180k rows and 6k float64 
> columns. Reading into a pandas DataFrame takes about 20 seconds on my 
> machine. However, converting that DataFrame to a DMatrix takes well over 10 
> minutes.
> Interestingly, it takes about 10 minutes to read that same data from a CSV 
> into a pandas DataFrame. Then, it takes less than a minute to convert to a 
> DMatrix.
> I'm sure there's a good technical explanation for why this happens (e.g. row 
> vs column storage). Still, I imagine this use case may occur to many and it 
> would be great to improve these times, if possible.
> {code:none}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> import xgboost as xgb
> # Reading from parquet:
> table = pq.read_table('/path/to/parquet/files')  # 20 seconds
> variables = table.to_pandas()  # 1 second
> dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag']) 
>  # takes 10-15 minutes
> # Reading from CSV:
> variables = pd.read_csv('/path/to/file.csv', ...)  # takes about 10 minutes
> dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag']) 
>  # less than 1 minute
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to