jorisvandenbossche commented on PR #38784: URL: https://github.com/apache/arrow/pull/38784#issuecomment-1825335216
To make it easier to reproduce locally (the benchmark code itself is not that easy to re-use), I extracted the necessary bits into a small script. This creates the parquet file:

<details>

```python
import pyarrow
import pyarrow.csv
import pyarrow.parquet as pq

# download https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz
path = "Downloads/2016Q4.csv.gz"

schema = pyarrow.schema([
    pyarrow.field("LOAN_ID", pyarrow.string()),
    # date. Monthly reporting period
    pyarrow.field("ACT_PERIOD", pyarrow.string()),
    pyarrow.field("SERVICER", pyarrow.string()),
    pyarrow.field("ORIG_RATE", pyarrow.float64()),
    pyarrow.field("CURRENT_UPB", pyarrow.float64()),
    pyarrow.field("LOAN_AGE", pyarrow.int32()),
    pyarrow.field("REM_MONTHS", pyarrow.int32()),
    pyarrow.field("ADJ_REM_MONTHS", pyarrow.int32()),
    # maturity date
    pyarrow.field("MATR_DT", pyarrow.string()),
    # Metropolitan Statistical Area code
    pyarrow.field("MSA", pyarrow.string()),
    # Int of months, but `X` is a valid value. New versions pad with `0`/`X` to two characters
    pyarrow.field("DLQ_STATUS", pyarrow.string()),
    pyarrow.field("RELOCATION_MORTGAGE_INDICATOR", pyarrow.string()),
    # 0-padded 2 digit ints representing categorical levels, e.g. "01" -> "Prepaid or Matured"
    pyarrow.field("Zero_Bal_Code", pyarrow.string()),
    pyarrow.field("ZB_DTE", pyarrow.string()),
    # date
    pyarrow.field("LAST_PAID_INSTALLMENT_DATE", pyarrow.string()),
    pyarrow.field("FORECLOSURE_DATE", pyarrow.string()),
    pyarrow.field("DISPOSITION_DATE", pyarrow.string()),
    pyarrow.field("FORECLOSURE_COSTS", pyarrow.float64()),
    pyarrow.field("PROPERTY_PRESERVATION_AND_REPAIR_COSTS", pyarrow.float64()),
    pyarrow.field("ASSET_RECOVERY_COSTS", pyarrow.float64()),
    pyarrow.field(
        "MISCELLANEOUS_HOLDING_EXPENSES_AND_CREDITS", pyarrow.float64()
    ),
    pyarrow.field("ASSOCIATED_TAXES_FOR_HOLDING_PROPERTY", pyarrow.float64()),
    pyarrow.field("NET_SALES_PROCEEDS", pyarrow.float64()),
    pyarrow.field("CREDIT_ENHANCEMENT_PROCEEDS", pyarrow.float64()),
    pyarrow.field("REPURCHASES_MAKE_WHOLE_PROCEEDS", pyarrow.float64()),
    pyarrow.field("OTHER_FORECLOSURE_PROCEEDS", pyarrow.float64()),
    pyarrow.field("NON_INTEREST_BEARING_UPB", pyarrow.float64()),
    # all null
    pyarrow.field("MI_CANCEL_FLAG", pyarrow.string()),
    pyarrow.field("RE_PROCS_FLAG", pyarrow.string()),
    # all null
    pyarrow.field("LOAN_HOLDBACK_INDICATOR", pyarrow.string()),
    pyarrow.field("SERV_IND", pyarrow.string()),
])

csv_read_options = pyarrow.csv.ReadOptions(
    autogenerate_column_names=False,
    column_names=schema.names,
)
csv_convert_options = pyarrow.csv.ConvertOptions(
    column_types=schema,
    strings_can_be_null=True,
)
csv_parse_options = pyarrow.csv.ParseOptions(delimiter="|")

table = pyarrow.csv.read_csv(
    path,
    read_options=csv_read_options,
    parse_options=csv_parse_options,
    convert_options=csv_convert_options,
)
pq.write_table(table, "fanniemae_2016Q4.parquet", compression="snappy")
```

</details>

and then the benchmark reads that file:

```python
import pyarrow.parquet as pq

pq.read_table("fanniemae_2016Q4.parquet")
```

When timing that with released versions of 13.0.0 vs 14.0.1, I do see a small but consistent difference (2.5 vs 2.9 s for reading the whole file in parallel with the statement above).
Trying to pin it down to something smaller, reading a single binary (string) column for one row group in one thread:

```python
In [6]: f = pq.ParquetFile("fanniemae_2016Q4.parquet")

In [7]: %timeit -n 50 f.read_row_group(0, columns=["LOAN_ID"], use_threads=False)
21.7 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
25.6 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

This also gives a small but consistent difference of 21 ms (13.0.0) vs 25 ms (14.0.1).
