jorisvandenbossche commented on PR #38784: URL: https://github.com/apache/arrow/pull/38784#issuecomment-1825335216
To make it easier to reproduce locally (the benchmark code itself is not that easy to re-use), I extracted the necessary bits into a small script. This creates the parquet file:

<details>

```python
import pyarrow
import pyarrow.csv
import pyarrow.parquet as pq

# download https://ursa-qa.s3.amazonaws.com/fanniemae_loanperf/2016Q4.csv.gz
path = "Downloads/2016Q4.csv.gz"

schema = pyarrow.schema([
    pyarrow.field("LOAN_ID", pyarrow.string()),
    # date. Monthly reporting period
    pyarrow.field("ACT_PERIOD", pyarrow.string()),
    pyarrow.field("SERVICER", pyarrow.string()),
    pyarrow.field("ORIG_RATE", pyarrow.float64()),
    pyarrow.field("CURRENT_UPB", pyarrow.float64()),
    pyarrow.field("LOAN_AGE", pyarrow.int32()),
    pyarrow.field("REM_MONTHS", pyarrow.int32()),
    pyarrow.field("ADJ_REM_MONTHS", pyarrow.int32()),
    # maturity date
    pyarrow.field("MATR_DT", pyarrow.string()),
    # Metropolitan Statistical Area code
    pyarrow.field("MSA", pyarrow.string()),
    # Int of months, but `X` is a valid value. New versions pad with `0`/`X` to two characters
    pyarrow.field("DLQ_STATUS", pyarrow.string()),
    pyarrow.field("RELOCATION_MORTGAGE_INDICATOR", pyarrow.string()),
    # 0-padded 2 digit ints representing categorical levels, e.g. "01" -> "Prepaid or Matured"
    pyarrow.field("Zero_Bal_Code", pyarrow.string()),
    pyarrow.field("ZB_DTE", pyarrow.string()),
    # date
    pyarrow.field("LAST_PAID_INSTALLMENT_DATE", pyarrow.string()),
    pyarrow.field("FORECLOSURE_DATE", pyarrow.string()),
    pyarrow.field("DISPOSITION_DATE", pyarrow.string()),
    pyarrow.field("FORECLOSURE_COSTS", pyarrow.float64()),
    pyarrow.field("PROPERTY_PRESERVATION_AND_REPAIR_COSTS", pyarrow.float64()),
    pyarrow.field("ASSET_RECOVERY_COSTS", pyarrow.float64()),
    pyarrow.field(
        "MISCELLANEOUS_HOLDING_EXPENSES_AND_CREDITS", pyarrow.float64()
    ),
    pyarrow.field("ASSOCIATED_TAXES_FOR_HOLDING_PROPERTY", pyarrow.float64()),
    pyarrow.field("NET_SALES_PROCEEDS", pyarrow.float64()),
    pyarrow.field("CREDIT_ENHANCEMENT_PROCEEDS", pyarrow.float64()),
    pyarrow.field("REPURCHASES_MAKE_WHOLE_PROCEEDS", pyarrow.float64()),
    pyarrow.field("OTHER_FORECLOSURE_PROCEEDS", pyarrow.float64()),
    pyarrow.field("NON_INTEREST_BEARING_UPB", pyarrow.float64()),
    # all null
    pyarrow.field("MI_CANCEL_FLAG", pyarrow.string()),
    pyarrow.field("RE_PROCS_FLAG", pyarrow.string()),
    # all null
    pyarrow.field("LOAN_HOLDBACK_INDICATOR", pyarrow.string()),
    pyarrow.field("SERV_IND", pyarrow.string()),
])

csv_read_options = pyarrow.csv.ReadOptions(
    autogenerate_column_names=False,
    column_names=schema.names,
)
csv_convert_options = pyarrow.csv.ConvertOptions(
    column_types=schema,
    strings_can_be_null=True,
)
csv_parse_options = pyarrow.csv.ParseOptions(delimiter="|")

table = pyarrow.csv.read_csv(
    path,
    read_options=csv_read_options,
    parse_options=csv_parse_options,
    convert_options=csv_convert_options,
)
pq.write_table(table, "fanniemae_2016Q4.parquet", compression="snappy")
```

</details>

and then the benchmark reads that file:

```python
import pyarrow.parquet as pq

pq.read_table("fanniemae_2016Q4.parquet")
```

When timing that with released versions of 13.0.0 vs 14.0.1, I do see a small but consistent difference (2.5 vs 2.9 s for reading the whole file in parallel with the statement above).
Trying to pin it down to something smaller, reading a single binary (string) column for one row group in one thread:

```python
In [6]: f = pq.ParquetFile("fanniemae_2016Q4.parquet")

In [7]: %timeit -n 50 f.read_row_group(0, columns=["LOAN_ID"], use_threads=False)
21.7 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
25.6 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

This also gives a small but consistent difference of 21 ms (13.0.0) vs 25 ms (14.0.1).
