randolf-scholz commented on issue #37055: URL: https://github.com/apache/arrow/issues/37055#issuecomment-1772555892
@js8544 The dataset in question is the table "hosp/labevents.csv" from the MIMIC-IV dataset: https://physionet.org/content/mimiciv/2.2/. I have since changed my own preprocessing, so it doesn't really affect me anymore, but I was able to reproduce it in pyarrow 13:

1. Read the CSV file, parsing the `"value"` column to `dictionary[int32, string]`.
2. `%timeit table["value"].value_counts()`: 10.5 s ± 102 ms (on desktop; it was worse on a laptop with fewer cores)
3. `%timeit table["value"].combine_chunks().value_counts()`: 1.29 s ± 12.9 ms

The stats of the data are:

- `length`: 118,171,367
- `null_count`: 19,803,023 (~17%)
- `num_chunks`: 13095
- `num_unique`: 39160
- binary entropy (non-null): 9.48 bits
- [normalized entropy](https://mc-stan.org/posterior/reference/entropy.html): 62%
