[ 
https://issues.apache.org/jira/browse/ARROW-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-13965.
------------------------------------
    Fix Version/s: 6.0.0
       Resolution: Fixed

Issue resolved by pull request 11131
[https://github.com/apache/arrow/pull/11131]

> [C++] dynamic_casts in parquet TypedColumnWriterImpl impacting performance
> --------------------------------------------------------------------------
>
>                 Key: ARROW-13965
>                 URL: https://issues.apache.org/jira/browse/ARROW-13965
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>         Environment: arrow 6.0.0-SNAPSHOT on both RHEL8 (gcc 8.4.1) and macOS 
> 11.5.2 (clang 11.0.0)
>            Reporter: Edward Seidl
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 6.0.0
>
>         Attachments: arrow_downcast.patch
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The methods WriteDictionaryPage(), CheckDictionarySizeLimit(), WriteValues(), 
> and WriteValuesSpaced() in TypedColumnWriterImpl 
> (cpp/src/parquet/column_writer.cc) perform dynamic_casts of the current_encoder_ 
> object to either DictEncoder or ValueEncoderType pointers.  When calling 
> WriteBatch() with a large number of values the cost is negligible, but when 
> writing batches of 1 (as when using the stream API), these dynamic casts can 
> consume a great deal of CPU.  Using gperftools against code I wrote to do a 
> log-structured merge of several parquet files, I measured the dynamic_casts 
> taking as much as 25% of execution time.
> By modifying TypedColumnWriterImpl to save downcast observer pointers of 
> the appropriate types, I was able to cut my execution time from 32 to 24 
> seconds, validating the gperftools results.  I've attached a patch to show 
> what I did.
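
For readers without the patch at hand, the general technique described above
(downcast once, keep non-owning observer pointers, reuse them on every write)
can be sketched roughly as follows.  This is not the attached patch and not
the actual parquet code: Encoder, ValueEncoder, PlainEncoder, DictEncoder and
ColumnWriterSketch are simplified, hypothetical stand-ins, and the int32-only
value type is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the parquet encoder hierarchy.
struct Encoder {
  virtual ~Encoder() = default;
};

struct ValueEncoder : Encoder {
  virtual void Put(const int32_t* values, int n) = 0;
};

struct PlainEncoder : ValueEncoder {
  std::vector<int32_t> values;
  void Put(const int32_t* v, int n) override { values.insert(values.end(), v, v + n); }
};

struct DictEncoder : ValueEncoder {
  std::map<int32_t, int> dict;   // value -> dictionary index
  std::vector<int> indices;      // one index per written value
  void Put(const int32_t* v, int n) override {
    for (int i = 0; i < n; ++i) {
      auto it = dict.emplace(v[i], static_cast<int>(dict.size())).first;
      indices.push_back(it->second);
    }
  }
  int num_entries() const { return static_cast<int>(dict.size()); }
};

class ColumnWriterSketch {
 public:
  // The dynamic_casts happen once, here, instead of on every write call.
  explicit ColumnWriterSketch(std::unique_ptr<Encoder> encoder)
      : encoder_(std::move(encoder)),
        value_encoder_(dynamic_cast<ValueEncoder*>(encoder_.get())),
        dict_encoder_(dynamic_cast<DictEncoder*>(encoder_.get())) {
    assert(value_encoder_ != nullptr);
  }

  // "Before" shape: a dynamic_cast per call, costly when n == 1.
  void WriteValuesCastEachCall(const int32_t* values, int n) {
    dynamic_cast<ValueEncoder*>(encoder_.get())->Put(values, n);
    num_written_ += n;
  }

  // "After" shape: reuse the cached observer pointer.
  void WriteValues(const int32_t* values, int n) {
    value_encoder_->Put(values, n);
    num_written_ += n;
  }

  bool uses_dictionary() const { return dict_encoder_ != nullptr; }
  int dict_entries() const { return dict_encoder_ ? dict_encoder_->num_entries() : 0; }
  int64_t num_written() const { return num_written_; }

 private:
  std::unique_ptr<Encoder> encoder_;  // owns the encoder
  ValueEncoder* value_encoder_;       // non-owning, cached downcast
  DictEncoder* dict_encoder_;         // non-owning, null if not a dictionary encoder
  int64_t num_written_ = 0;
};

// Mimics stream-style usage: `count` batches of exactly one value each.
inline void WriteOneAtATime(ColumnWriterSketch& writer, int count) {
  for (int i = 0; i < count; ++i) {
    int32_t value = i % 4;
    writer.WriteValues(&value, 1);
  }
}
```

The cached pointers must be refreshed whenever the owning pointer is replaced
(for example, if the writer swaps encoders partway through a column);
otherwise they would dangle.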



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
