[ https://issues.apache.org/jira/browse/HIVE-7144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039731#comment-14039731 ]
Gopal V commented on HIVE-7144:
-------------------------------

Benchmark insert of TPC-H 1Tb scale data

|| Table || Before || After || Diff ||
| part | 46.497 | 45.508 | 0.989 |
| partsupp | 145.031 | 144.841 | 0.19 |
| customer | 55.315 | 55.745 | -0.430 |
| orders | 246.692 | 217.834 | 28.858 |
| lineitem | 959.995 | 875.659 | 84.336 |

This makes negligible difference to the smaller tables, but gives a ~10% boost for orders (50.8Gb ORC) and lineitem (224.5Gb ORC).

The optimization does not benefit the sorted string columns, because for every row read there's a data-copy to populate either the minimum or maximum in the column statistics section (Strings are immutable, Text is Writable).

> GC pressure during ORC StringDictionary writes
> -----------------------------------------------
>
> Key: HIVE-7144
> URL: https://issues.apache.org/jira/browse/HIVE-7144
> Project: Hive
> Issue Type: Bug
> Components: File Formats
> Affects Versions: 0.14.0
> Environment: ORC Table ~ 12 string columns
> Reporter: Gopal V
> Assignee: Gopal V
> Labels: ORC, Performance
> Attachments: HIVE-7144.1.patch, orc-string-write.png
>
>
> When ORC string dictionary writes data out, it suffers from bad GC
> performance due to a few allocations in-loop.
> !orc-string-write.png!
> The conversions are as follows
> StringTreeWriter::getStringValue() causes 2 conversions
> LazyString -> Text (LazyString::getWritableObject)
> Text -> String (LazyStringObjectInspector::getPrimitiveJavaObject)
> Then StringRedBlackTree::add() does one conversion
> String -> Text
> This causes some GC pressure with un-necessary String and byte[] array
> allocations.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
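
For illustration, the conversion chain in the quoted description (bytes to Text, Text to String, String back to Text) can be mimicked with JDK-only stand-ins. This is a minimal sketch, not the Hive code and not the attached patch: `Utf8Buffer` is a hypothetical substitute for Hadoop's Text, and `viaString` and `direct` are made-up names. It only shows why the round trip through String costs a String and a byte[] allocation per row, while copying into a reused buffer does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ConversionChainSketch {

    /** Hypothetical stand-in for a reusable, mutable UTF-8 holder (similar in spirit to Hadoop's Text). */
    static final class Utf8Buffer {
        byte[] bytes = new byte[0];
        int length = 0;

        /** Copies len bytes from src into this buffer, growing the backing array only when needed. */
        void set(byte[] src, int offset, int len) {
            if (bytes.length < len) {
                bytes = new byte[len];
            }
            System.arraycopy(src, offset, bytes, 0, len);
            length = len;
        }

        /** Returns a trimmed copy of the valid bytes (used here only for comparison). */
        byte[] copyBytes() {
            return Arrays.copyOf(bytes, length);
        }
    }

    /** Round trip through String: allocates a new String and a new byte[] on every call. */
    static byte[] viaString(byte[] row) {
        String s = new String(row, StandardCharsets.UTF_8); // bytes -> String
        return s.getBytes(StandardCharsets.UTF_8);          // String -> bytes
    }

    /** Direct copy into a reused buffer: no per-row allocation once the buffer is large enough. */
    static void direct(byte[] row, Utf8Buffer reuse) {
        reuse.set(row, 0, row.length);
    }

    public static void main(String[] args) {
        byte[][] rows = {
            "orders".getBytes(StandardCharsets.UTF_8),
            "lineitem".getBytes(StandardCharsets.UTF_8)
        };
        Utf8Buffer reuse = new Utf8Buffer();
        for (byte[] row : rows) {
            byte[] roundTripped = viaString(row);
            direct(row, reuse);
            // Both paths yield the same bytes; only the allocation pattern differs.
            System.out.println(Arrays.equals(roundTripped, reuse.copyBytes()));
        }
    }
}
```

Whether the patch removes the intermediate String in this way is not stated in the comment above; the sketch only contrasts the two allocation patterns.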