-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/#review71859
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117736>

I don't see the need to loop over the LIST-annotated group's fields. It looks like this is how it was done in the previous version, but I think it would be better to validate that the incoming Type matches the assumptions made by this code: that there is a 3-level structure that matches PARQUET-113. If we don't want to validate the structure on each method call, that's fine for performance. But I think we should still write this code to reflect that there will only ever be a single field. That makes it possible to eliminate the loop and simplify:

```java
Type fieldType = repeatedType.getType(0);
String fieldName = fieldType.getName(); // should be "element"
for (Object element : arrayValues) {
  recordConsumer.startGroup();
  if (element != null) {
    recordConsumer.startField(fieldName, 0);
    writeValue(element, elementInspector, fieldType);
    recordConsumer.endField(fieldName, 0);
  }
  recordConsumer.endGroup();
}
```


ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117737>

Like the comment on writeArray, this could be simplified by making assumptions about the schema passed in. This already makes one: that the type has at least one field that is a group. Simplifying here would get rid of the inner loop over the repeated type (the key and value types would be extracted separately) and get rid of the ternary statements. A sketch of what that might look like follows.
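For illustration only, a minimal sketch of that simplified writeMap, assuming the PARQUET-113 3-level structure (a single repeated group whose first field is the key and whose second field is the value). The method signature and the surrounding members (recordConsumer, writeValue) are assumptions about the patch's code, not a drop-in replacement:

```java
// Sketch: assumes the MAP-annotated group holds exactly one repeated group
// with two fields, key at index 0 and value at index 1 (PARQUET-113).
// GroupType/Type come from parquet.schema; recordConsumer and writeValue
// are the enclosing writer's members.
private void writeMap(Object value, MapObjectInspector inspector, GroupType type) {
  GroupType repeatedType = type.getType(0).asGroupType();
  Type keyType = repeatedType.getType(0);   // name should be "key"
  Type valueType = repeatedType.getType(1); // name should be "value"

  Map<?, ?> map = inspector.getMap(value);  // assumed non-null here
  recordConsumer.startField(repeatedType.getName(), 0);
  for (Map.Entry<?, ?> entry : map.entrySet()) {
    recordConsumer.startGroup();

    recordConsumer.startField(keyType.getName(), 0);
    writeValue(entry.getKey(), inspector.getMapKeyObjectInspector(), keyType);
    recordConsumer.endField(keyType.getName(), 0);

    // Map values are optional; only write the field when one is present.
    if (entry.getValue() != null) {
      recordConsumer.startField(valueType.getName(), 1);
      writeValue(entry.getValue(), inspector.getMapValueObjectInspector(), valueType);
      recordConsumer.endField(valueType.getName(), 1);
    }

    recordConsumer.endGroup();
  }
  recordConsumer.endField(repeatedType.getName(), 0);
}
```

Extracting keyType and valueType once, before the loop over entries, is what removes both the inner loop over the repeated type and the ternaries.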
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117739>

Rather than removing all of the schema strings, could you verify that the converted schema matches the expected one? I know it's technically testing a different class, but there's much less magic in the test if you can see the target Parquet schema. (A sketch of such a check appears at the end of this message.)


ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117740>

These name changes should be reverted; the previous names are required by PARQUET-113.


- Ryan Blue


On Jan. 29, 2015, 9:12 a.m., Sergio Pena wrote:

> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30281/
> -----------------------------------------------------------
>
> (Updated Jan. 29, 2015, 9:12 a.m.)
>
>
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.
>
>
> Bugs: HIVE-9333
>     https://issues.apache.org/jira/browse/HIVE-9333
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> This patch moves the ParquetHiveSerDe.serialize() implementation to the
> DataWritableWriter class in order to save time in materializing data on
> serialize().
>
>
> Diffs
> -----
>
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java ea4109d358f7c48d1e2042e5da299475de4a0a29
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 9caa4ed169ba92dbd863e4a2dc6d06ab226a4465
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java 060b1b722d32f3b2f88304a1a73eb249e150294b
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 41b5f1c3b0ab43f734f8a211e3e03d5060c75434
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java a693aff18516d133abf0aae4847d3fe00b9f1c96
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java 667d3671547190d363107019cd9a2d105d26d336
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 007a665529857bcec612f638a157aa5043562a15
>   serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetWritable.java PRE-CREATION
>
> Diff: https://reviews.apache.org/r/30281/diff/
>
>
> Testing
> -------
>
> The tests run were the following:
>
> 1. JMH (Java microbenchmark)
>
> This benchmark called the Parquet serialize/write methods using Text writable objects.
>
> Class.method                 Before change (ops/s)   After change (ops/s)
> -------------------------------------------------------------------------
> ParquetHiveSerDe.serialize:                 19,113                249,528   -> ~13x speed increase
> DataWritableWriter.write:                    5,033                  5,201   -> 3.34% speed increase
>
> 2. Write 20 million rows (~1 GB file) from text to Parquet
>
> I wrote a ~1 GB file in TEXTFILE format, then converted it to Parquet using the following statement:
>
>     CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;
>
> Time (s) it took to write the whole file BEFORE the changes: 93.758 s
> Time (s) it took to write the whole file AFTER the changes:  83.903 s
>
> That is about a 10% speed increase.
>
>
> Thanks,
>
> Sergio Pena
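Following up on the TestDataWritableWriter comment above, here is a sketch of what verifying the converted schema could look like. The column ("arr" of type array<int>) and the expected schema string are hypothetical, and the exact group/field names must follow the PARQUET-113 compatibility rules; HiveSchemaConverter is the class that performs the conversion:

```java
import static org.junit.Assert.assertEquals;

import java.util.Arrays;

import org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.junit.Test;

import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

public class TestConvertedSchema {

  @Test
  public void testListConversionMatchesExpectedSchema() {
    // Hypothetical Hive column: arr array<int>. The group and field names
    // below are illustrative; they must match what PARQUET-113 requires.
    MessageType expected = MessageTypeParser.parseMessageType(
        "message hive_schema {\n"
      + "  optional group arr (LIST) {\n"
      + "    repeated group bag {\n"
      + "      optional int32 array_element;\n"
      + "    }\n"
      + "  }\n"
      + "}\n");

    MessageType converted = HiveSchemaConverter.convert(
        Arrays.asList("arr"),
        Arrays.asList(TypeInfoUtils.getTypeInfoFromTypeString("array<int>")));

    assertEquals(expected, converted);
  }
}
```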
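Similarly, for the JMH numbers quoted in the testing section, a rough sketch of what such a microbenchmark could look like. The two-column layout and row values are invented for illustration, and the era-appropriate SerDe.initialize(Configuration, Properties) API is assumed:

```java
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class ParquetSerializeBench {

  private ParquetHiveSerDe serde;
  private ArrayWritable row;

  @Setup
  public void setup() throws SerDeException {
    serde = new ParquetHiveSerDe();
    Properties props = new Properties();
    props.setProperty("columns", "name,age");      // hypothetical schema
    props.setProperty("columns.types", "string:int");
    serde.initialize(new Configuration(), props);
    // A Text/IntWritable row matching the declared columns.
    row = new ArrayWritable(Writable.class,
        new Writable[] { new Text("value"), new IntWritable(1) });
  }

  @Benchmark
  public Writable serialize() throws SerDeException {
    return serde.serialize(row, serde.getObjectInspector());
  }
}
```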