[ https://issues.apache.org/jira/browse/DRILL-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jacques Nadeau resolved DRILL-389.
----------------------------------
    Resolution: Fixed

> Nested Parquet data generated from Hive does not work
> -----------------------------------------------------
>
>                 Key: DRILL-389
>                 URL: https://issues.apache.org/jira/browse/DRILL-389
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: m1
>         Environment: CentOS 6.3
>                      CDH 4.6 installed by Cloudera Manager Free Edition
>                      Hive 0.10.0
>            Reporter: Thaddeus Diamond
>            Assignee: Jason Altekruse
>            Priority: Critical
>             Fix For: 0.5.0
>
>         Attachments: avro_test.db, nobench.ddl, nobench_1.avsc, parquet-nobench_0.parquet
>
>
> Inside of Hive, I generated Parquet data from Avro data as follows. Using the attached Avro file (avro_test.db) and the attached nested Avro schema (nobench_1.avsc), I created a Hive table:
> {noformat}
> CREATE TABLE avro_nobench_hdfs
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> LOCATION 'hdfs:///user/hdfs/avro'
> TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hdfs/nobench.avsc');
> {noformat}
> Note that this schema is loosely based on the NoBench standard proposed by Craig Chasseur for JSON (http://pages.cs.wisc.edu/~chasseur/).
> To create a Parquet Hive table you need to spell out the full schema. The attached one is very large, so I dumped it with:
> {noformat}
> sudo -u hdfs hive -e 'describe avro_nobench_hdfs' > /tmp/temp.sql
> {noformat}
> Then I replaced each "from deserializer" annotation with a comma and wrapped the result in the following SQL DDL:
> {noformat}
> CREATE TABLE avro_nobench_parquet (
>   -- ... COLUMNS HERE
> )
> ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
> STORED AS
> INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
> OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat";
> {noformat}
> Finally, I generated the actual Parquet binary data using {{INSERT OVERWRITE}}:
> {noformat}
> INSERT OVERWRITE TABLE avro_nobench_parquet SELECT * FROM avro_nobench_hdfs;
> {noformat}
> This completed successfully. The data was then validated using:
> {noformat}
> SELECT COUNT(*) FROM avro_nobench_parquet;
> SELECT * FROM avro_nobench_parquet LIMIT 1;
> {noformat}
> If you look in {{hdfs:///user/hive/warehouse/avro_nobench_parquet}} you'll see a single raw file (something like {{0000_0}}). Download it locally:
> {noformat}
> sudo -u hdfs hdfs dfs -copyToLocal /user/hive/warehouse/avro_nobench_parquet/* .
> {noformat}
> Then, in Drill, I ran:
> {noformat}
> SELECT COUNT(*) FROM "nobench.parquet";
> {noformat}
> And got the following:
> {noformat}
> Caused by: org.apache.drill.exec.rpc.RpcException: Remote failure while running query.
> [error_id: "a13783d0-d9da-4639-8809-ba4a5ac54e04"
> endpoint {
>   address: "ip-10-101-1-82.ec2.internal"
>   user_port: 31010
>   bit_port: 32011
> }
> error_type: 0
> message: "Failure while running fragment.
> NullPointerException"
> ]
> 	at org.apache.drill.exec.rpc.user.QueryResultHandler.batchArrived(QueryResultHandler.java:72)
> 	at org.apache.drill.exec.rpc.user.UserClient.handle(UserClient.java:79)
> 	at org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:48)
> 	at org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:33)
> 	at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:142)
> 	at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:127)
> 	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
> 	at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
> 	at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
> 	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
> 	at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
> 	at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
> 	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:173)
> 	at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:334)
> 	at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:320)
> 	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
> 	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:100)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:497)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:465)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:359)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
> 	at java.lang.Thread.run(Thread.java:744)
> {noformat}
> The second time I run it I get an OOM:
> {noformat}
> Exception in thread "WorkManager-3" java.lang.OutOfMemoryError: Java heap space
> 	at org.apache.drill.exec.store.parquet.PageReadStatus.<init>(PageReadStatus.java:41)
> 	at org.apache.drill.exec.store.parquet.ColumnReader.<init>(ColumnReader.java:70)
> 	at org.apache.drill.exec.store.parquet.VarLenBinaryReader$NullableVarLengthColumn.<init>(VarLenBinaryReader.java:62)
> 	at org.apache.drill.exec.store.parquet.ParquetRecordReader.<init>(ParquetRecordReader.java:167)
> 	at org.apache.drill.exec.store.parquet.ParquetRecordReader.<init>(ParquetRecordReader.java:99)
> 	at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:60)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.visitSubScan(ImplCreator.java:103)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.visitSubScan(ImplCreator.java:63)
> 	at org.apache.drill.exec.store.parquet.ParquetRowGroupScan.accept(ParquetRowGroupScan.java:107)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.visitProject(ImplCreator.java:90)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.visitProject(ImplCreator.java:63)
> 	at org.apache.drill.exec.physical.config.Project.accept(Project.java:51)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.visitSort(ImplCreator.java:121)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.visitSort(ImplCreator.java:63)
> 	at org.apache.drill.exec.physical.config.Sort.accept(Sort.java:58)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.visitStreamingAggregate(ImplCreator.java:151)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.visitStreamingAggregate(ImplCreator.java:63)
> 	at org.apache.drill.exec.physical.config.StreamingAggregate.accept(StreamingAggregate.java:59)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:173)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.visitScreen(ImplCreator.java:132)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.visitScreen(ImplCreator.java:63)
> 	at org.apache.drill.exec.physical.config.Screen.accept(Screen.java:102)
> 	at org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:180)
> 	at org.apache.drill.exec.work.foreman.RunningFragmentManager.runFragments(RunningFragmentManager.java:84)
> 	at org.apache.drill.exec.work.foreman.Foreman.runPhysicalPlan(Foreman.java:228)
> 	at org.apache.drill.exec.work.foreman.Foreman.parseAndRunLogicalPlan(Foreman.java:176)
> 	at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:153)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
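[Editor's note] The reporter's schema-rewrite step ("replaced the 'from deserializer' with commas") can be sketched as a one-line `sed` over the `describe` dump. This is a hypothetical reconstruction, not taken from the issue: the two sample column lines below are illustrative stand-ins, since the real nobench schema is only in the attachments, and the exact whitespace Hive emits may differ.

```shell
# Illustrative stand-in for `sudo -u hdfs hive -e 'describe avro_nobench_hdfs' > /tmp/temp.sql`:
# Hive's `describe` prints one "<name> <type> from deserializer" line per column.
printf 'str1\tstring\tfrom deserializer\nnum\tint\tfrom deserializer\n' > /tmp/temp.sql

# Replace the trailing "from deserializer" annotation with a comma,
# turning the dump into a column list usable inside the CREATE TABLE DDL.
sed 's/[[:space:]]*from deserializer$/,/' /tmp/temp.sql > /tmp/columns.sql

cat /tmp/columns.sql
```

Note that the last column's trailing comma still has to be removed by hand before the `CREATE TABLE avro_nobench_parquet (...)` statement will parse.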