[ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961132#comment-15961132 ]
Ashima Sood commented on ARROW-785:
-----------------------------------

I'm using Zeppelin to create a table on top of the parquet file using the command below:

%sql
CREATE EXTERNAL TABLE IF NOT EXISTS schema_abc.parquet_table_name(
  YEAR INT ,
  WORD STRING
)
STORED AS PARQUET
LOCATION 's3://bucket_name/folder/parquet_files/'

***Please note: the parquet_files folder has the testFile.parquet file in it.

Describing the table:

%spark.sql
describe table schema_abc.parquet_table_name

Gives:

col_name  data_type  comment
YEAR      int        null
WORD      string     null

But when I run a select query to read the table, it gives me the error below:

%spark.sql
select * from schema_abc.parquet_table_name

java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
	at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Since I was getting the above error, I wanted to check whether the parquet file really contains the data, so I used Apache Drill to view it. The output is below:

user@server:parth_to/parquet-drill/apache-drill-1.10.0$ bin/drill-embedded
Apr 07, 2017 1:04:51 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.10.0
"drill baby drill"
0: jdbc:drill:zk=local> select * from dfs.`/path_to/parquet-drill/apache-drill-1.10.0/sample-data/testFile.parquet`;
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+-------+--------------+
| YEAR  | WORD         |
+-------+--------------+
| 2017  | null         |
| 2018  | [B@5bd466f2  |
+-------+--------------+
2 rows selected (1.433 seconds)

Input text file:

YEAR|WORD
2017|
2018|Word 2

(I've put a null in to test whether that works.)

> possible issue on writing parquet via pyarrow, subsequently read in Hive
> ------------------------------------------------------------------------
>
>                 Key: ARROW-785
>                 URL: https://issues.apache.org/jira/browse/ARROW-785
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Jeff Reback
>            Priority: Minor
>             Fix For: 0.3.0
>
>
> details here:
> http://stackoverflow.com/questions/43268872/parquet-creation-conversion-from-pandas-dataframe-to-pyarrow-table-not-working-f
> This round trips in pandas->parquet->pandas just fine on released pandas
> (0.19.2) and pyarrow (0.2).
> OP states that it is not readable in Hive however.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)