[jira] [Commented] (ARROW-785) possible issue on writing parquet via pyarrow, subsequently read in Hive

Ashima Sood (JIRA) Fri, 07 Apr 2017 10:15:06 -0700

    [ https://issues.apache.org/jira/browse/ARROW-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961132#comment-15961132 ]


Ashima Sood commented on ARROW-785:
-----------------------------------

I'm using Zeppelin to create a table on top of the parquet file with the
command below:
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS schema_abc.parquet_table_name(
  YEAR INT
, WORD STRING
)
STORED AS PARQUET
LOCATION 's3://bucket_name/folder/parquet_files/'

***Please note: the parquet_files folder contains the testFile.parquet file.

Describing the table:

%spark.sql
describe table schema_abc.parquet_table_name

Gives:
col_name   data_type  comment
YEAR    int     null
WORD    string  null


But when I run a select query to read the table, it gives me the error below:

%spark.sql
select * from schema_abc.parquet_table_name

java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
        at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
        at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


Since I was getting the above error, I wanted to check whether the parquet
file really contains the data, so I used Apache Drill to view it. The output
is below:


user@server:parth_to/parquet-drill/apache-drill-1.10.0$ bin/drill-embedded
Apr 07, 2017 1:04:51 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.10.0
"drill baby drill"
0: jdbc:drill:zk=local> select * from dfs.`/path_to/parquet-drill/apache-drill-1.10.0/sample-data/testFile.parquet`;
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+-------+--------------+
| YEAR  |     WORD     |
+-------+--------------+
| 2017  | null         |
| 2018  | [B@5bd466f2  |
+-------+--------------+
2 rows selected (1.433 seconds)


Input text file:
YEAR|WORD
2017|
2018|Word 2

(I've left WORD empty in the first row to test whether a null works.)
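
For reference, here is a minimal sketch of how the pipe-delimited input above
could be parsed so that the empty WORD field becomes a null. It uses only the
Python standard library; the helper name parse_rows is hypothetical, and this
is not necessarily how the original file was loaded.

```python
import csv
import io

# The test input shown above, inlined so the sketch is self-contained.
raw = "YEAR|WORD\n2017|\n2018|Word 2\n"


def parse_rows(text):
    """Parse pipe-delimited text into (year, word) tuples.

    An empty WORD field is mapped to None so it can be written as a null.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    rows = []
    for record in reader:
        year = int(record["YEAR"])
        word = record["WORD"] or None  # empty string -> None
        rows.append((year, word))
    return rows


rows = parse_rows(raw)
print(rows)  # [(2017, None), (2018, 'Word 2')]
```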


> possible issue on writing parquet via pyarrow, subsequently read in Hive
> ------------------------------------------------------------------------
>
>                 Key: ARROW-785
>                 URL: https://issues.apache.org/jira/browse/ARROW-785
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Jeff Reback
>            Priority: Minor
>             Fix For: 0.3.0
>
>
> details here: 
> http://stackoverflow.com/questions/43268872/parquet-creation-conversion-from-pandas-dataframe-to-pyarrow-table-not-working-f
> This round trips in pandas->parquet->pandas just fine on released pandas 
> (0.19.2) and pyarrow (0.2).
> OP states that it is not readable in Hive, however.



