[ https://issues.apache.org/jira/browse/SPARK-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoltan Fedor updated SPARK-13572:
---------------------------------
Description:
I am using PySpark to read Avro-based tables from Hive. The tables can be
read, but some of the columns are read incorrectly, showing the value
"None" instead of the actual value.
>>> results_df = sqlContext.sql("""SELECT * FROM trmdw_prod.opsconsole_ingest
... where year=2016 and month=2 and day=29""")
>>> results_df.take(3)
[Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None,
uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None,
statvalue=None, displayname=None, category=None,
source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None,
uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None,
statvalue=None, displayname=None, category=None,
source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None,
uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None,
statvalue=None, displayname=None, category=None,
source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29)]
Note the "None" values in most of the fields.
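To make the affected columns easier to spot, the rows can be checked
programmatically. The following is only a minimal sketch: the helper name
all_null_columns and the abbreviated sample data are hypothetical, and it
assumes the rows have first been converted to dicts, for example with
[r.asDict() for r in results_df.take(3)].

```python
def all_null_columns(rows):
    """Return the names of columns whose value is None in every row.

    `rows` is a list of dicts, e.g. [r.asDict() for r in results_df.take(3)].
    """
    if not rows:
        return []
    # Use the first row's keys as the column list, in their original order.
    return [name for name in rows[0]
            if all(r.get(name) is None for r in rows)]


# Hypothetical sample shaped like the output above, abbreviated to four columns.
sample = [
    {"kafkaoffset": None, "uuid": None,
     "source_filename": "ops-20160228_23_35_01.gz", "year": 2016},
    {"kafkaoffset": None, "uuid": None,
     "source_filename": "ops-20160228_23_35_01.gz", "year": 2016},
]
print(all_null_columns(sample))
```

On the real table this would be expected to list the twelve Avro-backed
columns while leaving out source_filename and the partition columns year,
month and day.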
Running the same query in Hive returns the actual values:
c:hive2://xyz.com:100> SELECT * FROM trmdw_prod.opsconsole_ingest where year=2016 and month=2 and day=29 limit 3;
+-----------------------------------------+----------------------------------+-------------------------------+----------------------------------+--------------------------------------+--------------------------------------+---------------------------+---------------------------+----------------------------+-----------------------------+-------------------------------+----------------------------+-----------------------------------+------------------------+-------------------------+-----------------------+
| opsconsole_ingest.kafkaoffsetgeneration | opsconsole_ingest.kafkapartition | opsconsole_ingest.kafkaoffset | opsconsole_ingest.uuid           | opsconsole_ingest.mid                | opsconsole_ingest.iid                | opsconsole_ingest.product | opsconsole_ingest.utctime | opsconsole_ingest.statcode | opsconsole_ingest.statvalue | opsconsole_ingest.displayname | opsconsole_ingest.category | opsconsole_ingest.source_filename | opsconsole_ingest.year | opsconsole_ingest.month | opsconsole_ingest.day |
+-----------------------------------------+----------------------------------+-------------------------------+----------------------------------+--------------------------------------+--------------------------------------+---------------------------+---------------------------+----------------------------+-----------------------------+-------------------------------+----------------------------+-----------------------------------+------------------------+-------------------------+-----------------------+
| 11.0                                    | 0.0                              | 3.83399394E8                  | EF0D03C409681B98646F316CA1088973 | 174f53fb-ca9b-d3f9-64e1-7631bf906817 | 00000000-0000-0000-0000-000000000000 | est                       | 2016-01-13T06:58:19       | 8                          | 3.0 SP11 (8.110.7601.18923) | MSXML 3.0 Version             | PC Information             | ops-20160228_23_35_01.gz          | 2016                   | 2                       | 29                    |
| 11.0                                    | 0.0                              | 3.83399395E8                  | EF0D03C409681B98646F316CA1088973 | 174f53fb-ca9b-d3f9-64e1-7631bf906817 | 00000000-0000-0000-0000-000000000000 | est                       | 2016-01-13T06:58:19       | 2                          | GenuineIntel                | CPU Vendor                    | PC Information             | ops-20160228_23_35_01.gz          | 2016                   | 2                       | 29                    |
| 11.0                                    | 0.0                              | 3.83399396E8                  | EF0D03C409681B98646F316CA1088973 | 174f53fb-ca9b-d3f9-64e1-7631bf906817 | 00000000-0000-0000-0000-000000000000 | est                       | 2016-01-13T06:58:19       | 141                        | 4                           | Screens                       | PC Information             | ops-20160228_23_35_01.gz          | 2016                   | 2                       | 29                    |
+-----------------------------------------+----------------------------------+-------------------------------+----------------------------------+--------------------------------------+--------------------------------------+---------------------------+---------------------------+----------------------------+-----------------------------+-------------------------------+----------------------------+-----------------------------------+------------------------+-------------------------+-----------------------+
3 rows selected (1.252 seconds)
The attached logs show that Spark generates no error or warning messages.
The table definition is also attached.
was:
I am using PySpark to read Avro-based tables from Hive. The tables can be
read, but some of the columns are read incorrectly, showing the value
"None" instead of the actual value.
> HiveContext reads avro Hive tables incorrectly
> -----------------------------------------------
>
> Key: SPARK-13572
> URL: https://issues.apache.org/jira/browse/SPARK-13572
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.2
> Environment: Hive 0.13.1, Spark 1.5.2
> Reporter: Zoltan Fedor
> Attachments: logs, table_definition