[
https://issues.apache.org/jira/browse/SPARK-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltan Fedor updated SPARK-13572:
---------------------------------
Description:
I am using PySpark to read Avro-based tables from Hive. The tables can be
read, but some of the columns are read incorrectly: they show the value
"None" instead of the actual value.
>>> results_df = sqlContext.sql("""SELECT * FROM trmdw_prod.opsconsole_ingest
... where year=2016 and month=2 and day=29 limit 3""")
>>> results_df.take(3)
[Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None,
uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None,
statvalue=None, displayname=None, category=None,
source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None,
uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None,
statvalue=None, displayname=None, category=None,
source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None,
uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None,
statvalue=None, displayname=None, category=None,
source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29)]
Note the "None" values in most of the fields. Surprisingly, only some of the
fields show "None" instead of the real values, not all of them. The table
definition does not show anything special about these columns.
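To narrow down which columns are affected, the returned rows can be checked for fields that are None in every row. The following is a minimal sketch in plain Python and is not part of the original report: `all_none_columns` is a hypothetical helper, and the sample data is an abridged, illustrative stand-in for the output above. With PySpark, the input could be built as `[r.asDict() for r in results_df.take(3)]`.

```python
def all_none_columns(rows):
    """Return the names of columns whose value is None in every row.

    `rows` is a list of dicts mapping column name to value
    (hypothetical input shape, e.g. built from Row.asDict()).
    """
    if not rows:
        return []
    # Follow the column order of the first row (dicts keep insertion
    # order in Python 3.7+).
    return [name for name in rows[0]
            if all(row.get(name) is None for row in rows)]


# Illustrative, abridged sample mirroring the shape of the output above.
sample = [
    {"kafkaoffset": None, "uuid": None,
     "source_filename": "ops-20160228_23_35_01.gz", "year": 2016},
    {"kafkaoffset": None, "uuid": None,
     "source_filename": "ops-20160228_23_35_01.gz", "year": 2016},
]
print(all_none_columns(sample))
```

A column that this check reports while Hive returns real values for it is a candidate for the incorrect read described here.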
Running the same query in Hive:
c:hive2://xyz.com:100> SELECT * FROM trmdw_prod.opsconsole_ingest where year=2016 and month=2 and day=29 limit 3;
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
| opsconsole_ingest.kafkaoffsetgeneration  | opsconsole_ingest.kafkapartition  | opsconsole_ingest.kafkaoffset  | opsconsole_ingest.uuid            | opsconsole_ingest.mid                 | opsconsole_ingest.iid                 | opsconsole_ingest.product  | opsconsole_ingest.utctime  | opsconsole_ingest.statcode  | opsconsole_ingest.statvalue  | opsconsole_ingest.displayname  | opsconsole_ingest.category  | opsconsole_ingest.source_filename  | opsconsole_ingest.year  | opsconsole_ingest.month  | opsconsole_ingest.day  |
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
| 11.0                                     | 0.0                               | 3.83399394E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 8                           | 3.0 SP11 (8.110.7601.18923)  | MSXML 3.0 Version              | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
| 11.0                                     | 0.0                               | 3.83399395E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 2                           | GenuineIntel                 | CPU Vendor                     | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
| 11.0                                     | 0.0                               | 3.83399396E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 141                         | 4                            | Screens                        | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
3 rows selected (1.252 seconds)
The attached logs show that Spark generates no errors or warnings.
The table definition is also attached.
> HiveContext reads avro Hive tables incorrectly
> -----------------------------------------------
>
> Key: SPARK-13572
> URL: https://issues.apache.org/jira/browse/SPARK-13572
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.2
> Environment: Hive 0.13.1, Spark 1.5.2
> Reporter: Zoltan Fedor
> Attachments: logs, table_definition