[ 
https://issues.apache.org/jira/browse/SPARK-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Fedor updated SPARK-13572:
---------------------------------
    Description: 
I am using PySpark to read Avro-based tables from Hive. The tables can be 
read, but some of the columns are read incorrectly: they show the value 
"None" instead of the actual value.

>>> results_df = sqlContext.sql("""SELECT * FROM trmdw_prod.opsconsole_ingest 
... where year=2016 and month=2 and day=29""")
>>> results_df.take(3)
[Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, 
uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, 
statvalue=None, displayname=None, category=None, 
source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
 Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, 
uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, 
statvalue=None, displayname=None, category=None, 
source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
 Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, 
uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, 
statvalue=None, displayname=None, category=None, 
source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29)]

Note the "None" values in most of the fields; only source_filename, year, 
month and day are populated.
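
A minimal sketch for listing the affected columns programmatically rather than by eye. The helper name all_none_columns is hypothetical, and it works on plain dicts, assuming each Row has first been converted with Row.asDict(). The sample data is a hand-copied subset of the fields shown above, not the full table.

```python
def all_none_columns(rows):
    """Return the sorted names of columns that are None in every row.

    `rows` is a list of dicts, one per record.
    """
    if not rows:
        return []
    # Collect every column name seen, then keep those that are always None.
    names = set()
    for row in rows:
        names.update(row.keys())
    return sorted(name for name in names
                  if all(row.get(name) is None for row in rows))


# Hand-copied subset of the take(3) output above (illustrative only).
sample = [
    {"kafkaoffset": None, "uuid": None,
     "source_filename": "ops-20160228_23_35_01.gz", "year": 2016},
    {"kafkaoffset": None, "uuid": None,
     "source_filename": "ops-20160228_23_35_01.gz", "year": 2016},
]
print(all_none_columns(sample))
```

Against the real DataFrame this would be called as all_none_columns([r.asDict() for r in results_df.take(3)]).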

Running the same query in Hive:
c:hive2://xyz.com:100> SELECT * FROM trmdw_prod.opsconsole_ingest where year=2016 and month=2 and day=29 limit 3;
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
| opsconsole_ingest.kafkaoffsetgeneration  | opsconsole_ingest.kafkapartition  | opsconsole_ingest.kafkaoffset  |      opsconsole_ingest.uuid       |         opsconsole_ingest.mid         |         opsconsole_ingest.iid         | opsconsole_ingest.product  | opsconsole_ingest.utctime  | opsconsole_ingest.statcode  | opsconsole_ingest.statvalue  | opsconsole_ingest.displayname  | opsconsole_ingest.category  | opsconsole_ingest.source_filename  | opsconsole_ingest.year  | opsconsole_ingest.month  | opsconsole_ingest.day  |
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
| 11.0                                     | 0.0                               | 3.83399394E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 8                           | 3.0 SP11 (8.110.7601.18923)  | MSXML 3.0 Version              | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
| 11.0                                     | 0.0                               | 3.83399395E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 2                           | GenuineIntel                 | CPU Vendor                     | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
| 11.0                                     | 0.0                               | 3.83399396E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 141                         | 4                            | Screens                        | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
3 rows selected (1.252 seconds)
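
A rough sketch of how the two results could be compared field by field once both are available as dicts. The helper name differing_fields is hypothetical, and the values are hand-copied from the first row of each output above, limited to a few columns; the Python types chosen for the Hive values are an assumption, since the table definition is only attached.

```python
def differing_fields(spark_row, hive_row):
    """Return {field: (spark_value, hive_value)} for fields that disagree."""
    return {name: (spark_row.get(name), expected)
            for name, expected in hive_row.items()
            if spark_row.get(name) != expected}


# First row as returned by Spark and by Hive (subset of columns, hand-copied).
spark_row = {"kafkaoffsetgeneration": None, "product": None,
             "source_filename": "ops-20160228_23_35_01.gz", "year": 2016}
hive_row = {"kafkaoffsetgeneration": 11.0, "product": "est",
            "source_filename": "ops-20160228_23_35_01.gz", "year": 2016}

for field, (got, expected) in sorted(differing_fields(spark_row, hive_row).items()):
    print("%s: Spark=%r Hive=%r" % (field, got, expected))
```

Only the fields where Spark and Hive disagree are reported, so matching columns such as source_filename and year are left out.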

The attached logs show that Spark generates no errors or warnings.
The table definition is also attached.


  was:
I am using PySpark to read avro-based tables from Hive and while the avro 
tables can be read, some of the columns are incorrectly read - showing value 
"None" instead of the actual value.




> HiveContext reads avro Hive tables incorrectly 
> -----------------------------------------------
>
>                 Key: SPARK-13572
>                 URL: https://issues.apache.org/jira/browse/SPARK-13572
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.5.2
>         Environment: Hive 0.13.1, Spark 1.5.2
>            Reporter: Zoltan Fedor
>         Attachments: logs, table_definition
>
>


