Hi,
I have a few questions about Hive and whether it fits our use case.
1. Hive on Hadoop 0.20 accessing/processing data stored on a Hadoop 0.18 DFS
The actual files are on the Hadoop 0.18 DFS, and I would create an external
table in Hive (running on Hadoop 0.20) whose location points to those files.
I don't think this is possible, given the incompatibility between the Hadoop
versions, but it never hurts to ask.
2. We download tons of URLs and massage the data. The massaging goes through
various stages, and we would like to monitor these stages,
so I was thinking of a schema like the following:
One table:
url is a STRING
massage_step1 is a STRUCT
massage_step2 is a STRUCT
.
.
feature_set is ARRAY<STRING>
Each STRUCT can hold arrays of longs, ids, timestamps, success/failure,
and reasons.
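
A minimal sketch of what such a table might look like in HiveQL. The table
name, the struct field names and types, and the LOCATION path are all
placeholders (assumptions, not part of the description above), and the
fields are kept scalar for simplicity. The row format or SerDe would depend
on how the files are actually laid out, so it is left out here.

```sql
-- Hypothetical DDL: every name, type, and path below is a placeholder.
CREATE EXTERNAL TABLE url_massage (
  url           STRING,
  -- one STRUCT per massage stage; field types are assumed
  massage_step1 STRUCT<success: BOOLEAN, reasons: STRING, ids: BIGINT, ts: STRING>,
  massage_step2 STRUCT<success: BOOLEAN, reasons: STRING, ids: BIGINT, ts: STRING>,
  feature_set   ARRAY<STRING>
)
LOCATION '/path/to/massaged/urls';  -- placeholder path to the existing files
```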
Assuming that I am on the right track here,
will I be able to run queries like these:
q1. where massage_step1.reasons like '%Failed on fetching%'
q2. where feature_set like 'shopping'
(feature_set is an array, so I think I would have to implement a
LIKE-style UDF for arrays)
q3. where massage_step2.ids < 10K
q4. select count(*) as count where timestamps < 'SOME_DATE'
group by massage_step1.success
In short, can I query data inside complex types such as STRUCT, ARRAY,
MAP, etc.?
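
For concreteness, q1 to q4 might be written roughly as follows in HiveQL.
This is only a sketch: it assumes a table named url_massage whose STRUCT
columns have scalar fields named success, reasons, ids, and ts (all
hypothetical names), that 10K means 10000, and that the Hive release in use
ships the array_contains built-in. 'SOME_DATE' is left as the placeholder it
is in the question.

```sql
-- q1: substring match on a STRUCT field
SELECT url FROM url_massage
WHERE massage_step1.reasons LIKE '%Failed on fetching%';

-- q2: membership test on an ARRAY column
-- (array_contains is a Hive built-in; availability depends on the release)
SELECT url FROM url_massage
WHERE array_contains(feature_set, 'shopping');

-- q3: numeric comparison on a STRUCT field (assumes ids is a scalar BIGINT)
SELECT url FROM url_massage
WHERE massage_step2.ids < 10000;

-- q4: count rows before a date, grouped by the success flag
SELECT massage_step1.success, count(*) AS cnt
FROM url_massage
WHERE massage_step1.ts < 'SOME_DATE'
GROUP BY massage_step1.success;
```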
3. Some of the queries will require data from two or more structs and some
won't. In the above example, I am keeping everything in one table (an
external table). The other option is multiple tables: one for each
massage_step.
With multiple tables, I will have to run JOIN queries; with a single table,
I will filter the data using a WHERE clause.
Which is more expensive: JOIN queries or filtering with a WHERE clause?
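
To make the comparison concrete, the two options might look roughly like
this for a query that needs fields from two stages. All table and field
names are hypothetical, and the per-step tables are assumed to be keyed by
url.

```sql
-- Option A: single wide table, filtering only
SELECT url
FROM url_massage
WHERE massage_step1.success = false
  AND massage_step2.ids < 10000;

-- Option B: one table per step (hypothetical step1 / step2), joined on url
SELECT s1.url
FROM step1 s1
JOIN step2 s2 ON (s1.url = s2.url)
WHERE s1.success = false
  AND s2.ids < 10000;
```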
Any feedback is greatly appreciated.
Thanks,
Sagar