That's a hard one. We can wish whatever we want to - but I guess it's all a 
question of who has the resources to contribute to it and what they want from 
Hive.

I can speak a little bit about Facebook. The reason we invested in indexing was 
not that it was the primary usage (or even a bottleneck for, say, performance 
optimization) - but because once you have so much data in one place, chances 
are that someone will come along and want quick lookups over some part of it 
(and you don't want to kill your cluster by doing scans all the time). So that 
definitely makes indexing useful. We are also seeing that with dimensional 
analysis, where there is a need to drill down into detailed data, 
multidimensional indexes can be very useful. So in the long term, I think this 
is one of the desired features.

That doesn't make it akin to HBase though (in the sense that we still wouldn't 
have row-level updates or real-time index updates). Katta may be complementary, 
and we were actually interested in investigating it for indexing (instead of 
rolling things from scratch).

Columnar organization is also very interesting. With all the hooks in Hadoop 
(InputFormats) and Hive (SerDes), I think it's fairly tractable to do this.
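
To make the two ideas above concrete (quick lookups without a full scan, and 
columnar organization), here is a minimal Python sketch. It is not Hive code 
and uses no Hive or Hadoop API: the sample rows and the function names 
(scan_lookup, build_index, index_lookup, to_columns) are hypothetical and exist 
only for illustration.

```python
from collections import defaultdict

# Hypothetical sample data standing in for rows of a flat file.
rows = [
    {"user_id": 1, "country": "US", "clicks": 3},
    {"user_id": 2, "country": "CZ", "clicks": 5},
    {"user_id": 3, "country": "US", "clicks": 7},
]

def scan_lookup(rows, column, value):
    """Full scan: touch every row and keep the ones that match."""
    return [r for r in rows if r[column] == value]

def build_index(rows, column):
    """Build a simple index: column value -> list of row positions."""
    index = defaultdict(list)
    for pos, r in enumerate(rows):
        index[r[column]].append(pos)
    return index

def index_lookup(rows, index, value):
    """Indexed lookup: read only the rows the index points at."""
    return [rows[pos] for pos in index.get(value, [])]

def to_columns(rows):
    """Columnar layout: one list per column instead of one dict per row."""
    columns = defaultdict(list)
    for r in rows:
        for name, value in r.items():
            columns[name].append(value)
    return dict(columns)

country_index = build_index(rows, "country")
us_rows = index_lookup(rows, country_index, "US")

# An aggregate over one column reads only that column's values.
total_clicks = sum(to_columns(rows)["clicks"])
```

The index costs an extra pass to build and must be rebuilt when the data 
changes, which is the trade-off discussed above: it pays off only when lookups 
over part of the data are frequent enough to justify it.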

________________________________
From: Josh Ferguson [mailto:[email protected]]
Sent: Sunday, December 14, 2008 1:20 PM
To: [email protected]
Subject: Re: OLAP with Hive

I'd honestly like to see hive remain a partitioned flat file store. I don't 
think indexing what's inside the files is too incredibly useful in most 
situations where you'd use hive. I also think this kind of store is just the 
right fit for the hadoop and large scale analytics situation. I don't want to 
see hive go toward hbase or katta. What is the long term vision for hive?

Josh

On Dec 14, 2008, at 1:06 PM, Joydeep Sen Sarma wrote:


We have done some preliminary work with indexing, but that's not the focus 
right now and no code is available in the open source trunk for this purpose. I 
think it's fair to say that Hive is not optimized for online processing right 
now (and we are quite some way off from columnar storage).

________________________________
From: Martin Matula [mailto:[email protected]]
Sent: Sunday, December 14, 2008 6:54 AM
To: [email protected]
Subject: OLAP with Hive

Hi,
Is Hive capable of indexing the data and storing it in a way optimized for 
querying (like a columnar database: bitmap indexes, compression, etc.)?
I need to be able to get decent response times for queries (up to a few 
seconds) over huge amounts of analytical data. Is that achievable (with an 
appropriate number of machines in a cluster)? I saw that the 
serialization/deserialization of tables is pluggable. Is that the way to make 
the storage more efficient? Is there any existing implementation (either ready 
or in progress) targeted at this? Or any hints on what I may want to look at 
among the things that are currently available in Hive/Hadoop?
Thanks,
Martin
