If you assume all the rows match across all of the column files (using
null values where values don't exist), it is indeed very fast and
conceptually quite simple to do. I think ease of use could be (and
probably will be) debated for a long time. I like the idea of being
able to choose which method is best for a particular table. I guess we
can discuss it when the time comes.
Josh
On Dec 14, 2008, at 6:14 PM, Joydeep Sen Sarma wrote:
Regular joins are very expensive (Hive doesn't implement either map-
side joins or joins optimized for pre-sorted data yet). Joining
corresponding rows of multiple files together is very cheap (it
would be linear if disks are not a bottleneck). It should be better
than a merge join of multiple files (using unix join, for example),
since we would just read the head of each file and concatenate.
Ease of use is the other factor. If performance/compaction can be
transparently improved, then it's better for everyone (but views
can get us there too).
Like Zheng says, disk seeks are a bit of a wildcard (but our
cluster, for example, is not disk-IOPS bound, and with good buffering/
prefetching one should be able to hide these costs to a large
extent). And clearly things start falling apart with a high enough
number of columns.
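A minimal sketch of this lockstep concatenation (not Hive code; file names and formats here are illustrative assumptions): because line i of every column file holds the values for row i, rebuilding rows is a single linear pass with no sort and no join key.

```python
# Sketch: stitching rows back together from per-column files by reading
# them in lockstep. Since line i of every file belongs to row i, this is
# a linear merge -- no join keys, no sorting. Paths are hypothetical.

def stitch_columns(column_files):
    """Yield rows by zipping corresponding lines of each column file."""
    handles = [open(path) for path in column_files]
    try:
        for values in zip(*handles):
            # Each element of `values` is one line from one column file.
            yield tuple(v.rstrip("\n") for v in values)
    finally:
        for h in handles:
            h.close()
```

This is the "read the head of each file and concatenate" idea from above: cost is proportional to total bytes read, which is why it beats a key-based merge join when row order is already aligned.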
From: Josh Ferguson [mailto:[email protected]]
Sent: Sunday, December 14, 2008 4:26 PM
To: [email protected]
Subject: Re: OLAP with Hive
So before I was using Hive I was doing this on a normal file system
using the standard unix "join" command-line application, which
requires the two files to be sorted. It was indeed great when you
didn't have to deal with more than one or two columns of data at a
time. After that the performance started degrading pretty rapidly,
since the cost of repeating a join over M rows N times grows
quickly.
I think what you'd want is the idea of a column family or grouping
that is always stored together, so that you could keep the columns
you use regularly in one place. The ones not used regularly could be
joined in later if a query called for them. Also, if you created
identifiers (by way of file name, perhaps) on the column family and
not the individual columns, you would only have to join per family,
not per column. So this is sort of the middle ground between having
it all in one file and all in separate files.
Of course you can sort of already do this by just having separate
tables and using the built-in join functionality, so you could
implement this with implicit table names like tablename_familyname
or even by just adding an implicit partition to the file scheme.
Then you could either make the user responsible for knowing which
partitions to select from to get the proper columns, or do some
magic in the background.
Anyways, I'm sure you knew all of this; I was just stating it for
the sake of the group. :)
Josh
On Dec 14, 2008, at 4:07 PM, Zheng Shao wrote:
For example, we can store one field (or a set of fields) of a table
in one directory.
So table (string a, string b) can be stored like this:
a/part-00000
a/part-00001
a/part-00002
...
b/part-00000
b/part-00001
b/part-00002
...
The columnar organization can give us a huge performance gain when
the table is very wide and most of the queries access only a very
small number of columns.
The drawback is that when we do need to access a lot of fields, we
need to concatenate the fields back into rows by reading multiple
files at the same time. This will increase the number of disk seeks
that we do, and some of the data needs to go through the network.
There was a discussion about whether we want to co-locate the file
system blocks of the different fields of the same rows, but it's not
very clear what the best way to do that is.
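The directory-per-column layout above can be modeled in a few lines (a local-filesystem toy, not HDFS; one part file per column for simplicity). The point it illustrates: reading one column touches only that column's directory, which is where the win comes from for wide tables.

```python
# Toy model of the layout described above: one directory per column,
# each holding part files. Local files stand in for HDFS here.
import os

def write_columnar(root, table):
    """table: dict mapping column name -> list of string values."""
    for col, values in table.items():
        os.makedirs(os.path.join(root, col), exist_ok=True)
        with open(os.path.join(root, col, "part-00000"), "w") as f:
            f.write("\n".join(values))

def read_column(root, col):
    """Scan only one column's directory, in part-file order."""
    col_dir = os.path.join(root, col)
    out = []
    for part in sorted(os.listdir(col_dir)):
        with open(os.path.join(col_dir, part)) as f:
            out.extend(f.read().splitlines())
    return out
```

Reconstructing full rows would mean calling read_column once per column and zipping the results, which is exactly the multi-file seek cost the paragraph above describes.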
Zheng
On Sun, Dec 14, 2008 at 3:51 PM, Josh Ferguson <[email protected]>
wrote:
What would columnar organization look like and what are the benefits
and drawbacks to this?
Josh
On Dec 14, 2008, at 3:13 PM, Joydeep Sen Sarma wrote:
That's a hard one. We can wish whatever we want to, but I guess
it's all a question of who has the resources to contribute to it and
what they want from Hive.
I can speak a little bit about Facebook. The reason we invested in
indexing was not that it was the primary usage (or even a bottleneck
for, say, performance optimization), but because once you have so
much data in one place, chances are that someone will come along
and want to have quick lookups over some part of it (and you don't
want to kill your cluster by doing scans all the time). So that
definitely makes indexing useful. We are also seeing that with
dimensional analysis, where there is a need to drill down into
detailed data, multidimensional indexes can be very useful. So in
the long term, I think this is one of the desired features.
That doesn't make it akin to HBase, though (in the sense that we
still wouldn't have row-level updates or real-time index updates).
Katta may be complementary, and we were actually interested in
investigating it for indexing (instead of rolling things from
scratch).
Columnar organization is also very interesting. With all the hooks
in Hadoop (InputFormats) and Hive (SerDes), I think it's fairly
tractable to do this.
From: Josh Ferguson [mailto:[email protected]]
Sent: Sunday, December 14, 2008 1:20 PM
To: [email protected]
Subject: Re: OLAP with Hive
I'd honestly like to see Hive remain a partitioned flat-file store.
I don't think indexing what's inside the files is all that useful
in most situations where you'd use Hive. I also think this kind of
store is just the right fit for Hadoop and large-scale analytics.
I don't want to see Hive go toward HBase or Katta. What is the
long-term vision for Hive?
Josh
On Dec 14, 2008, at 1:06 PM, Joydeep Sen Sarma wrote:
We have done some preliminary work with indexing, but that's not
the focus right now and no code is available in the open-source
trunk for this purpose. I think it's fair to say that Hive is not
optimized for online processing right now (and we are quite some
ways off from columnar storage).
From: Martin Matula [mailto:[email protected]]
Sent: Sunday, December 14, 2008 6:54 AM
To: [email protected]
Subject: OLAP with Hive
Hi,
Is Hive capable of indexing the data and storing it in a way
optimized for querying (like a columnar database: bitmap indexes,
compression, etc.)?
I need to be able to get decent response times for queries (up to a
few seconds) over huge amounts of analytical data. Is that
achievable (with an appropriate number of machines in a cluster)? I
saw that the serialization/deserialization of tables is pluggable.
Is that the way to make the storage more efficient? Are there any
existing implementations (either ready or in progress) targeted at
this? Or any hints on what I may want to take a look at among the
things that are currently available in Hive/Hadoop?
Thanks,
Martin
--
Yours,
Zheng