If you assume all the rows match across all of the column files (using
null values where values don't exist), it is indeed very fast and
conceptually quite simple to do. I think ease of use could be (and
probably will be) debated for a long time. I like the idea of being
able to choose which method is best for a particular table. I guess we
can discuss it when the time comes.
Josh
On Dec 14, 2008, at 6:14 PM, Joydeep Sen Sarma wrote:
Regular joins are very expensive (Hive doesn't implement either map-
side joins or joins optimized for pre-sorted data yet). Joining
corresponding rows of multiple files together is very cheap (it
would be linear if disks are not a bottleneck). It should be better
than a merge join of multiple files (using unix join, for example),
since we would just read the head of each file and concatenate.
Ease of use is the other factor. If performance/compaction can be
transparently improved, then it's better for everyone (but views
can get us there too).
Like Zheng says, disk seeks are a bit of a wildcard (but our
cluster, for example, is not disk-IOPS bound, and with good buffering/
prefetching one should be able to hide these costs to a large
extent). And clearly things start falling apart with a high enough
number of columns.
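A minimal sketch of this lockstep concatenation (not Hive code; file names and formats here are illustrative assumptions): because line i of every column file holds the values for row i, rebuilding rows is a single linear pass with no sort and no join key.

```python
# Sketch: stitching rows back together from per-column files by reading
# them in lockstep. Since line i of every file belongs to row i, this is
# a linear merge -- no join keys, no sorting. Paths are hypothetical.

def stitch_columns(column_files):
    """Yield rows by zipping corresponding lines of each column file."""
    handles = [open(path) for path in column_files]
    try:
        for values in zip(*handles):
            # Each element of `values` is one line from one column file.
            yield tuple(v.rstrip("\n") for v in values)
    finally:
        for h in handles:
            h.close()
```

This is the "read the head of each file and concatenate" idea from above: cost is proportional to total bytes read, which is why it beats a key-based merge join when row order is already aligned.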
From: Josh Ferguson [mailto:[email protected]]
Sent: Sunday, December 14, 2008 4:26 PM
To: [email protected]
Subject: Re: OLAP with Hive
So before I was using Hive I was doing this on a normal file system
using the standard unix "join" command-line application, which
requires the two files to be sorted. It was indeed great when you
didn't have to deal with more than one or two columns of data at a
time. After that the performance started degrading pretty rapidly,
since the cost of repeating a join over M rows N times grows
quickly.
I think what you'd want is the idea of a column family or grouping
that is always stored together, so that you could keep the columns
you use regularly in one place. The ones not used regularly could be
joined in later if a query called for them. Also, if you created
identifiers (by way of file name, perhaps) on the column family and
not the individual columns, you would only have to join per family,
not per column. So this is sort of the middle ground between having
it all in one file and all in separate files.
Of course you can sort of already do this by just having separate
tables and using the built-in join functionality, so you could
implement this with implicit table names like tablename_familyname
or even by just adding an implicit partition to the file scheme.
Then you could either make the user responsible for knowing which
partitions to select from to get the proper columns, or do some
magic in the background.
Anyways, I'm sure you knew all of this; I was just stating it for
the sake of the group. :)
Josh
On Dec 14, 2008, at 4:07 PM, Zheng Shao wrote:
For example, we can store one field (or a set of fields) of a table
in one directory.
So table (string a, string b) can be stored like this:
a/part-00000
a/part-00001
a/part-00002
...
b/part-00000
b/part-00001
b/part-00002
...
The columnar organization can give us a huge performance gain when
the table is very wide and most of the queries access only a very
small number of columns.
The drawback is that when we do need to access a lot of fields, we
need to concatenate the fields back into rows by reading multiple
files at the same time. This will increase the number of disk seeks
that we do, and some of the data needs to go through the network.
There was a discussion about whether we want to co-locate the file
system blocks of the different fields of the same rows, but it's not
very clear what the best way to do that is.
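The directory-per-column layout above can be modeled in a few lines (a local-filesystem toy, not HDFS; one part file per column for simplicity). The point it illustrates: reading one column touches only that column's directory, which is where the win comes from for wide tables.

```python
# Toy model of the layout described above: one directory per column,
# each holding part files. Local files stand in for HDFS here.
import os

def write_columnar(root, table):
    """table: dict mapping column name -> list of string values."""
    for col, values in table.items():
        os.makedirs(os.path.join(root, col), exist_ok=True)
        with open(os.path.join(root, col, "part-00000"), "w") as f:
            f.write("\n".join(values))

def read_column(root, col):
    """Scan only one column's directory, in part-file order."""
    col_dir = os.path.join(root, col)
    out = []
    for part in sorted(os.listdir(col_dir)):
        with open(os.path.join(col_dir, part)) as f:
            out.extend(f.read().splitlines())
    return out
```

Reconstructing full rows would mean calling read_column once per column and zipping the results, which is exactly the multi-file seek cost the paragraph above describes.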
Zheng
On Sun, Dec 14, 2008 at 3:51 PM, Josh Ferguson <[email protected]>
wrote:
What would columnar organization look like and what are the benefits
and drawbacks to this?
Josh
On Dec 14, 2008, at 3:13 PM, Joydeep Sen Sarma wrote:
That's a hard one. We can wish whatever we want to, but I guess
it's all a question of who has the resources to contribute to it and
what they want from Hive.
I can speak a little bit about Facebook. The reason we invested in
indexing was not that it was the primary usage (or even a bottleneck
for, say, performance optimization), but because once you have so
much data in one place, chances are that someone will come along
and want to have quick lookups over some part of it (and you don't
want to kill your cluster by doing scans all the time). So that
definitely makes indexing useful. We are also seeing that with
dimensional analysis, where there is a need to drill down into
detailed data, multidimensional indexes can be very useful. So in
the long term, I think this is one of the desired features.
That doesn't make it akin to HBase, though (in the sense that we
still wouldn't have row-level updates or real-time index updates).
Katta may be complementary, and we were actually interested in
investigating it for indexing (instead of rolling things from
scratch).
Columnar organization is also very interesting. With all the hooks
in Hadoop (InputFormats) and Hive (SerDes), I think it's fairly
tractable to do this.
From: Josh Ferguson [mailto:[email protected]]
Sent: Sunday, December 14, 2008 1:20 PM
To: [email protected]
Subject: Re: OLAP with Hive
I'd honestly like to see Hive remain a partitioned flat-file store.
I don't think indexing what's inside the files is all that useful
in most situations where you'd use Hive. I also think this kind of
store is just the right fit for Hadoop and large-scale analytics.
I don't want to see Hive go toward HBase or Katta. What is the
long-term vision for Hive?
Josh
On Dec 14, 2008, at 1:06 PM, Joydeep Sen Sarma wrote:
We have done some preliminary work with indexing, but that's not
the focus right now and no code is available in the open-source
trunk for this purpose. I think it's fair to say that Hive is not
optimized for online processing right now (and we are quite some
ways off from columnar storage).
From: Martin Matula [mailto:[email protected]]
Sent: Sunday, December 14, 2008 6:54 AM
To: [email protected]
Subject: OLAP with Hive
Hi,
Is Hive capable of indexing the data and storing it in a way
optimized for querying (like a columnar database: bitmap indexes,
compression, etc.)?
I need to be able to get decent response times for queries (up to a
few seconds) over huge amounts of analytical data. Is that
achievable (with an appropriate number of machines in a cluster)? I
saw that the serialization/deserialization of tables is pluggable.
Is that the way to make the storage more efficient? Are there any
existing implementations (either ready or in progress) targeted at
this? Or any hints on what I may want to take a look at among the
things that are currently available in Hive/Hadoop?
Thanks,
Martin
--
Yours,
Zheng