Re: Questions for the future work of Hive

Zheng Shao Wed, 05 Aug 2009 00:27:14 -0700

1) We have not started working on a cost-based optimizer yet. Indexing is
one of the ongoing efforts on the performance side. We are working on a
couple more, e.g. a more compact on-disk format (LazyBinarySerDe,
https://issues.apache.org/jira/browse/HIVE-640 ), which gives a nice
speed-up for queries with multiple map-reduce jobs.

2) We don't have a short-term plan for automatic multi-partition
insertion. However, there is a simple workaround if you know the
partition values (and Hive can do multiple inserts in a single
map-reduce job!). "src" can be a subquery as well:
FROM src
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-01")
  SELECT * WHERE ts = "2009-08-01"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-02")
  SELECT * WHERE ts = "2009-08-02"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-03")
  SELECT * WHERE ts = "2009-08-03"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-04")
  SELECT * WHERE ts = "2009-08-04"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-05")
  SELECT * WHERE ts = "2009-08-05"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-06")
  SELECT * WHERE ts = "2009-08-06"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-07")
  SELECT * WHERE ts = "2009-08-07";
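
If the partition values follow a regular pattern, the statement above
can be generated rather than typed by hand. Here is a minimal sketch in
Python, assuming one partition per day. The helper name multi_insert and
its parameters are made up for illustration and are not part of Hive:

```python
from datetime import date, timedelta


def multi_insert(src, tgt, pcol, tscol, start, days):
    """Build one multi-insert HiveQL statement covering `days` daily partitions.

    Hypothetical helper: it only assembles text and does not talk to Hive.
    """
    lines = [f"FROM {src}"]
    for i in range(days):
        # ISO format matches the "YYYY-MM-DD" partition values used above.
        day = (start + timedelta(days=i)).isoformat()
        lines.append(
            f'INSERT OVERWRITE TABLE {tgt} PARTITION({pcol}="{day}") '
            f'SELECT * WHERE {tscol} = "{day}"'
        )
    # A single terminating semicolon keeps all inserts in one statement.
    return "\n".join(lines) + ";"


statement = multi_insert("src", "tgt", "pcol", "ts", date(2009, 8, 1), 7)
print(statement)
```

The printed text can be saved to a file and run with "hive -f". Because
all the INSERT clauses share one FROM, the source is still scanned only
once.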

There is some ongoing work on integrating HBase tables with Hive:
https://issues.apache.org/jira/browse/HIVE-705
We won't know which storage backend is the best until we have them
done and tested, but at least HBase looks very promising for
datasets that fit in memory.

Here are the slides, which contain examples of how to add a new storage
backend (file format) to Hive:
http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook page
Hive is completely open and we hope Hive can have more storage
backends, because it's not likely that one storage backend will be the
best for all kinds of applications.

Zheng

On Wed, Aug 5, 2009 at 12:06 AM, Schubert Zhang<[email protected]> wrote:
> In the Hive paper <Hive - A Warehousing Solution Over a MapReduce
> Framework>, section 5 describes the FUTURE WORK of Hive. I want to get
> more detail on the following two points:
> (1) Hive currently has a naive rule-based optimizer with a small number of
> simple rules. We plan to build a cost-based optimizer and adaptive
> optimization techniques to come up with more efficient plans.
> Q: Is the ongoing work on "Indexing" one of these improvements?
> Q: Is there any more?
> (2) We are exploring columnar storage and more intelligent data placement to
> improve scan performance.
> Q: We found that current Hive cannot place the data in different partitions
> intelligently (we must specify the partition value in statements). Is the
> intelligent/dynamic placement of partitions one of these improvements? For
> example, we have many input files which contain many records for different
> timestamps, and we want to place each record into the proper partition
> according to the timestamp column.
> Q: Do you think Bigtable/HBase is a good columnar store which provides
> a good model of intelligent data placement?
> Schubert

-- 
Yours,
Zheng
