Er, I lost the end of my first sentence: "analytical data [that are expected
to return in seconds]"

On Sun, Dec 14, 2008 at 3:14 PM, Jeff Hammerbacher <[email protected]> wrote:

> Hey Martin,
>
> MapReduce was designed to perform serial scans over large quantities of
> data, so it's not clear that the programming paradigm will prove useful for
> low-latency queries over "huge" amounts of analytical data. You'll most likely need to
> materialize frequent and complex queries and get creative with table
> statistics and indexing to achieve your purposes.
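>
> For illustration, materializing a frequent aggregate into a summary table
> might look something like this in HiveQL (the table and column names here
> are made up):
>
>   FROM fact_events
>   INSERT OVERWRITE TABLE daily_summary
>   SELECT ds, dim, COUNT(1), SUM(metric)
>   GROUP BY ds, dim;
>
> Fast queries can then go against the (much smaller) daily_summary table
> instead of rescanning fact_events.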
>
> That being said, the Hive query language was designed to naturally
> incorporate Map and Reduce code written in any language via Hadoop
> Streaming. See the "Custom Map Reduce Scripts" section of
> http://wiki.apache.org/hadoop/Hive/HiveQL. Note that the syntax is
> evolving and may change, as indicated in
> https://issues.apache.org/jira/browse/HIVE-37.
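>
> As a rough sketch of that syntax (the script names are hypothetical, and
> per HIVE-37 the exact form may still change):
>
>   FROM (
>     FROM events
>     MAP events.user_id, events.url
>     USING 'python url_mapper.py'
>     AS (user_id, url)
>     CLUSTER BY user_id
>   ) mapped
>   INSERT OVERWRITE TABLE url_counts
>   REDUCE mapped.user_id, mapped.url
>   USING 'python url_reducer.py'
>   AS (user_id, cnt);
>
> The scripts themselves just read tab-separated fields from stdin and write
> tab-separated fields to stdout, exactly as with Hadoop Streaming.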
>
> Regards,
> Jeff
>
>
> On Sun, Dec 14, 2008 at 1:32 PM, Martin Matula <[email protected]> wrote:
>
>> So is Hive not suitable for this at all? Should I rather look for
>> something else, or maybe try to build something on top of Hadoop and HBase?
>> Or would it make any sense for me to check out the pluggable serialization
>> in Hive? I am assuming I would also need to add my own MapReduce routines,
>> so the value of using Hive would be questionable, right?
>> Thanks,
>> Martin
>>
>>
>> On Sun, Dec 14, 2008 at 10:06 PM, Joydeep Sen Sarma <[email protected]> wrote:
>>
>>> We have done some preliminary work with indexing, but that's not the
>>> focus right now, and no code is available in the open-source trunk for this
>>> purpose. I think it's fair to say that Hive is not optimized for online
>>> processing right now (and we are quite some ways off from columnar
>>> storage).
>>>
>>>
>>>  ------------------------------
>>>
>>> *From:* Martin Matula [mailto:[email protected]]
>>> *Sent:* Sunday, December 14, 2008 6:54 AM
>>> *To:* [email protected]
>>> *Subject:* OLAP with Hive
>>>
>>>
>>>
>>> Hi,
>>> Is Hive capable of indexing the data and storing them in a way optimized
>>> for querying (like a columnar database: bitmap indexes, compression, etc.)?
>>> I need to be able to get decent response times for queries (up to a few
>>> seconds) over huge amounts of analytical data. Is that achievable (with an
>>> appropriate number of machines in a cluster)? I saw that the
>>> serialization/deserialization of tables is pluggable. Is that the way to
>>> make the storage more efficient? Is there any existing implementation
>>> (either ready or in progress) targeted at this? Or any hints on what I
>>> might want to take a look at among the things currently available in
>>> Hive/Hadoop?
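>>>
>>> For concreteness, I'm imagining plugging a custom SerDe in at
>>> table-creation time, something like the following (the class name is
>>> made up):
>>>
>>>   CREATE TABLE events (user_id STRING, metric BIGINT)
>>>   ROW FORMAT SERDE 'com.example.hive.CompactSerDe'
>>>   STORED AS SEQUENCEFILE;
>>>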
>>> Thanks,
>>> Martin
>>>
>>
>>
>
