On Tue, Mar 16, 2010 at 8:36 AM, prasenjit mukherjee <prasen....@gmail.com>wrote:
> forwarding to hdfs and pig mailing-lists for responses from wider > audience. > > > ---------- Forwarded message ---------- > From: prasenjit mukherjee <prasen....@gmail.com> > Date: Tue, Mar 16, 2010 at 11:47 AM > Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE) > To: hive-user <hive-u...@hadoop.apache.org> > > > Is there a way to avoid full table scan for an arbitrary where-clause > usage ? partitioning/bucketing makes sense only when you know which > columns will be searched upon. I was wondering if there is any project > which combines the SQL-like features of HIVE and inverted-index like > search-features of LUCENE, and works on cloud. Guess I am asking for > too much :( > > I have been using oracle till now and my usage is mainly restricted to > do summation-type queries with some where clause, example being : > "Select SUM(column1) where col2='foo' AND col3='bar'". > The output is always some aggregation and where clauses can include > "<, >, =, IN". I would like to use some kind of distributed > processing to speed up the table generation, search query-time. Hive ( > and to some extent Pig ) seems to be the closest tool available to > what I am looking for. I am also exploring hbase, but not sure whether > it will be the right choice for my problem. > > Hive can definitely help in parallelizing up the search-processing. > But my main concern is whether hive does ( or plans to do ) any > storage optimization like oracle,lucene ( apart from simple > partitioning/bucketing ). It seems that all the hadoop-options ( > hive,pig,hbase) will have to do an entire table scan. > > Appreciate any suggestions/feedback.. > > -thanks, > Prasen > There is a Jira open with hive to build indexes that will optimize some types of queries. https://cwiki.apache.org/jira/browse/HIVE-417