I believe it is pretty common practice to build Lucene indexes from HDFS content (e.g. tab files), or HBase tables.
I have been exploring Lucene indexes over columns held in HBase, where the Lucene index indexes the fields but stores only the key to the HBase row. More recently I have been using Hive for running reports.

I anticipate our architecture in future looking like:

- harvesters (crawlers) / loaders to HBase
- Hive on HBase tables for ad-hoc reporting (possibly using secondary indexes somehow, so that common reports need less than a full table scan)
- Lucene for real-time search

HBase has a couple of indexing projects. One uses secondary tables, and another I see is going for a Lucene index.

We are moving off MySQL for everything with tables of 200 million+ records (Hive is really performing fantastically for building DBs offline with 200 million to 20 million joins; the results are still loaded into MySQL for now, though).

Cheers,
Tim

On Tue, Mar 16, 2010 at 3:12 PM, Edward Capriolo <[email protected]> wrote:

> On Tue, Mar 16, 2010 at 8:36 AM, prasenjit mukherjee <[email protected]> wrote:
>
>> forwarding to hdfs and pig mailing-lists for responses from wider
>> audience.
>>
>> ---------- Forwarded message ----------
>> From: prasenjit mukherjee <[email protected]>
>> Date: Tue, Mar 16, 2010 at 11:47 AM
>> Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE)
>> To: hive-user <[email protected]>
>>
>> Is there a way to avoid full table scan for an arbitrary where-clause
>> usage ? partitioning/bucketing makes sense only when you know which
>> columns will be searched upon. I was wondering if there is any project
>> which combines the SQL-like features of HIVE and inverted-index like
>> search-features of LUCENE, and works on cloud. Guess I am asking for
>> too much :(
>>
>> I have been using oracle till now and my usage is mainly restricted to
>> do summation-type queries with some where clause, example being :
>> "Select SUM(column1) where col2='foo' AND col3='bar'".
>> The output is always some aggregation and where clauses can include
>> "<, >, =, IN". I would like to use some kind of distributed
>> processing to speed up the table generation, search query-time. Hive (
>> and to some extent Pig ) seems to be the closest tool available to
>> what I am looking for. I am also exploring hbase, but not sure whether
>> it will be the right choice for my problem.
>>
>> Hive can definitely help in parallelizing up the search-processing.
>> But my main concern is whether hive does ( or plans to do ) any
>> storage optimization like oracle,lucene ( apart from simple
>> partitioning/bucketing ). It seems that all the hadoop-options (
>> hive,pig,hbase) will have to do an entire table scan.
>>
>> Appreciate any suggestions/feedback..
>>
>> -thanks,
>> Prasen
>
> There is a Jira open with hive to build indexes that will optimize some
> types of queries.
>
> https://cwiki.apache.org/jira/browse/HIVE-417
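
As a rough illustration of the "index the fields, store only the row key" idea from this thread, here is a minimal Python sketch. It is a toy under stated assumptions: a plain dict stands in for the HBase table, a dict of sets stands in for the Lucene index, and the sample rows and helper names (build_index, matching_keys) are made up for the example. None of it is a real HBase, Hive, or Lucene API. It answers the SUM(column1) where col2='foo' AND col3='bar' query from the original question by intersecting postings and fetching only the matching rows.

```python
from collections import defaultdict

# Stand-in for an HBase table: row key -> column values (hypothetical data).
rows = {
    "r1": {"column1": 10, "col2": "foo", "col3": "bar"},
    "r2": {"column1": 5, "col2": "foo", "col3": "baz"},
    "r3": {"column1": 7, "col2": "foo", "col3": "bar"},
    "r4": {"column1": 3, "col2": "qux", "col3": "bar"},
}


def build_index(table, fields):
    """Map (field, value) -> set of row keys; no column values are stored."""
    index = defaultdict(set)
    for key, columns in table.items():
        for field in fields:
            index[(field, columns[field])].add(key)
    return index


def matching_keys(index, **equals):
    """Intersect the postings of each field=value term (an AND query)."""
    postings = [index.get((f, v), set()) for f, v in equals.items()]
    return set.intersection(*postings) if postings else set()


index = build_index(rows, ["col2", "col3"])
keys = matching_keys(index, col2="foo", col3="bar")

# Only the matching rows are fetched by key; the table is never scanned.
total = sum(rows[k]["column1"] for k in keys)
print(sorted(keys), total)
```

The index here only handles equality terms. Range predicates ("<", ">") and IN would need sorted terms or a union of postings, which this sketch does not attempt.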
