We have built basic index support in CloudBase (a data warehouse on top of Hadoop - http://cloudbase.sourceforge.net/) and can share our experience here.
The index we built is like a hash index: for a given column/field value, it tries to process only those data blocks which contain that value and ignores the rest. As you know, Hadoop stores data in the form of data blocks (64MB default size). During the index creation stage (a MapReduce job), we store the distinct column/field values along with their block information (node, file name, offset etc.) in Hadoop's MapFile. A MapFile is a SequenceFile (which stores key/value pairs) that is sorted on its keys. So it is really an inverted index, where the keys are the column/field values and the posting lists are the lists of block information. A MapFile is not a very efficient persistent map, so you could instead keep this inverted index on the local file system using something like a Lucene index, but as the index grows you are consuming local space on the master node. Hence we decided to store it in HDFS as a MapFile.

When you query on the column/field that was indexed, this inverted index (the MapFile) is consulted first to find all the blocks that contain the desired value, and InputSplits are created from them. As you know, a mapper works on an InputSplit, and normally Hadoop will schedule mappers over all input splits, but in this case we have written our own InputFormat that schedules mappers only on the required input splits (a rough sketch of the idea is included below). We have measured the performance of this approach; in some cases we got a 97% improvement (we indexed on a date field and ran queries over a year's worth of log files while fetching only one or a few particular days of data).

Now some caveats:

1) If the column/field you are indexing contains a large number of distinct values, then your inverted index (the MapFile) will bloat. This file is looked up before any MapReduce job is started, which means the lookup code runs on the master node, and it can become very slow. For example, if you have query logs and you index the query column/field, the MapFile can become really huge. Using a Lucene index on the local file system can speed up the lookup.

2) It does not work for range queries like column < 10. To handle this, we also store the min and max value of the column/field found in each block; when we encounter a range query, this list is scanned to eliminate blocks whose ranges cannot match.

Another indexing approach we have thought of: to solve problem 1), where the inverted index (MapFile) becomes huge, we can use an approach we call BloomIndexing. It makes use of Bloom filters (space-efficient probabilistic data structures used to test whether an element is a member of a set). For each data block, a Bloom filter is constructed; for 64MB, 128MB or 256MB data blocks the filter will be extremely small. During the index creation stage (a map-only job, no reducer), a mapper reads the block/input split, builds a Bloom filter for the column/field on which you want to create the index, and stores it in HDFS. During the query phase, before processing an input split/data block, the corresponding Bloom filter is consulted to see whether the block contains the desired value at all. A Bloom filter never gives a false negative, so if the membership test fails you can safely skip that block, saving the time you would have spent processing every row/line in it.
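To make the MapFile/HashIndexing part more concrete, here is a very rough sketch of what the split-filtering InputFormat can look like. This is not the actual CloudBase code: the property names (cloudbase.index.path, cloudbase.index.value) and the "file:offset:length" posting-list format are just illustrative assumptions, and it uses the old mapred API (hadoop-0.19 era).

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class IndexedTextInputFormat extends TextInputFormat {

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // Hypothetical job properties naming the index MapFile and the queried value.
    String indexDir = job.get("cloudbase.index.path");
    String wanted = job.get("cloudbase.index.value");
    if (indexDir == null || wanted == null) {
      // No index applies to this query, so fall back to the normal behaviour.
      return super.getSplits(job, numSplits);
    }

    FileSystem fs = FileSystem.get(job);
    MapFile.Reader reader = new MapFile.Reader(fs, indexDir, job);
    Text postings = new Text();
    List<InputSplit> splits = new ArrayList<InputSplit>();
    try {
      // One lookup in the sorted MapFile returns the posting list for this value.
      if (reader.get(new Text(wanted), postings) != null) {
        for (String entry : postings.toString().split(",")) {
          String[] parts = entry.split(":");  // "file:offset:length" (assumed format)
          splits.add(new FileSplit(new Path(parts[0]),
                                   Long.parseLong(parts[1]),
                                   Long.parseLong(parts[2]),
                                   new String[0]));
        }
      }
    } finally {
      reader.close();
    }
    // Only the blocks that contain the value get a mapper; the rest are never scheduled.
    return splits.toArray(new InputSplit[splits.size()]);
  }
}

Since our index also stores the node information for each block, you would normally pass that as the hosts array of each FileSplit so Hadoop can keep data locality; the sketch just leaves it empty.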
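For the BloomIndexing idea, here is a similarly rough sketch using the BloomFilter class that Hadoop ships in org.apache.hadoop.util.bloom. Again, this is not the CloudBase code; the class/file names, the filter sizing and the one-filter-file-per-split layout are just assumptions you would choose yourself.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BlockBloomIndex {

  // Sized for roughly a million distinct values per block at a modest
  // false-positive rate; tune vectorSize/nbHash for your own data.
  public static BloomFilter newFilter() {
    return new BloomFilter(8 * 1024 * 1024, 6, Hash.MURMUR_HASH);
  }

  // Index-creation side: called from the mapper for every record in the split.
  public static void addValue(BloomFilter filter, String columnValue) {
    filter.add(new Key(columnValue.getBytes()));
  }

  // Written once per split, e.g. to <indexDir>/<fileName>_<splitOffset>.bloom
  // (the naming/layout here is just an assumption).
  public static void save(BloomFilter filter, Path file, Configuration conf)
      throws IOException {
    FSDataOutputStream out = FileSystem.get(conf).create(file, true);
    filter.write(out);  // BloomFilter is a Writable
    out.close();
  }

  // Query side: decide whether a block is worth scanning at all.
  public static boolean mightContain(Path file, String wanted, Configuration conf)
      throws IOException {
    BloomFilter filter = new BloomFilter();
    FSDataInputStream in = FileSystem.get(conf).open(file);
    filter.readFields(in);
    in.close();
    // No false negatives: if this returns false, the block can be skipped safely.
    return filter.membershipTest(new Key(wanted.getBytes()));
  }
}

On the query side, the membership test could be done either in a custom RecordReader/mapper right before it scans its split, or while building the splits as in the previous sketch; doing it inside the job is what keeps the master node out of the picture, at the cost of still starting a mapper per block.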
Advantages of this approach: compared to HashIndexing, the block selection logic is distributed across the MapReduce job itself (each block's filter is checked as part of the job), so the master node is not overloaded scanning a huge inverted index.

Disadvantages of this approach: you still have to schedule as many mappers as there are input splits/data blocks, and starting JVMs incurs overhead; however, since hadoop-0.19 you can use the "reuse JVM" flag to avoid some of that. You can also increase your block size to 128MB or 256MB, which gives a considerable performance improvement.

Hope this helps,
Tarandeep

On Wed, Jun 10, 2009 at 5:49 AM, kartik saxena <kartik....@gmail.com> wrote:
> Hi,
>
> I have a huge LDIF file in the order of GBs, spanning some million user
> records. I am running the example "Grep" job on that file. The search
> results have not really been up to expectations because of it being a
> basic per-line, brute-force search.
>
> I was thinking of building some indexes inside HDFS for that file, so
> that the search results could improve. What could I possibly try to
> achieve this?
>
> Secura