Possible hadoop application

Kevin Corby Fri, 21 Dec 2007 07:56:58 -0800

Hello,

I am just looking into Hadoop for a possible application and was hoping to get some feedback about whether it is a good fit and how to structure it. Basically my application works like this:

1. Documents arrive, maybe as part of a web crawl or something like that.
2. Documents are indexed for searching.

3. Documents have special fields extracted and stored, for instance all country names might be extracted as a COUNTRY field, dates as a DATE field, IP addresses as an IP field, etc.

4. Users run queries against the index to find matching documents.

5. Users run jobs that process some combination of the extracted field values and query terms for a (possibly large) number of documents to find patterns, relationships, etc.
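To make step 3 concrete, here is a rough sketch of the kind of per-document extraction I have in mind. The country list, the regular expressions and the extract_fields name are only assumptions for illustration, not how our current extractor works.

```python
import re

# Hypothetical lookup list; a real extractor would use a much larger gazetteer.
KNOWN_COUNTRIES = ["Mexico", "United States", "Canada"]

# Loose patterns for illustration only: the IP pattern does not check that
# each octet is <= 255, and only ISO-style dates (YYYY-MM-DD) are recognised.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_fields(text):
    """Return the extracted field values for one document as a dict of lists."""
    lowered = text.lower()
    return {
        # Crude case-insensitive substring match against the known names.
        "COUNTRY": [c for c in KNOWN_COUNTRIES if c.lower() in lowered],
        "DATE": DATE_RE.findall(text),
        "IP": IP_RE.findall(text),
    }


if __name__ == "__main__":
    print(extract_fields("On 2007-12-21 a host at 10.0.0.1 in Mexico was crawled."))
```

Each document would produce one such record, keyed by its document ID.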


An example of #5 might be:

Find all business-country relationships that exist in this set of document IDs where the previously extracted country name is within 20 terms of a term matching a query of business names (not previously extracted or tagged): (McDonalds OR "Burger King" OR "Taco Bell" OR "Wal Mart" ...)


The output would be something like:
McDonald's - Mexico => Documents 5, 76, 100
Wal Mart - Mexico => Documents 5, 22
Wal Mart - United States => Documents 22, 43, 100, 101
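As a rough illustration of how such a job might split into a map step and a reduce step, here is a small single-process sketch. It is not Hadoop code; the function names, the tokenizer, and the choice to re-locate the extracted country names by position are all assumptions made for the example.

```python
import re
from collections import defaultdict

WINDOW = 20  # maximum distance in terms, as in the example query


def tokenize(text):
    """Lower-case the text and split it into simple word terms."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_phrase(terms, phrase):
    """Return the start positions at which the (possibly multi-word) phrase occurs."""
    words = tokenize(phrase)
    n = len(words)
    return [i for i in range(len(terms) - n + 1) if terms[i:i + n] == words]


def map_document(doc_id, text, countries, businesses):
    """Map step: emit ((business, country), doc_id) for each pair found within WINDOW terms.

    `countries` stands for the COUNTRY values previously extracted for this document.
    """
    terms = tokenize(text)
    for business in businesses:
        b_positions = find_phrase(terms, business)
        if not b_positions:
            continue
        for country in countries:
            c_positions = find_phrase(terms, country)
            if any(abs(b - c) <= WINDOW for b in b_positions for c in c_positions):
                yield (business, country), doc_id


def reduce_pairs(pairs):
    """Reduce step: group the emitted document IDs by (business, country)."""
    grouped = defaultdict(set)
    for key, doc_id in pairs:
        grouped[key].add(doc_id)
    return {key: sorted(ids) for key, ids in grouped.items()}
```

The map step only needs one document at a time, which is the part I would hope to spread across many machines; the reduce step just collects document IDs per relationship.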

I work on an existing application that functions similarly to this. We are currently using Lucene for the search index and it functions fairly well, but it is difficult to scale #5 to a large number of users or documents and have it run in a reasonably responsive way.


It seems that Hadoop might be a nice fit for this in a few places:
1) Indexing
2) Extraction of field values
3) Running of jobs to process field values / query terms

I am especially interested in #3, but I'm not quite sure how it would work. How would the extracted values be stored for quick lookup by document ID and processing? Given that files in Hadoop are read only once written, would I be forced to have many small files as new documents are added and processed, or can the new extractions be somehow combined with the old ones on the distributed file system?

And would it be possible to use Hadoop to dig the matching query terms out of the documents, since that can also be slow?


Thanks for any feedback.

- Kevin
