Hello,

I am just looking into Hadoop for a possible application and was hoping to get some feedback about whether it is a good fit and how to structure it. Basically my application works like this:
1. Documents arrive, maybe as part of a web crawl or something like that.
2. Documents are indexed for searching.
3. Documents have special fields extracted and stored, for instance all country names might be extracted as a COUNTRY field, dates as a DATE field, IP addresses as an IP field, etc.
4. Users run queries against the index to find matching documents.
5. Users run jobs that process some combination of the extracted field values and query terms for a (possibly large) number of documents to find patterns, relationships, etc.
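
To make step 3 concrete, here is a minimal sketch of field extraction in plain Python. It is only an illustration, not how the existing application does it: the three-entry `COUNTRIES` list, the ISO-only date pattern, and the `extract_fields` name are all assumptions made up for this example.

```python
import re

# Hypothetical, tiny country list; a real system would use a full gazetteer.
COUNTRIES = ["Mexico", "United States", "Canada"]

# Assumed patterns: dotted-quad IPs and ISO-style dates only.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
COUNTRY_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in COUNTRIES) + r")\b"
)

def extract_fields(text):
    """Return a dict mapping each field name to the values found in text."""
    return {
        "COUNTRY": COUNTRY_RE.findall(text),
        "DATE": DATE_RE.findall(text),
        "IP": IP_RE.findall(text),
    }

fields = extract_fields(
    "On 2008-03-14 a server in Mexico (10.0.0.7) contacted the United States."
)
```

Each document can be processed independently of the others, which is why this step looks like a natural candidate for a map task.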

An example of #5 might be:
Find all business-country relationships in this set of document IDs where a previously extracted country name occurs within 20 terms of a term matching a query of business names (which are not previously extracted or tagged): (McDonalds OR "Burger King" OR "Taco Bell" OR "Wal Mart" ...)

The output would be something like:
McDonalds - Mexico => Documents 5, 76, 100
Wal Mart - Mexico => Documents 5, 22
Wal Mart - United States => Documents 22, 43, 100, 101
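
The job described above can be sketched as a map step per document and a reduce step that groups document IDs by pair. This is plain Python standing in for a map/reduce job, not Hadoop code. The function names, the sample documents, and the choice to measure distance between the starting positions of the two matches are assumptions for illustration only.

```python
from collections import defaultdict

WINDOW = 20  # maximum distance, in terms, between a business and a country

def find_phrases(tokens, phrases):
    """Yield (start_position, phrase) for each occurrence of a phrase."""
    lowered = [t.lower() for t in tokens]
    for phrase in phrases:
        words = phrase.lower().split()
        n = len(words)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == words:
                yield i, phrase

def map_doc(doc_id, tokens, country_positions, phrases):
    """Emit ((business, country), doc_id) for pairs within WINDOW terms.

    country_positions is a list of (position, country) tuples that the
    earlier extraction step is assumed to have stored for the document.
    """
    pairs = set()
    for b_pos, business in find_phrases(tokens, phrases):
        for c_pos, country in country_positions:
            if abs(b_pos - c_pos) <= WINDOW:
                pairs.add((business, country))
    for pair in sorted(pairs):
        yield pair, doc_id

def reduce_pairs(mapped):
    """Group document IDs by (business, country) pair."""
    grouped = defaultdict(set)
    for pair, doc_id in mapped:
        grouped[pair].add(doc_id)
    return {pair: sorted(ids) for pair, ids in grouped.items()}

# Hypothetical sample data: doc_id -> (tokens, extracted country positions)
docs = {
    5: ("McDonalds opened a store in Mexico".split(), [(5, "Mexico")]),
    22: ("Wal Mart expanded from the United States into Mexico".split(),
         [(5, "United States"), (8, "Mexico")]),
}
phrases = ["McDonalds", "Wal Mart"]

mapped = [kv for doc_id, (tokens, countries) in docs.items()
          for kv in map_doc(doc_id, tokens, countries, phrases)]
result = reduce_pairs(mapped)
```

The map step only needs one document's tokens and stored country positions at a time, so the work can be split across documents; the reduce step is a simple grouping by key.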

I work on an existing application that functions similarly to this. We are currently using Lucene for the search index and it functions fairly well, but it is difficult to scale #5 to a large number of users or documents and have it run in a reasonably responsive way.

It seems that Hadoop might be a nice fit for this in a few places:
1) Indexing
2) Extraction of field values
3) Running of jobs to process field values / query terms

I am especially interested in #3, but I'm not quite sure how it would work. How would the extracted values be stored for quick lookup by document ID and processing? Given that files in Hadoop's distributed file system are write-once (as I understand it, they can't be modified in place), would I be forced to have many small files as new documents are added and processed, or can the new extractions somehow be combined with the old ones on the distributed file system?
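
As a rough illustration of the kind of combining asked about here, the sketch below merges two batches of extraction records, each already sorted by document ID, into one sorted batch, and then looks up a document's records. In-memory lists stand in for files on the distributed file system, and the `(doc_id, field, value)` record layout and the `compact` and `lookup` names are assumptions for this example, not anything Hadoop provides.

```python
import bisect
import heapq

def compact(*sorted_batches):
    """Merge batches that are each sorted by doc_id into one sorted list."""
    return list(heapq.merge(*sorted_batches, key=lambda rec: rec[0]))

def lookup(records, doc_id):
    """Return all records for doc_id from a list sorted by doc_id.

    The key list is rebuilt on every call to keep the sketch short;
    a real store would keep a persistent index instead.
    """
    ids = [rec[0] for rec in records]
    lo = bisect.bisect_left(ids, doc_id)
    hi = bisect.bisect_right(ids, doc_id)
    return records[lo:hi]

# Hypothetical batches: an older file and a newly written small file.
old = [(5, "COUNTRY", "Mexico"), (22, "COUNTRY", "United States")]
new = [(7, "IP", "10.0.0.7"), (22, "COUNTRY", "Mexico")]

merged = compact(old, new)
doc_22 = lookup(merged, 22)
```

The idea is that small files written as new documents arrive could periodically be rewritten into one larger sorted file, rather than being updated in place.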

And would it be possible to use Hadoop to dig the matching query terms out of the documents, since that can also be slow?

Thanks for any feedback.

- Kevin
