Hello,
I am just looking into Hadoop for a possible application and was hoping
to get some feedback about whether it is a good fit and how to structure
it. Basically my application works like this:
1. Documents arrive, for example from a web crawl.
2. Documents are indexed for searching.
3. Documents have special fields extracted and stored, for instance all
country names might be extracted as a COUNTRY field, dates as a DATE
field, IP addresses as an IP field, etc.
4. Users run queries against the index to find matching documents.
5. Users run jobs that process some combination of the extracted field
values and query terms for a (possibly large) number of documents to
find patterns, relationships, etc.
An example of #5 might be:
Find all business-country relationships that exist in this set of
document IDs where the previously extracted country name is within 20
terms of a term matching a query of business names (not previously
extracted or tagged): (McDonald's OR "Burger King" OR "Taco Bell" OR
"Wal Mart" ...)
The output would be something like:
McDonald's - Mexico => Documents 5, 76, 100
Wal Mart - Mexico => Documents 5, 22
Wal Mart - United States => Documents 22, 43, 100, 101
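
To make the example concrete, here is a minimal sketch of that job written
as a map step and a reduce step in plain Python. It does not use any Hadoop
API. The function names, the input layout (a token list plus a list of
(position, country) pairs per document) and the way distance is measured
(from the first term of the business name to the position of the country
mention) are all assumptions made for illustration, not a description of
the existing application.

```python
from collections import defaultdict

WINDOW = 20  # maximum distance, in terms, between business and country


def map_document(doc_id, terms, country_positions, business_names):
    """Yield ((business, country), doc_id) for each pair found within WINDOW.

    terms: list of lowercased tokens for one document
    country_positions: (position, country) pairs from the earlier extraction
    business_names: dict of display name -> tuple of lowercased tokens
    """
    # Locate every occurrence of every business phrase in the token list.
    hits = []
    for name, phrase in business_names.items():
        n = len(phrase)
        for i in range(len(terms) - n + 1):
            if tuple(terms[i:i + n]) == phrase:
                hits.append((i, name))

    # Pair each business hit with each nearby country mention.
    found = set()
    for b_pos, name in hits:
        for c_pos, country in country_positions:
            if abs(b_pos - c_pos) <= WINDOW:
                found.add((name, country))

    for key in sorted(found):
        yield key, doc_id


def reduce_pairs(pairs):
    """Group document IDs by (business, country) key."""
    grouped = defaultdict(set)
    for key, doc_id in pairs:
        grouped[key].add(doc_id)
    return {key: sorted(ids) for key, ids in grouped.items()}


def run_job(docs, business_names):
    """Run the map step over every document, then the reduce step.

    docs: dict of doc_id -> (terms, country_positions)
    """
    pairs = []
    for doc_id, (terms, country_positions) in docs.items():
        pairs.extend(map_document(doc_id, terms, country_positions, business_names))
    return reduce_pairs(pairs)
```

The map step only needs one document at a time, so it could run in parallel
over the document set; the reduce step only needs all pairs that share a
key.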
I work on an existing application that functions similarly to this. We
are currently using Lucene for the search index and it functions fairly
well, but it is difficult to scale #5 to a large number of users or
documents and have it run in a reasonably responsive way.
It seems that Hadoop might be a nice fit for this in a few places:
1) Indexing
2) Extraction of field values
3) Running of jobs to process field values / query terms
I am especially interested in #3, but I'm not quite sure how it would
work. How would the extracted values be stored so that they can be looked
up quickly by document ID and then processed? Given that files in Hadoop's
distributed file system are write-once, would I be forced to keep many
small files as new documents are added and processed, or can the new
extractions somehow be combined with the old ones on the distributed
file system?
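
To illustrate what "combined" could mean here, the following is a minimal
sketch of merging several small batches of extractions into one larger
batch, all sorted by document ID. It is plain Python over in-memory lists,
not Hadoop code. The record layout (doc_id, fields) and the rule that a
later batch overrides an earlier one for the same document ID are
assumptions made for illustration.

```python
import heapq


def merge_runs(runs):
    """Merge batches of (doc_id, fields) records into one sorted stream.

    Each batch in `runs` must already be sorted by doc_id. Batches are
    listed oldest first; when a doc_id appears in more than one batch,
    the record from the newest batch is kept.
    """
    def tag(run, index):
        # Attach the batch index so ties on doc_id are ordered oldest first.
        for doc_id, fields in run:
            yield doc_id, index, fields

    tagged = [tag(run, i) for i, run in enumerate(runs)]
    merged = heapq.merge(*tagged, key=lambda rec: (rec[0], rec[1]))

    current_id, current_fields = None, None
    for doc_id, _, fields in merged:
        if current_id is not None and doc_id != current_id:
            yield current_id, current_fields
        # A later record for the same doc_id replaces the earlier one.
        current_id, current_fields = doc_id, fields
    if current_id is not None:
        yield current_id, current_fields


# Example: an older batch and a newer batch that re-extracts document 22.
old_batch = [(5, {"COUNTRY": ["Mexico"]}), (22, {"COUNTRY": ["Mexico"]})]
new_batch = [(22, {"COUNTRY": ["United States"]}), (43, {"COUNTRY": ["United States"]})]
combined = list(merge_runs([old_batch, new_batch]))
```

Because every input is read sequentially and the output is written once,
a merge like this never needs to modify an existing file in place.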
And would it be possible to use Hadoop to dig the matching query terms
out of the documents, since that can also be slow?
Thanks for any feedback.
- Kevin