You might want to look at CouchDB for this. It is stronger on the
query side of things right now and has a similar model.
--
Toby DiPasquale
Software Assassin
On Dec 21, 2007, at 10:50, Kevin Corby <[EMAIL PROTECTED]> wrote:
Hello,
I am just looking into Hadoop for a possible application and was
hoping to get some feedback about whether it is a good fit and how
to structure it. Basically my application works like this:
1. Documents arrive, maybe as part of a web crawl or something like
that.
2. Documents are indexed for searching.
3. Documents have special fields extracted and stored, for instance
all country names might be extracted as a COUNTRY field, dates as a
DATE field, IP addresses as an IP field, etc.
4. Users run queries against the index to find matching documents.
5. Users run jobs that process some combination of the extracted
field values and query terms for a (possibly large) number of
documents to find patterns, relationships, etc.
An example of #5 might be:
Find all business-country relationships in a given set of document
IDs, where a previously extracted country name occurs within 20
terms of a term matching a query of business names (which were not
previously extracted or tagged): (McDonalds OR "Burger King" OR
"Taco Bell" OR "Wal Mart" ...)
The output would be something like:
McDonald's - Mexico => Documents 5, 76, 100
Wal Mart - Mexico => Documents 5, 22
Wal Mart - United States => Documents 22, 43, 100, 101
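To make that concrete, here is a rough sketch of how I imagine #5
looking as a map/reduce pass with the org.apache.hadoop.mapred API.
All class names here are made up, the name lists are hard-coded
stand-ins, and multi-word phrase matching is left out:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class BusinessCountryJob {

  // Assumes input is a SequenceFile of (docId, document text) pairs.
  // The name lists are hard-coded stand-ins; multi-word phrases like
  // "Burger King" would need real phrase matching and are skipped here.
  public static class Map extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    private static final Set<String> BUSINESSES =
        new HashSet<String>(Arrays.asList("mcdonalds", "walmart"));
    private static final Set<String> COUNTRIES =
        new HashSet<String>(Arrays.asList("mexico", "canada"));
    private static final int WINDOW = 20;

    public void map(Text docId, Text doc,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] terms = doc.toString().toLowerCase().split("\\W+");
      for (int i = 0; i < terms.length; i++) {
        if (!BUSINESSES.contains(terms[i])) continue;
        // Look for a previously extracted country within 20 terms.
        int lo = Math.max(0, i - WINDOW);
        int hi = Math.min(terms.length - 1, i + WINDOW);
        for (int j = lo; j <= hi; j++) {
          if (COUNTRIES.contains(terms[j])) {
            out.collect(new Text(terms[i] + " - " + terms[j]), docId);
          }
        }
      }
    }
  }

  // Collects the distinct document IDs for each business-country pair,
  // giving one line per pair like "mcdonalds - mexico   5, 76, 100".
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text pair, Iterator<Text> docIds,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Set<String> ids = new LinkedHashSet<String>();
      while (docIds.hasNext()) {
        ids.add(docIds.next().toString());
      }
      StringBuilder joined = new StringBuilder();
      for (String id : ids) {
        if (joined.length() > 0) joined.append(", ");
        joined.append(id);
      }
      out.collect(pair, new Text(joined.toString()));
    }
  }
}

Each reducer output line would correspond to one of the rows above.
Is that roughly the right shape for such a job?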
I work on an existing application that works much like this. We
currently use Lucene for the search index, and it performs fairly
well, but it is difficult to scale #5 to a large number of users or
documents while keeping it reasonably responsive.
It seems that Hadoop might be a nice fit for this in a few places:
1) Indexing
2) Extraction of field values
3) Running of jobs to process field values / query terms
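For #2, I picture a map-only pass over (docId, text) pairs, something
like this sketch, with toy regexes standing in for real extraction:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Map-only extraction pass over (docId, text) pairs. The patterns are
// toy stand-ins; real COUNTRY extraction would use a gazetteer or NER.
public class FieldExtractMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private static final Pattern IP =
      Pattern.compile("\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b");
  private static final Pattern DATE =
      Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");
  private static final Pattern COUNTRY =
      Pattern.compile("\\b(?:Mexico|Canada|United States)\\b");

  public void map(Text docId, Text doc,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String text = doc.toString();
    emitAll(docId, "IP", IP.matcher(text), out);
    emitAll(docId, "DATE", DATE.matcher(text), out);
    emitAll(docId, "COUNTRY", COUNTRY.matcher(text), out);
  }

  private void emitAll(Text docId, String field, Matcher m,
                       OutputCollector<Text, Text> out) throws IOException {
    while (m.find()) {
      // One record per hit, e.g. (doc-5, "COUNTRY=Mexico").
      out.collect(docId, new Text(field + "=" + m.group()));
    }
  }
}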
I am especially interested in #3, but I'm not quite sure how it
would work. How would the extracted values be stored for quick
lookup by document ID and for processing? Given that files on HDFS
are write-once, would I be forced into many small files as new
documents are added and processed, or can new extractions somehow
be combined with the old ones on the distributed file system?
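For instance, if the answer is something like a MapFile keyed by
document ID, I imagine lookups would go roughly like this (the path
and the Text-valued format here are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Random access into extractions stored as a MapFile on HDFS, keyed
// by document ID. Path and value format are hypothetical.
public class FieldLookup {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // A MapFile is a sorted SequenceFile plus an index, so get() is a
    // seek into the data file rather than a full scan.
    MapFile.Reader reader =
        new MapFile.Reader(fs, "/extractions/current", conf);
    Text fields = new Text();
    if (reader.get(new Text("doc-5"), fields) != null) {
      System.out.println("doc-5: " + fields);
    }
    reader.close();
  }
}

My guess is that new extractions would then be folded in by
periodically merging the small delta files into a fresh MapFile and
swapping it in, but I would like to confirm that.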
And would it be possible to use Hadoop to dig the matching query
terms out of the documents themselves, since that can also be slow?
Thanks for any feedback.
- Kevin