You might want to look at CouchDB for this. It is stronger on the
query side of things right now and has a similar model.
--
Toby DiPasquale
Software Assassin
On Dec 21, 2007, at 10:50, Kevin Corby <[EMAIL PROTECTED]> wrote:
Hello,
I am just looking into Hadoop for a possible application and was
hoping to get some feedback about whether it is a good fit and how
to structure it. Basically my application works like this:
1. Documents arrive, maybe as part of a web crawl or something like
that.
2. Documents are indexed for searching.
3. Documents have special fields extracted and stored, for instance
all country names might be extracted as a COUNTRY field, dates as a
DATE field, IP addresses as an IP field, etc.
4. Users run queries against the index to find matching documents.
5. Users run jobs that process some combination of the extracted
field values and query terms for a (possibly large) number of
documents to find patterns, relationships, etc.
An example of #5 might be:
Find all business-country relationships in a given set of document
IDs, where a previously extracted country name occurs within 20
terms of a term matching a query of business names (which were not
previously extracted or tagged): (McDonalds OR "Burger King" OR
"Taco Bell" OR "Wal Mart" ...)
The output would be something like:
McDonald's - Mexico => Documents 5, 76, 100
Wal Mart - Mexico => Documents 5, 22
Wal Mart - United States => Documents 22, 43, 100, 101
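To make that concrete, here is a rough sketch of how I imagine #5
looking as a map/reduce pass with the org.apache.hadoop.mapred API.
All class names here are made up, the name lists are hard-coded
stand-ins, and multi-word phrase matching is left out:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class BusinessCountryJob {

  // Assumes input is a SequenceFile of (docId, document text) pairs.
  // The name lists are hard-coded stand-ins; multi-word phrases like
  // "Burger King" would need real phrase matching and are skipped here.
  public static class Map extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    private static final Set<String> BUSINESSES =
        new HashSet<String>(Arrays.asList("mcdonalds", "walmart"));
    private static final Set<String> COUNTRIES =
        new HashSet<String>(Arrays.asList("mexico", "canada"));
    private static final int WINDOW = 20;

    public void map(Text docId, Text doc,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] terms = doc.toString().toLowerCase().split("\\W+");
      for (int i = 0; i < terms.length; i++) {
        if (!BUSINESSES.contains(terms[i])) continue;
        // Look for a previously extracted country within 20 terms.
        int lo = Math.max(0, i - WINDOW);
        int hi = Math.min(terms.length - 1, i + WINDOW);
        for (int j = lo; j <= hi; j++) {
          if (COUNTRIES.contains(terms[j])) {
            out.collect(new Text(terms[i] + " - " + terms[j]), docId);
          }
        }
      }
    }
  }

  // Collects the distinct document IDs for each business-country pair,
  // giving one line per pair like "mcdonalds - mexico   5, 76, 100".
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text pair, Iterator<Text> docIds,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Set<String> ids = new LinkedHashSet<String>();
      while (docIds.hasNext()) {
        ids.add(docIds.next().toString());
      }
      StringBuilder joined = new StringBuilder();
      for (String id : ids) {
        if (joined.length() > 0) joined.append(", ");
        joined.append(id);
      }
      out.collect(pair, new Text(joined.toString()));
    }
  }
}

Each reducer output line would correspond to one of the rows above.
Is that roughly the right shape for such a job?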
I work on an existing application that works much like this. We
currently use Lucene for the search index, and it performs fairly
well, but it is difficult to scale #5 to a large number of users or
documents while keeping it reasonably responsive.
It seems that Hadoop might be a nice fit for this in a few places:
1) Indexing
2) Extraction of field values
3) Running of jobs to process field values / query terms
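For #2, I picture a map-only pass over (docId, text) pairs, something
like this sketch, with toy regexes standing in for real extraction:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Map-only extraction pass over (docId, text) pairs. The patterns are
// toy stand-ins; real COUNTRY extraction would use a gazetteer or NER.
public class FieldExtractMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private static final Pattern IP =
      Pattern.compile("\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b");
  private static final Pattern DATE =
      Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");
  private static final Pattern COUNTRY =
      Pattern.compile("\\b(?:Mexico|Canada|United States)\\b");

  public void map(Text docId, Text doc,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String text = doc.toString();
    emitAll(docId, "IP", IP.matcher(text), out);
    emitAll(docId, "DATE", DATE.matcher(text), out);
    emitAll(docId, "COUNTRY", COUNTRY.matcher(text), out);
  }

  private void emitAll(Text docId, String field, Matcher m,
                       OutputCollector<Text, Text> out) throws IOException {
    while (m.find()) {
      // One record per hit, e.g. (doc-5, "COUNTRY=Mexico").
      out.collect(docId, new Text(field + "=" + m.group()));
    }
  }
}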
I am especially interested in #3, but I'm not quite sure how it
would work. How would the extracted values be stored for quick
lookup by document ID and for processing? Given that files on HDFS
are write-once, would I be forced into many small files as new
documents are added and processed, or can new extractions somehow
be combined with the old ones on the distributed file system?
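For instance, if the answer is something like a MapFile keyed by
document ID, I imagine lookups would go roughly like this (the path
and the Text-valued format here are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Random access into extractions stored as a MapFile on HDFS, keyed
// by document ID. Path and value format are hypothetical.
public class FieldLookup {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // A MapFile is a sorted SequenceFile plus an index, so get() is a
    // seek into the data file rather than a full scan.
    MapFile.Reader reader =
        new MapFile.Reader(fs, "/extractions/current", conf);
    Text fields = new Text();
    if (reader.get(new Text("doc-5"), fields) != null) {
      System.out.println("doc-5: " + fields);
    }
    reader.close();
  }
}

My guess is that new extractions would then be folded in by
periodically merging the small delta files into a fresh MapFile and
swapping it in, but I would like to confirm that.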
And would it be possible to use Hadoop to dig the matching query
terms out of the documents themselves, since that can also be slow?
Thanks for any feedback.
- Kevin