>> ...Documents are indexed for searching.
>> query terms for ...
I would expect an inverted index to be used for your data mining application. In that case, I would recommend a survey of map/reduce (the Hadoop examples are great).

Further references: data mining, document classification/categorization, social network analysis, etc.

------------------------------
B. Regards,
Edward yoon @ NHN, corp.
Home : http://www.udanax.org

> Date: Fri, 21 Dec 2007 10:50:42 -0500
> From: [EMAIL PROTECTED]
> To: hadoop-user@lucene.apache.org
> Subject: Possible hadoop application
>
> Hello,
>
> I am just looking into Hadoop for a possible application and was hoping
> to get some feedback about whether it is a good fit and how to structure
> it. Basically my application works like this:
> 1. Documents arrive, maybe as part of a web crawl or something like that.
> 2. Documents are indexed for searching.
> 3. Documents have special fields extracted and stored, for instance all
> country names might be extracted as a COUNTRY field, dates as a DATE
> field, IP addresses as an IP field, etc.
> 4. Users run queries against the index to find matching documents.
> 5. Users run jobs that process some combination of the extracted field
> values and query terms for a (possibly large) number of documents to
> find patterns, relationships, etc.
>
> An example of #5 might be:
> Find all business-country relationships that exist in this set of
> document IDs where the previously extracted country name is within 20
> terms of a term matching a query of business names (not previously
> extracted or tagged): (McDonalds OR "Burger King" OR "Taco Bell" OR
> "Wal Mart" ...)
>
> The output would be something like:
> McDonald's - Mexico => Documents 5, 76, 100
> Wal Mart - Mexico => Documents 5, 22
> Wal Mart - United States => Documents 22, 43, 100, 101
>
> I work on an existing application that functions similarly to this.
> We are currently using Lucene for the search index and it functions fairly
> well, but it is difficult to scale #5 to a large number of users or
> documents and have it run in a reasonably responsive way.
>
> It seems that Hadoop might be a nice fit for this in a few places:
> 1) Indexing
> 2) Extraction of field values
> 3) Running of jobs to process field values / query terms
>
> I am especially interested in #3, but I'm not quite sure how it would
> work. How would the extracted values be stored for quick lookup by
> document ID and processing? Given that hadoop is read only, would I be
> forced to have many small files as new documents are added and
> processed, or can the new extractions be somehow combined with the old
> ones on the distributed file system?
>
> And would it be possible to use hadoop to dig the matching query terms
> out of the documents, since that can also be slow?
>
> Thanks for any feedback.
>
> - Kevin
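
To make the job described in #5 more concrete, here is a minimal sketch of how it could be phrased as a map step and a reduce step. It is plain Python that only imitates the map/reduce pattern in memory; it is not Hadoop API code and not anyone's actual implementation. All names (WINDOW, tokenize, phrase_positions, map_document, reduce_pairs, find_relationships) are hypothetical, the input layout (document ID mapped to its text plus its previously extracted COUNTRY values) is an assumption, and "within 20 terms" is approximated as the distance between the starting positions of the two phrases.

```python
import re
from collections import defaultdict

# Maximum distance in terms; the value 20 comes from the example in the question.
WINDOW = 20


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def phrase_positions(tokens, phrase):
    """Return every start index at which the (possibly multi-word) phrase occurs."""
    words = tokenize(phrase)
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def map_document(doc_id, text, countries, businesses):
    """Map step: emit ((business, country), doc_id) for each nearby pair.

    `countries` stands in for the COUNTRY field values that were extracted
    and stored for this document earlier (assumed input layout).
    """
    tokens = tokenize(text)
    for business in businesses:
        b_pos = phrase_positions(tokens, business)
        if not b_pos:
            continue
        for country in countries:
            c_pos = phrase_positions(tokens, country)
            # Distance is measured between phrase start positions (an approximation).
            if any(abs(b - c) <= WINDOW for b in b_pos for c in c_pos):
                yield (business, country), doc_id


def reduce_pairs(pairs):
    """Reduce step: group the emitted document IDs by (business, country) key."""
    grouped = defaultdict(set)
    for key, doc_id in pairs:
        grouped[key].add(doc_id)
    return {key: sorted(ids) for key, ids in grouped.items()}


def find_relationships(docs, businesses):
    """Run the map step over every document, then reduce the emitted pairs."""
    emitted = (
        pair
        for doc_id, (text, countries) in docs.items()
        for pair in map_document(doc_id, text, countries, businesses)
    )
    return reduce_pairs(emitted)
```

Calling find_relationships with a dict such as {5: ("Wal Mart opened new stores in Mexico", ["Mexico"])} and a list of business names returns a dict keyed by (business, country) whose values are sorted document IDs, which mirrors the "Wal Mart - Mexico => Documents 5, 22" output format above. In a real Hadoop job the map step would run per input split and the framework would do the grouping between map and reduce; this sketch does not address how the extracted fields would be stored on the distributed file system.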