I'm having a little difficulty totally understanding your requirements,
but let me take a stab.
You basically want a mapping from one or more QUESTIONS (1 to N) to a
single ANSWER?
When a new question comes in, you run an MR job that scans all existing
questions and computes some kind of similarity metric against them to try
to find existing matches, and if one is found, adds the new question to the
list of questions for that answer and returns the answer.
The first big question I have is: are you expecting this
question-matching query to be done in real time? Or is this an offline,
batch process? Remember, MapReduce is not for real-time queries. Even
a VERY simple job will always run for several seconds, if not tens of
seconds.
But it seems like you would need to scan the entire table, and run
something like a cosine similarity against every single question in it.
That's going to be a much longer running job, depending on how many
questions already exist, and certainly not real-time.
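
To make the cost concrete, here is a minimal single-machine sketch of that
full scan. It assumes a plain bag-of-words cosine similarity; the function
names and the 0.8 threshold are made up for illustration and are not part of
HBase or Hadoop. An MR job would do the same per-question comparison in its
map phase, so the work still grows with the number of stored questions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(new_question, existing_questions, threshold=0.8):
    """Compare against EVERY stored question; return (question, score) or None.

    The threshold is an arbitrary, assumed cut-off.
    """
    query = bag_of_words(new_question)
    best = None
    for stored in existing_questions:
        score = cosine_similarity(query, bag_of_words(stored))
        if score >= threshold and (best is None or score > best[1]):
            best = (stored, score)
    return best
```

Calling best_match("how do i create a table in hbase", stored_questions)
touches every entry in stored_questions, which is exactly why this is not a
real-time operation.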
As for actually storing the questions, you should create two column
families, "questions" and "answer". For each question, you insert a
column into the "questions" family. The "answer" family would always
have a single column (only a single answer, right?). Then you can very
easily query for all questions, and they will be grouped by row (I'm not
sure what your row key will be).
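
To picture that layout, here is a small in-memory model of one such table,
written as nested Python dicts (row key -> family -> qualifier -> value).
This is NOT the HBase client API, only a sketch of the shape of the data; the
qualifier names "q1", "q2" and "text" are hypothetical.

```python
from collections import defaultdict

# In-memory stand-in for the table: row key -> family -> qualifier -> value.
# Every row gets the two families "questions" and "answer".
table = defaultdict(lambda: {"questions": {}, "answer": {}})


def add_question(row_key, question):
    """Add one more column to the 'questions' family of a new or existing row."""
    family = table[row_key]["questions"]
    qualifier = "q%d" % (len(family) + 1)  # hypothetical qualifier naming
    family[qualifier] = question
    return qualifier


def set_answer(row_key, answer):
    """The 'answer' family always holds exactly one column per row."""
    table[row_key]["answer"] = {"text": answer}


def all_questions(row_key):
    """Return every question stored in the row, in insertion order."""
    return list(table[row_key]["questions"].values())
```

Adding a similar question to an existing row is then just one more column in
the "questions" family; nothing already in the row has to be rewritten.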
You didn't talk much about how you plan on doing dupe-detection of
questions, but there are some interesting ways to generate signatures
that could serve as your row keys; then you could actually do some
kind of online duplicate detection of already-answered questions.
That's beyond the scope of this mailing list, however.
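
Just as a rough illustration of the signature idea, and not a recommendation
of any particular scheme: the sketch below normalizes a question, drops a few
stop words, sorts what is left and hashes it. The stop-word list and the
choice of MD5 are assumptions made for the example. Two questions with the
same signature would land on the same row key, so the lookup becomes a single
get instead of a scan. It only catches differences in case, punctuation, word
order and stop words, not true paraphrases.

```python
import hashlib
import re

# Hypothetical stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "an", "the", "is", "do", "i", "how", "to", "in", "of"}


def question_signature(question):
    """Hash the sorted set of non-stop-word tokens into a hex string."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    significant = sorted(set(tokens) - STOP_WORDS)
    return hashlib.md5(" ".join(significant).encode("utf-8")).hexdigest()
```

With this, question_signature("How do I create a table in HBase?") and
question_signature("hbase: create table") give the same key, while a
question about dropping a table gives a different one.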
Hope that helps. If you need more help, please provide more detail.
JG
Puri, Aseem wrote:
Hello
I am working on a model in which I have to manage questions and their
answers.
I created two columns, one in which the question is stored and the other
for its answer.
Now people will ask questions, so when a new question comes in I want to
execute a MapReduce job that finds whether the same kind of question
already exists.
If the same question is asked, then with MapReduce I will find the similar
question that exists and give the asker the answer that is already stored
with it. I also want to append the new question to the similar question
that is already there in my table.
If the question is different, then I will store it in a different row, and
its answer will be given by some expert and stored.
I know Hadoop HBase has the write-once, read-many-times property, so I
can't append to it.
I have two other options.
1. Manage each new similar question with the help of timestamps.
2. When a new similar question comes in, make a new column qualifier and
store it in the same row.
Please suggest which approach I should follow, and which would also help
in my MapReduce operation, where I have to analyze the similarity of a new
question with every question that already exists. Also, if some other
approach can help me, please suggest it.
Regards
Aseem Puri