I'm having a little difficulty totally understanding your requirements, but let me take a stab.

You basically want a mapping from one or more (1 to N) QUESTIONS to a single ANSWER? When a new question comes in, you run an MR job that scans all existing questions and applies some kind of similarity metric to find existing matches. If a match is found, you add the new question to the list of questions for that answer and return the answer.

The first big question I have is: are you expecting this question-matching query to be done in real time, or is this an offline batch process? Remember, MapReduce is not for real-time queries. Even a VERY simple job will run for several seconds, if not tens of seconds.

But it seems like you would need to scan the entire table and run something like a cosine similarity against every single question in it. That's going to be a much longer-running job, depending on how many questions already exist, and certainly not real-time.
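To make the cost of that full scan concrete, here is a minimal sketch in plain Python (no Hadoop or HBase involved). The tokenizer, the bag-of-words vectors, the function names, and the 0.8 threshold are all assumptions for illustration; the original poster did not say which similarity metric he intends to use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    # a and b are Counter objects holding term frequencies.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(new_question, existing_questions, threshold=0.8):
    # Full scan: the new question is compared against every stored one,
    # so the work grows linearly with the number of existing questions.
    new_vec = Counter(tokenize(new_question))
    best, best_score = None, 0.0
    for q in existing_questions:
        score = cosine_similarity(new_vec, Counter(tokenize(q)))
        if score > best_score:
            best, best_score = q, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score
```

In a MapReduce job, the loop body would roughly correspond to the map step (score one stored question) and picking the maximum to the reduce step.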

As for actually storing the questions, you should create two column families, "questions" and "answer". For each question, you insert a new column into the "questions" family; this is your option 2, and adding a column qualifier to an existing row is an ordinary write in HBase, so nothing has to be appended in place. The "answer" family would always have a single column (only a single answer, right?). Then you can very easily query for all questions, and they will be grouped by row (I'm not sure what your row key will be).
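As a rough picture of that layout, the sketch below models a single row as a Python dict keyed by "family:qualifier". It does not use any HBase client API. The qualifier scheme (q0000, q0001, ...) and the "answer:text" column name are assumptions made for illustration only.

```python
def new_row():
    # One row per answer; keys are "family:qualifier" strings.
    return {}


def add_question(row, text):
    # Each similar question becomes a new column in the "questions" family.
    # Zero-padded qualifiers keep the columns in insertion order when
    # sorted lexicographically (hypothetical naming scheme).
    count = sum(1 for key in row if key.startswith("questions:"))
    row["questions:q%04d" % count] = text


def set_answer(row, text):
    # The "answer" family holds exactly one column.
    row["answer:text"] = text


def questions_of(row):
    # All questions for this answer, in qualifier order.
    return [v for k, v in sorted(row.items()) if k.startswith("questions:")]
```

Adding a similar question later is just another call to add_question on the same row; the existing columns are never rewritten.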

You didn't talk much about how you plan on doing dupe detection of questions, but there are some interesting ways to generate signatures that could serve as your row keys. With those, you could actually do some kind of online duplicate detection of already-answered questions. That's beyond the scope of this mailing list, however.
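One very simple way to read the signature idea, sketched in plain Python: normalize the question, drop a few stop words, sort the remaining keywords, and hash the result to get a row key. This is only an illustration and an assumption on my part, not a recommended scheme. It catches rewordings that use the same keywords and nothing more. The stop-word list, the function names, and the in-memory dict standing in for the table are all hypothetical.

```python
import hashlib
import re

# Tiny illustrative stop-word list (an assumption, not a standard one).
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "do", "does", "i", "to", "in", "of"})


def signature(question):
    # Normalize, keep the distinct keywords, sort them, and hash.
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    keywords = sorted(set(tokens) - STOP_WORDS)
    return hashlib.sha1(" ".join(keywords).encode("utf-8")).hexdigest()


def lookup_or_register(table, question):
    # A single keyed lookup instead of a full scan: the signature is the row key.
    # Returns the stored answer, or None if no expert has answered yet.
    row = table.setdefault(signature(question), {"questions": [], "answer": None})
    row["questions"].append(question)
    return row["answer"]
```

Because the lookup is a single get by row key, it could be done online, with the MapReduce similarity scan kept for offline cleanup of near-duplicates that hash differently.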

Hope that helps.  If you need more help, please provide more detail.

JG

Puri, Aseem wrote:
Hello

I am working on a model in which I have to manage questions and their
answers.
I created two columns, one in which the question is stored and the other
for its answer.

Now people will ask questions, so when a new question comes in I want to
execute a MapReduce job which finds whether the same kind of question
already exists or not.

If the same question is asked, then with MapReduce I will find the similar
question that exists and provide the answer that is already stored with
it. Also, I want to append the new question to the similar question that
is already there in my table.

If the question is different, then I will store it in a different row, and
its answer will be given by some expert and stored.
I know Hadoop HBase has the property of write once, read many times, so I
can't append to it.

I have two other options.

1.      Manage each new similar question with the help of timestamps.

2.      When a new similar question comes in, I make a new column
qualifier and store it in the same row.

Please suggest which approach I should follow, and which would also help
my MapReduce operation, where I have to analyze the similarity of the new
question with every question that already exists. Also, if some other
approach can help me, please suggest it.

Regards

Aseem Puri

