Ayan: Please read this: http://hbase.apache.org/book.html#cp

Cheers
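The linked chapter covers RegionObserver coprocessors. As a rough sketch of the server-side upsert idea discussed below — assuming the HBase 1.x coprocessor API; the merge rule itself is hypothetical and left as a comment:

    import java.io.IOException;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

    // Runs inside the region server: every Put first reads the current row
    // locally (no extra client RPC) and can then decide how to merge.
    public class UpsertObserver extends BaseRegionObserver {
      @Override
      public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                         Put put, WALEdit edit, Durability durability)
          throws IOException {
        Result existing =
            ctx.getEnvironment().getRegion().get(new Get(put.getRow()));
        if (existing.isEmpty()) {
          return; // no existing record; the Put goes through as a plain insert
        }
        for (Cell cell : existing.rawCells()) {
          // Hypothetical merge rule: compare each existing cell with the
          // incoming Put and strip values that have not changed.
        }
      }
    }

The class would then be attached to the table through its table descriptor, or globally via hbase.coprocessor.region.classes in hbase-site.xml, so the compare-and-merge happens on the region server with no extra client round trip.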
On Thu, Sep 3, 2015 at 2:13 PM, ayan guha <guha.a...@gmail.com> wrote:

Hi

Thanks for your comments. My driving point is that instead of loading the HBase data entirely, I want to do a record-by-record lookup, and that is best done in a UDF or map function. I would also have loved to do it in Spark, but we have no production Spark cluster here yet :(

@Franke: I do not have enough competency with coprocessors to visualize the solution you are suggesting, so it would be really helpful if you could shed some more light on it.

Best
Ayan

On Fri, Sep 4, 2015 at 1:44 AM, Tao Lu <taolu2...@gmail.com> wrote:

But I don't see how it works here with Phoenix or an HBase coprocessor. Remember we are joining two big data sets here: one is the big file in HDFS, the other is the records in HBase. The driving force comes from the Hadoop cluster.

On Thu, Sep 3, 2015 at 11:37 AM, Jörn Franke <jornfra...@gmail.com> wrote:

If you use Pig or Spark you increase the complexity significantly from an operations management perspective. Spark should be seen from a platform perspective, where it makes sense. If you can do it directly with HBase/Phoenix, or with an HBase coprocessor alone, then that should be preferred. Otherwise you pay more for maintenance and development.

On Thu, Sep 3, 2015 at 5:16 PM, Tao Lu <taolu2...@gmail.com> wrote:

Yes, Ayan, your approach will work.

Alternatively, use Spark and write a Scala/Java function that implements logic similar to your Pig UDF.

Both approaches look similar. Personally, I would go with the Spark solution: it will be slightly faster, and easier if you already have a Spark cluster set up on top of your Hadoop cluster in your infrastructure.

Cheers,
Tao

On Thu, Sep 3, 2015 at 1:15 AM, ayan guha <guha.a...@gmail.com> wrote:

Thanks for your info. I am planning to implement a Pig UDF to do record lookups. Kindly let me know if this is a good idea.

Best
Ayan
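A rough sketch of such a lookup UDF — assuming the HBase 1.x client API and Pig's EvalFunc; the table and column names are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Looks up one HBase row per input record; returns null if absent.
    public class HBaseLookup extends EvalFunc<String> {
      private Connection conn; // opened once per Pig task, reused per record
      private Table table;

      @Override
      public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        if (conn == null) { // lazy init on the first record
          conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
          table = conn.getTable(TableName.valueOf("customer")); // hypothetical table
        }
        String rowKey = (String) input.get(0);
        Result r = table.get(new Get(Bytes.toBytes(rowKey)));
        if (r.isEmpty()) return null; // record does not exist yet
        return Bytes.toString(r.getValue(
            Bytes.toBytes("cf"), Bytes.toBytes("value"))); // hypothetical column
      }

      @Override
      public void finish() {
        try { if (conn != null) conn.close(); }
        catch (IOException e) { /* ignore on task shutdown */ }
      }
    }

After REGISTERing the jar, it could be called as something like B = FOREACH A GENERATE id, HBaseLookup(id); — one Get per record, but only one connection per task.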
On Thu, Sep 3, 2015 at 2:55 PM, Jörn Franke <jornfra...@gmail.com> wrote:

You may check whether it makes sense to write a coprocessor that does the upsert for you, if one does not exist already. Maybe Phoenix for HBase supports this already.

Another alternative, if the records do not have a unique ID, is to put them into a text index engine such as Solr or Elasticsearch, which in this case does fast matching with relevancy scores.

You can also use Spark and Pig there. However, I am not sure Spark is suitable for these one-row lookups. The same holds for Pig.

On Wed, Sep 2, 2015 at 11:53 PM, ayan guha <guha.a...@gmail.com> wrote:

Hello group

I am trying to use Pig or Spark to achieve the following:

1. Write a batch process which will read from a file.
2. Look up HBase to see if the record exists. If so, compare the incoming values with HBase and update the fields which do not match. Else create a new record.

My questions:
1. Is this a good use case for Pig or Spark?
2. Is there any way to read HBase for each incoming record in Pig without writing MapReduce code?
3. In the case of Spark, I think we have to connect to HBase for every record. Is there any other way?
4. What is the best connector for HBase which gives this functionality?

Best
Ayan
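On question 3, the usual answer is mapPartitions: open one HBase connection per partition and do a Get per record inside it, rather than one connection per record. A sketch, assuming the Spark 1.x Java API; the paths, table name, and CSV layout are hypothetical:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class HBaseLookupJob {
      public static void main(String[] args) throws Exception {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("hbase-lookup"));
        JavaRDD<String> lines =
            sc.textFile("hdfs:///data/incoming.txt"); // hypothetical path

        JavaRDD<String> merged = lines.mapPartitions(it -> {
          // One connection per partition, not per record.
          Connection conn =
              ConnectionFactory.createConnection(HBaseConfiguration.create());
          Table table = conn.getTable(TableName.valueOf("customer")); // hypothetical
          List<String> out = new ArrayList<>();
          while (it.hasNext()) {
            String line = it.next();
            String key = line.split(",")[0]; // hypothetical CSV layout
            Result r = table.get(new Get(Bytes.toBytes(key)));
            // If the row exists, compare and keep only changed fields;
            // otherwise treat the line as a new record (merge logic elided).
            out.add(r.isEmpty() ? "NEW," + line : "EXISTS," + line);
          }
          table.close();
          conn.close();
          return out; // Spark 2.x would return out.iterator() instead
        });

        merged.saveAsTextFile("hdfs:///data/merged"); // hypothetical path
        sc.stop();
      }
    }

This keeps the per-record cost to a single Get against an already-open connection, which is the same trade-off the Pig UDF above makes per task.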