Re: Hbase Lookup

2015-09-03 Thread Tao Lu
Yes, Ayan, your approach will work.

Alternatively, use Spark and write a Scala/Java function that
implements logic similar to your Pig UDF.

Both approaches look similar.

Personally, I would go with the Spark solution; it will be slightly faster,
and easier if you already have a Spark cluster set up on top of the Hadoop
cluster in your infrastructure.
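
For what it's worth, here is a rough sketch of the Spark variant (assuming
the HBase 1.x client API; the input path, the "customer" table, and the
"cf:name" column are placeholders I made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseUpsertJob {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("hbase-upsert"));
    sc.textFile("hdfs:///data/incoming.csv")   // placeholder input path
      .foreachPartition(records -> {
        // Open one HBase connection per partition, not one per record.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer"))) {
          while (records.hasNext()) {
            String[] fields = records.next().split(",");
            byte[] row = Bytes.toBytes(fields[0]);
            Result existing = table.get(new Get(row));
            byte[] current = existing.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            // Write only if the row is missing or the field differs.
            if (current == null || !fields[1].equals(Bytes.toString(current))) {
              Put put = new Put(row);
              put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
                  Bytes.toBytes(fields[1]));
              table.put(put);
            }
          }
        }
      });
    sc.stop();
  }
}

Per-record Gets are still round trips, of course; batching each partition's
keys into a single multi-get would be the next improvement.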

Cheers,
Tao




Re: Hbase Lookup

2015-09-03 Thread Tao Lu
But I don't see how that works here with Phoenix or an HBase coprocessor.
Remember, we are joining two big data sets here: one is the big file in HDFS,
the other is the records in HBase. The driving force comes from the Hadoop
cluster.

-- 

Thanks!
Tao


Re: Hbase Lookup

2015-09-03 Thread Jörn Franke
If you use Pig or Spark, you increase the complexity significantly from an
operations management perspective. Spark should be evaluated from a platform
perspective, to see whether it makes sense there. If you can do it directly
with HBase/Phoenix, or with an HBase coprocessor alone, then that should be
preferred. Otherwise you pay more for maintenance and development.
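
To illustrate the coprocessor-only direction: a RegionObserver could drop
Puts whose value is already stored, so clients upsert blindly and the region
server skips no-op writes. This is only an untested sketch against the HBase
1.x coprocessor API, and the "cf"/"name" column is a made-up placeholder:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipUnchangedPutObserver extends BaseRegionObserver {
  private static final byte[] CF = Bytes.toBytes("cf");      // placeholder family
  private static final byte[] QUAL = Bytes.toBytes("name");  // placeholder qualifier

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, Durability durability) throws IOException {
    List<Cell> incoming = put.get(CF, QUAL);
    if (incoming.isEmpty()) {
      return;  // nothing to compare; let the put through
    }
    // Read the current value from the local region and compare.
    Result existing = ctx.getEnvironment().getRegion().get(new Get(put.getRow()));
    byte[] current = existing.getValue(CF, QUAL);
    if (current != null && Bytes.equals(current, CellUtil.cloneValue(incoming.get(0)))) {
      ctx.bypass();  // stored value is identical, skip the write
    }
  }
}

It would be attached to the table through its coprocessor attribute and run
on every region server, which is also where the extra operational burden sits.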



Re: Hbase Lookup

2015-09-03 Thread ayan guha
Hi

Thanks for your comments. My driving point is that instead of loading the
HBase data entirely, I want to do a record-by-record lookup, and that is best
done in a UDF or map function. I would also have loved to do it in Spark, but
there is no production cluster here yet :(

@Franke: I do not have enough experience with coprocessors, so I am not able
to visualize the solution you are suggesting. It would be really helpful if
you could shed some more light on it.

Best
Ayan



Re: Hbase Lookup

2015-09-03 Thread Ted Yu
Ayan:
Please read this:
http://hbase.apache.org/book.html#cp

Cheers



Re: Hbase Lookup

2015-09-02 Thread Jörn Franke
You may check whether it makes sense to write a coprocessor that does the
upsert for you, if one does not exist already. Phoenix for HBase may support
this already.
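
Phoenix's write statement is in fact UPSERT, which writes the row whether or
not it already exists. A minimal JDBC sketch (the CUSTOMER table, its columns,
and the ZooKeeper host are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PhoenixUpsertExample {
  public static void main(String[] args) throws Exception {
    // Placeholder ZooKeeper quorum; Phoenix finds the region servers from it.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
         PreparedStatement ps = conn.prepareStatement(
             "UPSERT INTO CUSTOMER (ID, NAME) VALUES (?, ?)")) {
      ps.setString(1, "row-42");
      ps.setString(2, "Ayan");
      ps.executeUpdate();
      conn.commit();  // Phoenix connections do not auto-commit by default
    }
  }
}

Note that this is insert-or-overwrite, not a field-by-field merge; the
compare-and-update-only-mismatched-fields logic would still have to live on
the client side (or in a coprocessor).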

Another alternative, if the records do not have a unique ID, is to put them
into a text index engine such as Solr or Elasticsearch, which in that case
does fast matching with relevance scores.


You can also use Spark and Pig there. However, I am not sure Spark is
suitable for these one-row lookups. The same holds for Pig.




Re: Hbase Lookup

2015-09-02 Thread ayan guha
Thanks for your info. I am planning to implement a Pig UDF to do record
lookups. Kindly let me know if this is a good idea.
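
If it helps the discussion, this is the bare-bones shape I have in mind
(only a sketch, assuming the HBase 1.x client API; the "customer" table and
"cf:name" column are invented placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class HBaseLookupUDF extends EvalFunc<Tuple> {
  private Connection connection;  // reused across calls within one task
  private Table table;

  @Override
  public Tuple exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0) {
      return null;
    }
    if (connection == null) {
      // Lazy init: one connection per map task, not per record.
      Configuration conf = HBaseConfiguration.create();
      connection = ConnectionFactory.createConnection(conf);
      table = connection.getTable(TableName.valueOf("customer"));
    }
    String key = (String) input.get(0);
    Result result = table.get(new Get(Bytes.toBytes(key)));

    Tuple out = TupleFactory.getInstance().newTuple(2);
    out.set(0, key);
    // Return the existing value (or null if the row is absent) so the script
    // can decide whether to update mismatched fields or insert a new record.
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
    out.set(1, value == null ? null : Bytes.toString(value));
    return out;
  }
}

The main trick seems to be caching the connection in a field so it is opened
once per map task rather than once per record.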

Best
Ayan



Hbase Lookup

2015-09-02 Thread ayan guha
Hello group

I am trying to use Pig or Spark to achieve the following:

1. Write a batch process that reads records from a file.
2. Look up HBase to see whether each record exists. If it does, compare the
incoming values with HBase and update the fields that do not match;
otherwise, create a new record.

My questions:
1. Is this a good use case for Pig or Spark?
2. Is there any way to read HBase for each incoming record in Pig without
writing MapReduce code?
3. In the case of Spark, I think we have to connect to HBase for every
record. Is there any other way?
4. What is the best HBase connector that provides this functionality?

Best
Ayan