Hi Rohan,
               My replies are inline.

-...@nkur

On 2/19/10 3:48 PM, "Rohan Rai" <[email protected]> wrote:

Hey Ankur

Thanks, first of all. Your work is closely aligned with my current needs, but I
need to understand the UDF a little better. I have a few questions:

1) Will every reducer have a different connection?
Ankur> Yes, every reducer will have a different connection, as each reducer
executes in a separate JVM, possibly on a separate machine.

2) Will the store be transactional across reducers?
Ankur> No, by default it can only be transactional within the same reducer.

3) Will the store be transactional over the whole dataset being pushed to the
DB (i.e., if even one store fails, everything is rolled back)?
Ankur> You should be able to achieve this by setting a large batch size
(Integer.MAX_VALUE) when initializing the UDF, so that if a single store
(mapper/reducer) fails, the entire job is failed by Hadoop and the other stores
never get a chance to call commit. If you get a chance to experiment more with
it and discover any problems, please post them on the JIRA.
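The "one big batch, single commit" idea above can be sketched as follows. This
is not the DBStorage code itself, just a small illustration (in Python, using
sqlite3 in place of a JDBC connection; the function name and failure hook are
hypothetical) of the semantics: rows are buffered inside one transaction and
committed only at the end, so a failure anywhere in the batch persists nothing.

```python
import sqlite3

def store_batch(rows, fail_at=None):
    """Insert all rows in a single transaction; commit only if every insert
    succeeds. fail_at simulates a task dying mid-batch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (k TEXT, v INTEGER)")
    conn.commit()  # the table exists regardless of what happens to the batch
    try:
        for i, row in enumerate(rows):
            if i == fail_at:  # simulate a mid-batch failure
                raise RuntimeError("task died")
            conn.execute("INSERT INTO t VALUES (?, ?)", row)
        conn.commit()  # single commit for the whole batch
    except Exception:
        conn.rollback()  # nothing from the batch is persisted
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return count

print(store_batch([("a", 1), ("b", 2)]))              # 2: all rows committed
print(store_batch([("a", 1), ("b", 2)], fail_at=1))   # 0: failure rolls back everything
```

With a batch size of Integer.MAX_VALUE, each reducer's store behaves like the
second call: any failure before the final commit leaves the table untouched.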

Regards
Rohan

Ankur C. Goel wrote:
> Zaki,
>         Thanks for the appreciation :-). I agree it is not an efficient way
> of bulk-dumping to a DB; for that you would use SQLLoader or similar. I had
> exactly the same use case as yours, which is why it was developed. You can
> either bundle the drivers with the UDF jar or keep them in a separate jar;
> either way, they need to be registered in the Pig script if the drivers are
> not already installed on your cluster and on the Hadoop classpath. Just an
> FYI, DBStorage does not use Hadoop's DBOutputFormat. Take a look at
> TestDBStorage.java for a sample use case.
>
> Hope this helps
>
> -...@nkur
>
>
> On 2/18/10 11:56 PM, "Dmitriy Ryaboy" <[email protected]> wrote:
>
> You can register it in the pig script (or, with a recent patch, even on the
> command line), and it will get shipped and put on the classpath; or you
> can prep your machines to have a local copy. For something like JDBC
> drivers I think it may be reasonable to let users decide rather than bundle
> them in by default -- shipping jars from the client to the cluster does have
> some overhead, and a lot of folks will probably have these installed on
> their hadoop nodes anyway.
>
> Just imho (and I haven't actually tried using Ankur's patch yet).
>
> On Thu, Feb 18, 2010 at 9:37 AM, zaki rahaman <[email protected]> wrote:
>
>
>> Hey,
>>
>> First off, @Ankur, great work so far on the patch. This probably is not an
>> efficient way of doing mass dumps to a DB (but why would you want to do
>> that anyway when you have HDFS?), but it hits the sweet spot for my
>> particular use case (storing aggregates to interface with a webapp). I was
>> able to apply the patch cleanly and build. I had a question about actually
>> using the DBStorage UDF, namely: where do I have to keep the JDBC driver?
>> I was wondering if it would be possible to simply bundle it in the same
>> jar as the UDF itself, but I know that Hadoop's DBOutputFormat requires a
>> local copy of the driver on each machine. Any pointers?
>>
>> --
>> Zaki Rahaman
>>
>>
>
>
>


