Re: How does PySpark send "import" to the worker when executing Python UDFs?

2022-07-19 Thread Li Jin
Aha I see. Thanks Hyukjin!

On Tue, Jul 19, 2022 at 9:09 PM Hyukjin Kwon  wrote:

> This is done by cloudpickle. It pickles the global variables referenced
> within the function along with it, and registers module globals so they
> are re-imported on the worker.
>
> On Wed, 20 Jul 2022 at 00:55, Li Jin  wrote:
>
>> Hi,
>>
>> I have a question about how "imports" get sent to the Python worker.
>>
>> For example, I have
>>
>> def foo(x):
>>     return np.abs(x)
>>
>> If I run this code directly, it obviously fails (because np is undefined
>> on the driver process):
>>
>> sc.parallelize([1, 2, 3]).map(foo).collect()
>>
>> However, if I add the import statement "import numpy as np" on the
>> driver, it works. So somehow the driver is sending those imports to the
>> worker when executing foo, but I cannot seem to find the code that does
>> this - can someone please send me a pointer?
>>
>> Thanks,
>> Li
>>
>


Re: How does PySpark send "import" to the worker when executing Python UDFs?

2022-07-19 Thread Hyukjin Kwon
This is done by cloudpickle. It pickles the global variables referenced
within the function along with it, and registers module globals so they
are re-imported on the worker.
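A minimal sketch of that behavior, outside of Spark. This assumes the third-party cloudpickle package is installed (PySpark vendors it); math stands in for numpy here just to keep the example dependency-free. A module-typed global referenced by the function is stored by name, so unpickling re-imports it in the receiving process:

```python
import pickle
import math  # stands in for numpy; any module-level global works the same way
import cloudpickle

def foo(x):
    # math is a free (global) variable of foo, not an argument
    return math.fabs(x)

# cloudpickle serializes foo together with the globals it references;
# the module global "math" is recorded by name and re-imported on load.
payload = cloudpickle.dumps(foo)

# Simulate the worker side: restore the function and call it.
restored = pickle.loads(payload)
print(restored(-2.0))  # 2.0
```

This is why the import on the driver is enough: the module reference travels with the pickled function, and the worker re-imports it when deserializing.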

On Wed, 20 Jul 2022 at 00:55, Li Jin  wrote:

> Hi,
>
> I have a question about how "imports" get sent to the Python worker.
>
> For example, I have
>
> def foo(x):
>     return np.abs(x)
>
> If I run this code directly, it obviously fails (because np is undefined
> on the driver process):
>
> sc.parallelize([1, 2, 3]).map(foo).collect()
>
> However, if I add the import statement "import numpy as np" on the driver,
> it works. So somehow the driver is sending those imports to the worker when
> executing foo, but I cannot seem to find the code that does this - can
> someone please send me a pointer?
>
> Thanks,
> Li
>


How does PySpark send "import" to the worker when executing Python UDFs?

2022-07-19 Thread Li Jin
Hi,

I have a question about how "imports" get sent to the Python worker.

For example, I have

def foo(x):
    return np.abs(x)

If I run this code directly, it obviously fails (because np is undefined
on the driver process):

sc.parallelize([1, 2, 3]).map(foo).collect()

However, if I add the import statement "import numpy as np" on the driver,
it works. So somehow the driver is sending those imports to the worker when
executing foo, but I cannot seem to find the code that does this - can
someone please send me a pointer?

Thanks,
Li
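The failing case in the question can also be reproduced outside of Spark. A minimal sketch, assuming the third-party cloudpickle package is installed: when np was never imported on the pickling side, serialization itself still succeeds (cloudpickle only captures globals that exist), and the error surfaces only as a NameError when the restored function is actually called, which is what happens on the worker:

```python
import pickle
import cloudpickle

def foo(x):
    return np.abs(x)  # np is intentionally never imported here

# Pickling still succeeds: cloudpickle captures only globals that exist,
# so the missing "np" is simply absent from the serialized function.
payload = cloudpickle.dumps(foo)

# Calling the restored function then fails, just as it would on a worker.
restored = pickle.loads(payload)
try:
    restored(1)
except NameError as err:
    print(err)  # name 'np' is not defined
```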