Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
Hi Andrew,
Do not misrepresent my statements.
I mentioned that it depends on the use case; I NEVER (note the word "never")
said that Pandas UDFs are ALWAYS (note the word "always") slow.


Regards,
Gourav Sengupta

On Mon, May 6, 2019 at 6:00 PM Andrew Melo  wrote:

> Hi,
>
> On Mon, May 6, 2019 at 11:59 AM Gourav Sengupta
>  wrote:
> >
> > Hence, what I mentioned initially does sound correct ?
>
> I don't agree at all - we've had a significant boost from moving to
> regular UDFs to pandas UDFs. YMMV, of course.
>
> >
> > On Mon, May 6, 2019 at 5:43 PM Andrew Melo 
> wrote:
> >>
> >> Hi,
> >>
> >> On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy
> >>  wrote:
> >> >
> >> > Thanks Gourav.
> >> >
> >> > Incidentally, since the regular UDF is row-wise, we could optimize
> that a bit by taking the convert() closure and simply making that the UDF.
> >> >
> >> > Since there's that MGRS object that we have to create too, we could
> probably optimize it further by applying the UDF via rdd.mapPartitions,
> which would allow the UDF to instantiate objects once per-partition instead
> of per-row and then iterate element-wise through the rows of the partition.
> >> >
> >> > All that said, having done the above on prior projects I find the
> pandas abstractions to be very elegant and friendly to the end-user so I
> haven't looked back :)
> >> >
> >> > (The common memory model via Arrow is a nice boost too!)
> >>
> >> And some tentative SPIPs that want to use columnar representations
> >> internally in Spark should also add some good performance in the
> >> future.
> >>
> >> Cheers
> >> Andrew
> >>
> >> >
> >> > On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
> >> >>
> >> >> The proof is in the pudding
> >> >>
> >> >> :)
> >> >>
> >> >>
> >> >>
> >> >> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
> >> >>>
> >> >>> Hi Patrick,
> >> >>>
> >> >>> super duper, thanks a ton for sharing the code. Can you please
> confirm that this runs faster than the regular UDF's?
> >> >>>
> >> >>> Interestingly I am also running same transformations using another
> geo spatial library in Python, where I am passing two fields and getting
> back an array.
> >> >>>
> >> >>>
> >> >>> Regards,
> >> >>> Gourav Sengupta
> >> >>>
> >> >>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <
> pmccar...@dstillery.com> wrote:
> >> 
> >>  Human time is considerably more expensive than computer time, so
> in that regard, yes :)
> >> 
> >>  This took me one minute to write and ran fast enough for my needs.
> If you're willing to provide a comparable scala implementation I'd be happy
> to compare them.
> >> 
> >>  @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
> >> 
> >>  def generate_mgrs_series(lat_lon_str, level):
> >> 
> >> 
> >>  import mgrs
> >> 
> >>  m = mgrs.MGRS()
> >> 
> >> 
> >>  precision_level = 0
> >> 
> >>  levelval = level[0]
> >> 
> >> 
> >>  if levelval == 1000:
> >> 
> >> precision_level = 2
> >> 
> >>  if levelval == 100:
> >> 
> >> precision_level = 3
> >> 
> >> 
> >>  def convert(ll_str):
> >> 
> >>    lat, lon = ll_str.split('_')
> >> 
> >> 
> >>    return m.toMGRS(lat, lon,
> >> 
> >>    MGRSPrecision = precision_level)
> >> 
> >> 
> >>  return lat_lon_str.apply(lambda x: convert(x))
> >> 
> >>  On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
> >> >
> >> > And you found the PANDAS UDF more performant ? Can you share your
> code and prove it?
> >> >
> >> > On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <
> pmccar...@dstillery.com> wrote:
> >> >>
> >> >> I disagree that it's hype. Perhaps not 1:1 with pure scala
> performance-wise, but for python-based data scientists or others with a lot
> of python expertise it allows one to do things that would otherwise be
> infeasible at scale.
> >> >>
> >> >> For instance, I recently had to convert latitude / longitude
> pairs to MGRS strings (
> https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing a
> pandas UDF (and putting the mgrs python package into a conda environment)
> was _significantly_ easier than any alternative I found.
> >> >>
> >> >> @Rishi - depending on your network is constructed, some lag
> could come from just uploading the conda environment. If you load it from
> hdfs with --archives does it improve?
> >> >>
> >> >> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
> >> >>>
> >> >>> hi,
> >> >>>
> >> >>> Pandas UDF is a bit of hype. One of their blogs shows the used
> case of adding 1 to a field using Pandas UDF which is pretty much
> pointless. So you go beyond the blog and realise that your actual 

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
Hence, what I mentioned initially does sound correct?

On Mon, May 6, 2019 at 5:43 PM Andrew Melo  wrote:

> Hi,
>
> On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy
>  wrote:
> >
> > Thanks Gourav.
> >
> > Incidentally, since the regular UDF is row-wise, we could optimize that
> a bit by taking the convert() closure and simply making that the UDF.
> >
> > Since there's that MGRS object that we have to create too, we could
> probably optimize it further by applying the UDF via rdd.mapPartitions,
> which would allow the UDF to instantiate objects once per-partition instead
> of per-row and then iterate element-wise through the rows of the partition.
> >
> > All that said, having done the above on prior projects I find the pandas
> abstractions to be very elegant and friendly to the end-user so I haven't
> looked back :)
> >
> > (The common memory model via Arrow is a nice boost too!)
>
> And some tentative SPIPs that want to use columnar representations
> internally in Spark should also add some good performance in the
> future.
>
> Cheers
> Andrew
>
> >
> > On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
> >>
> >> The proof is in the pudding
> >>
> >> :)
> >>
> >>
> >>
> >> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
> >>>
> >>> Hi Patrick,
> >>>
> >>> super duper, thanks a ton for sharing the code. Can you please confirm
> that this runs faster than the regular UDF's?
> >>>
> >>> Interestingly I am also running same transformations using another geo
> spatial library in Python, where I am passing two fields and getting back
> an array.
> >>>
> >>>
> >>> Regards,
> >>> Gourav Sengupta
> >>>
> >>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <
> pmccar...@dstillery.com> wrote:
> 
>  Human time is considerably more expensive than computer time, so in
> that regard, yes :)
> 
>  This took me one minute to write and ran fast enough for my needs. If
> you're willing to provide a comparable scala implementation I'd be happy to
> compare them.
> 
>  @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
> 
>  def generate_mgrs_series(lat_lon_str, level):
> 
> 
>  import mgrs
> 
>  m = mgrs.MGRS()
> 
> 
>  precision_level = 0
> 
>  levelval = level[0]
> 
> 
>  if levelval == 1000:
> 
> precision_level = 2
> 
>  if levelval == 100:
> 
> precision_level = 3
> 
> 
>  def convert(ll_str):
> 
>    lat, lon = ll_str.split('_')
> 
> 
>    return m.toMGRS(lat, lon,
> 
>    MGRSPrecision = precision_level)
> 
> 
>  return lat_lon_str.apply(lambda x: convert(x))
> 
>  On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
> >
> > And you found the PANDAS UDF more performant ? Can you share your
> code and prove it?
> >
> > On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <
> pmccar...@dstillery.com> wrote:
> >>
> >> I disagree that it's hype. Perhaps not 1:1 with pure scala
> performance-wise, but for python-based data scientists or others with a lot
> of python expertise it allows one to do things that would otherwise be
> infeasible at scale.
> >>
> >> For instance, I recently had to convert latitude / longitude pairs
> to MGRS strings (
> https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing a
> pandas UDF (and putting the mgrs python package into a conda environment)
> was _significantly_ easier than any alternative I found.
> >>
> >> @Rishi - depending on your network is constructed, some lag could
> come from just uploading the conda environment. If you load it from hdfs
> with --archives does it improve?
> >>
> >> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
> >>>
> >>> hi,
> >>>
> >>> Pandas UDF is a bit of hype. One of their blogs shows the used
> case of adding 1 to a field using Pandas UDF which is pretty much
> pointless. So you go beyond the blog and realise that your actual used case
> is more than adding one :) and the reality hits you
> >>>
> >>> Pandas UDF in certain scenarios is actually slow, try using apply
> for a custom or pandas function. In fact in certain scenarios I have found
> general UDF's work much faster and use much less memory. Therefore test out
> your used case (with at least 30 million records) before trying to use the
> Pandas UDF option.
> >>>
> >>> And when you start using GroupMap then you realise after reading
> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
> that "Oh!! now I can run into random OOM errors and the maxrecords options
> does not help at all"
> >>>
> >>> Excerpt from the above link:
> >>> Note that all data for a 

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
Hi,

On Mon, May 6, 2019 at 11:59 AM Gourav Sengupta
 wrote:
>
> Hence, what I mentioned initially does sound correct ?

I don't agree at all - we've had a significant boost from moving from
regular UDFs to pandas UDFs. YMMV, of course.

>
> On Mon, May 6, 2019 at 5:43 PM Andrew Melo  wrote:
>>
>> Hi,
>>
>> On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy
>>  wrote:
>> >
>> > Thanks Gourav.
>> >
>> > Incidentally, since the regular UDF is row-wise, we could optimize that a 
>> > bit by taking the convert() closure and simply making that the UDF.
>> >
>> > Since there's that MGRS object that we have to create too, we could 
>> > probably optimize it further by applying the UDF via rdd.mapPartitions, 
>> > which would allow the UDF to instantiate objects once per-partition 
>> > instead of per-row and then iterate element-wise through the rows of the 
>> > partition.
>> >
>> > All that said, having done the above on prior projects I find the pandas 
>> > abstractions to be very elegant and friendly to the end-user so I haven't 
>> > looked back :)
>> >
>> > (The common memory model via Arrow is a nice boost too!)
>>
>> And some tentative SPIPs that want to use columnar representations
>> internally in Spark should also add some good performance in the
>> future.
>>
>> Cheers
>> Andrew
>>
>> >
>> > On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta 
>> >  wrote:
>> >>
>> >> The proof is in the pudding
>> >>
>> >> :)
>> >>
>> >>
>> >>
>> >> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta 
>> >>  wrote:
>> >>>
>> >>> Hi Patrick,
>> >>>
>> >>> super duper, thanks a ton for sharing the code. Can you please confirm 
>> >>> that this runs faster than the regular UDF's?
>> >>>
>> >>> Interestingly I am also running same transformations using another geo 
>> >>> spatial library in Python, where I am passing two fields and getting 
>> >>> back an array.
>> >>>
>> >>>
>> >>> Regards,
>> >>> Gourav Sengupta
>> >>>
>> >>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy 
>> >>>  wrote:
>> 
>>  Human time is considerably more expensive than computer time, so in 
>>  that regard, yes :)
>> 
>>  This took me one minute to write and ran fast enough for my needs. If 
>>  you're willing to provide a comparable scala implementation I'd be 
>>  happy to compare them.
>> 
>>  @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
>> 
>>  def generate_mgrs_series(lat_lon_str, level):
>> 
>> 
>>  import mgrs
>> 
>>  m = mgrs.MGRS()
>> 
>> 
>>  precision_level = 0
>> 
>>  levelval = level[0]
>> 
>> 
>>  if levelval == 1000:
>> 
>> precision_level = 2
>> 
>>  if levelval == 100:
>> 
>> precision_level = 3
>> 
>> 
>>  def convert(ll_str):
>> 
>>    lat, lon = ll_str.split('_')
>> 
>> 
>>    return m.toMGRS(lat, lon,
>> 
>>    MGRSPrecision = precision_level)
>> 
>> 
>>  return lat_lon_str.apply(lambda x: convert(x))
>> 
>>  On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta 
>>   wrote:
>> >
>> > And you found the PANDAS UDF more performant ? Can you share your code 
>> > and prove it?
>> >
>> > On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy 
>> >  wrote:
>> >>
>> >> I disagree that it's hype. Perhaps not 1:1 with pure scala 
>> >> performance-wise, but for python-based data scientists or others with 
>> >> a lot of python expertise it allows one to do things that would 
>> >> otherwise be infeasible at scale.
>> >>
>> >> For instance, I recently had to convert latitude / longitude pairs to 
>> >> MGRS strings 
>> >> (https://en.wikipedia.org/wiki/Military_Grid_Reference_System). 
>> >> Writing a pandas UDF (and putting the mgrs python package into a 
>> >> conda environment) was _significantly_ easier than any alternative I 
>> >> found.
>> >>
>> >> @Rishi - depending on your network is constructed, some lag could 
>> >> come from just uploading the conda environment. If you load it from 
>> >> hdfs with --archives does it improve?
>> >>
>> >> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta 
>> >>  wrote:
>> >>>
>> >>> hi,
>> >>>
>> >>> Pandas UDF is a bit of hype. One of their blogs shows the used case 
>> >>> of adding 1 to a field using Pandas UDF which is pretty much 
>> >>> pointless. So you go beyond the blog and realise that your actual 
>> >>> used case is more than adding one :) and the reality hits you
>> >>>
>> >>> Pandas UDF in certain scenarios is actually slow, try using apply 
>> >>> for a custom or pandas function. In fact in certain scenarios I have 
>> >>> found general UDF's work much faster and use much less memory. 
>> >>> Therefore test out your used case (with at least 30 million records) 
>> >>> before trying to use 

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
Hi,

On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy
 wrote:
>
> Thanks Gourav.
>
> Incidentally, since the regular UDF is row-wise, we could optimize that a bit 
> by taking the convert() closure and simply making that the UDF.
>
> Since there's that MGRS object that we have to create too, we could probably 
> optimize it further by applying the UDF via rdd.mapPartitions, which would 
> allow the UDF to instantiate objects once per-partition instead of per-row 
> and then iterate element-wise through the rows of the partition.
>
> All that said, having done the above on prior projects I find the pandas 
> abstractions to be very elegant and friendly to the end-user so I haven't 
> looked back :)
>
> (The common memory model via Arrow is a nice boost too!)

And some tentative SPIPs that aim to use columnar representations
internally in Spark should also bring additional performance gains in the
future.

Cheers
Andrew

>
> On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta  
> wrote:
>>
>> The proof is in the pudding
>>
>> :)
>>
>>
>>
>> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta  
>> wrote:
>>>
>>> Hi Patrick,
>>>
>>> super duper, thanks a ton for sharing the code. Can you please confirm that 
>>> this runs faster than the regular UDF's?
>>>
>>> Interestingly I am also running same transformations using another geo 
>>> spatial library in Python, where I am passing two fields and getting back 
>>> an array.
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy  
>>> wrote:

 Human time is considerably more expensive than computer time, so in that 
 regard, yes :)

 This took me one minute to write and ran fast enough for my needs. If 
 you're willing to provide a comparable scala implementation I'd be happy 
 to compare them.

 @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)

 def generate_mgrs_series(lat_lon_str, level):


 import mgrs

 m = mgrs.MGRS()


 precision_level = 0

 levelval = level[0]


 if levelval == 1000:

precision_level = 2

 if levelval == 100:

precision_level = 3


 def convert(ll_str):

   lat, lon = ll_str.split('_')


   return m.toMGRS(lat, lon,

   MGRSPrecision = precision_level)


 return lat_lon_str.apply(lambda x: convert(x))

 On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta  
 wrote:
>
> And you found the PANDAS UDF more performant ? Can you share your code 
> and prove it?
>
> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy  
> wrote:
>>
>> I disagree that it's hype. Perhaps not 1:1 with pure scala 
>> performance-wise, but for python-based data scientists or others with a 
>> lot of python expertise it allows one to do things that would otherwise 
>> be infeasible at scale.
>>
>> For instance, I recently had to convert latitude / longitude pairs to 
>> MGRS strings 
>> (https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing 
>> a pandas UDF (and putting the mgrs python package into a conda 
>> environment) was _significantly_ easier than any alternative I found.
>>
>> @Rishi - depending on your network is constructed, some lag could come 
>> from just uploading the conda environment. If you load it from hdfs with 
>> --archives does it improve?
>>
>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta 
>>  wrote:
>>>
>>> hi,
>>>
>>> Pandas UDF is a bit of hype. One of their blogs shows the used case of 
>>> adding 1 to a field using Pandas UDF which is pretty much pointless. So 
>>> you go beyond the blog and realise that your actual used case is more 
>>> than adding one :) and the reality hits you
>>>
>>> Pandas UDF in certain scenarios is actually slow, try using apply for a 
>>> custom or pandas function. In fact in certain scenarios I have found 
>>> general UDF's work much faster and use much less memory. Therefore test 
>>> out your used case (with at least 30 million records) before trying to 
>>> use the Pandas UDF option.
>>>
>>> And when you start using GroupMap then you realise after reading 
>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>>>  that "Oh!! now I can run into random OOM errors and the maxrecords 
>>> options does not help at all"
>>>
>>> Excerpt from the above link:
>>> Note that all data for a group will be loaded into memory before the 
>>> function is applied. This can lead to out of memory exceptions, 
>>> especially if the group sizes are skewed. The configuration for 
>>> maxRecordsPerBatch is not applied on groups and it is up to the user to 
>>> ensure that the grouped data will 

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
Thanks Gourav.

Incidentally, since the regular UDF is row-wise, we could optimize that a
bit by taking the convert() closure and simply making that the UDF.
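
A minimal sketch of that row-wise variant, assuming a DataFrame df with a
string column "lat_lon" holding "<lat>_<lon>" values and a fixed precision
(the names and values are illustrative, not from the thread):

from pyspark.sql import functions as F, types as T

def convert(ll_str, precision_level=2):
    # Plain row-wise UDF: the MGRS object is created for every row -- the
    # per-row overhead that the mapPartitions idea below avoids.
    import mgrs
    m = mgrs.MGRS()
    lat, lon = ll_str.split('_')
    return m.toMGRS(lat, lon, MGRSPrecision=precision_level)

convert_udf = F.udf(convert, T.StringType())
with_mgrs = df.withColumn("mgrs", convert_udf(F.col("lat_lon")))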

Since there's that MGRS object that we have to create too, we could
probably optimize it further by applying the UDF via rdd.mapPartitions,
which would allow the UDF to instantiate objects once per-partition instead
of per-row and then iterate element-wise through the rows of the partition.
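
And a hedged sketch of the mapPartitions version under the same assumptions
(a DataFrame df with a "lat_lon" string column; names are illustrative):

from pyspark.sql import Row

def convert_partition(rows, precision_level=2):
    # Instantiate the MGRS object once per partition, then iterate over the
    # partition's rows element-wise.
    import mgrs
    m = mgrs.MGRS()
    for row in rows:
        lat, lon = row["lat_lon"].split('_')
        yield Row(lat_lon=row["lat_lon"],
                  mgrs=m.toMGRS(lat, lon, MGRSPrecision=precision_level))

with_mgrs = df.rdd.mapPartitions(convert_partition).toDF()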

All that said, having done the above on prior projects I find the pandas
abstractions to be very elegant and friendly to the end-user so I haven't
looked back :)

(The common memory model via Arrow is a nice boost too!)

On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta 
wrote:

> The proof is in the pudding
>
> :)
>
>
>
> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta 
> wrote:
>
>> Hi Patrick,
>>
>> super duper, thanks a ton for sharing the code. Can you please confirm
>> that this runs faster than the regular UDF's?
>>
>> Interestingly I am also running same transformations using another geo
>> spatial library in Python, where I am passing two fields and getting back
>> an array.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy 
>> wrote:
>>
>>> Human time is considerably more expensive than computer time, so in that
>>> regard, yes :)
>>>
>>> This took me one minute to write and ran fast enough for my needs. If
>>> you're willing to provide a comparable scala implementation I'd be happy to
>>> compare them.
>>>
>>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
>>>
>>> def generate_mgrs_series(lat_lon_str, level):
>>>
>>> import mgrs
>>>
>>> m = mgrs.MGRS()
>>>
>>> precision_level = 0
>>>
>>> levelval = level[0]
>>>
>>> if levelval == 1000:
>>>
>>>precision_level = 2
>>>
>>> if levelval == 100:
>>>
>>>precision_level = 3
>>>
>>> def convert(ll_str):
>>>
>>>   lat, lon = ll_str.split('_')
>>>
>>>   return m.toMGRS(lat, lon,
>>>
>>>   MGRSPrecision = precision_level)
>>>
>>> return lat_lon_str.apply(lambda x: convert(x))
>>>
>>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
 And you found the PANDAS UDF more performant ? Can you share your code
 and prove it?

 On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <
 pmccar...@dstillery.com> wrote:

> I disagree that it's hype. Perhaps not 1:1 with pure scala
> performance-wise, but for python-based data scientists or others with a 
> lot
> of python expertise it allows one to do things that would otherwise be
> infeasible at scale.
>
> For instance, I recently had to convert latitude / longitude pairs to
> MGRS strings (
> https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
> Writing a pandas UDF (and putting the mgrs python package into a conda
> environment) was _significantly_ easier than any alternative I found.
>
> @Rishi - depending on your network is constructed, some lag could come
> from just uploading the conda environment. If you load it from hdfs with
> --archives does it improve?
>
> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> hi,
>>
>> Pandas UDF is a bit of hype. One of their blogs shows the used case
>> of adding 1 to a field using Pandas UDF which is pretty much pointless. 
>> So
>> you go beyond the blog and realise that your actual used case is more 
>> than
>> adding one :) and the reality hits you
>>
>> Pandas UDF in certain scenarios is actually slow, try using apply for
>> a custom or pandas function. In fact in certain scenarios I have found
>> general UDF's work much faster and use much less memory. Therefore test 
>> out
>> your used case (with at least 30 million records) before trying to use 
>> the
>> Pandas UDF option.
>>
>> And when you start using GroupMap then you realise after reading
>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>> that "Oh!! now I can run into random OOM errors and the maxrecords 
>> options
>> does not help at all"
>>
>> Excerpt from the above link:
>> Note that all data for a group will be loaded into memory before the
>> function is applied. This can lead to out of memory exceptions, especially
>> if the group sizes are skewed. The configuration for maxRecordsPerBatch is
>> not applied on groups and it is up to the user to ensure that the grouped
>> data will fit into the available memory.
>>
>> Let me know about your used case in case possible
>>
>>
>> Regards,
>> Gourav
>>
>> On Sun, May 5, 2019 at 3:59 AM 

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
The proof is in the pudding

:)



On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta 
wrote:

> Hi Patrick,
>
> super duper, thanks a ton for sharing the code. Can you please confirm
> that this runs faster than the regular UDF's?
>
> Interestingly I am also running same transformations using another geo
> spatial library in Python, where I am passing two fields and getting back
> an array.
>
>
> Regards,
> Gourav Sengupta
>
> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy 
> wrote:
>
>> Human time is considerably more expensive than computer time, so in that
>> regard, yes :)
>>
>> This took me one minute to write and ran fast enough for my needs. If
>> you're willing to provide a comparable scala implementation I'd be happy to
>> compare them.
>>
>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
>>
>> def generate_mgrs_series(lat_lon_str, level):
>>
>> import mgrs
>>
>> m = mgrs.MGRS()
>>
>> precision_level = 0
>>
>> levelval = level[0]
>>
>> if levelval == 1000:
>>
>>precision_level = 2
>>
>> if levelval == 100:
>>
>>precision_level = 3
>>
>> def convert(ll_str):
>>
>>   lat, lon = ll_str.split('_')
>>
>>   return m.toMGRS(lat, lon,
>>
>>   MGRSPrecision = precision_level)
>>
>> return lat_lon_str.apply(lambda x: convert(x))
>>
>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta 
>> wrote:
>>
>>> And you found the PANDAS UDF more performant ? Can you share your code
>>> and prove it?
>>>
>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy 
>>> wrote:
>>>
 I disagree that it's hype. Perhaps not 1:1 with pure scala
 performance-wise, but for python-based data scientists or others with a lot
 of python expertise it allows one to do things that would otherwise be
 infeasible at scale.

 For instance, I recently had to convert latitude / longitude pairs to
 MGRS strings (
 https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing
 a pandas UDF (and putting the mgrs python package into a conda environment)
 was _significantly_ easier than any alternative I found.

 @Rishi - depending on your network is constructed, some lag could come
 from just uploading the conda environment. If you load it from hdfs with
 --archives does it improve?

 On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <
 gourav.sengu...@gmail.com> wrote:

> hi,
>
> Pandas UDF is a bit of hype. One of their blogs shows the used case of
> adding 1 to a field using Pandas UDF which is pretty much pointless. So 
> you
> go beyond the blog and realise that your actual used case is more than
> adding one :) and the reality hits you
>
> Pandas UDF in certain scenarios is actually slow, try using apply for
> a custom or pandas function. In fact in certain scenarios I have found
> general UDF's work much faster and use much less memory. Therefore test 
> out
> your used case (with at least 30 million records) before trying to use the
> Pandas UDF option.
>
> And when you start using GroupMap then you realise after reading
> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
> that "Oh!! now I can run into random OOM errors and the maxrecords options
> does not help at all"
>
> Excerpt from the above link:
> Note that all data for a group will be loaded into memory before the
> function is applied. This can lead to out of memory exceptions, especially
> if the group sizes are skewed. The configuration for maxRecordsPerBatch is
> not applied on groups and it is up to the user to ensure that the grouped
> data will fit into the available memory.
>
> Let me know about your used case in case possible
>
>
> Regards,
> Gourav
>
> On Sun, May 5, 2019 at 3:59 AM Rishi Shah 
> wrote:
>
>> Thanks Patrick! I tried to package it according to this instructions,
>> it got distributed on the cluster however the same spark program that 
>> takes
>> 5 mins without pandas UDF has started to take 25mins...
>>
>> Have you experienced anything like this? Also is Pyarrow 0.12
>> supported with Spark 2.3 (according to documentation, it should be fine)?
>>
>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <
>> pmccar...@dstillery.com> wrote:
>>
>>> Hi Rishi,
>>>
>>> I've had success using the approach outlined here:
>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>
>>> Does this work for you?
>>>
>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <
>>> rishishah.s...@gmail.com> wrote:
>>>
 modified the subject & would like to clarify that I am looking to
 create an 

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
Hi Patrick,

Super duper, thanks a ton for sharing the code. Can you please confirm that
this runs faster than the regular UDFs?

Interestingly, I am also running the same transformations using another
geospatial library in Python, where I am passing two fields and getting back
an array.


Regards,
Gourav Sengupta

On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy 
wrote:

> Human time is considerably more expensive than computer time, so in that
> regard, yes :)
>
> This took me one minute to write and ran fast enough for my needs. If
> you're willing to provide a comparable scala implementation I'd be happy to
> compare them.
>
> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
>
> def generate_mgrs_series(lat_lon_str, level):
>
> import mgrs
>
> m = mgrs.MGRS()
>
> precision_level = 0
>
> levelval = level[0]
>
> if levelval == 1000:
>
>precision_level = 2
>
> if levelval == 100:
>
>precision_level = 3
>
> def convert(ll_str):
>
>   lat, lon = ll_str.split('_')
>
>   return m.toMGRS(lat, lon,
>
>   MGRSPrecision = precision_level)
>
> return lat_lon_str.apply(lambda x: convert(x))
>
> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta 
> wrote:
>
>> And you found the PANDAS UDF more performant ? Can you share your code
>> and prove it?
>>
>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy 
>> wrote:
>>
>>> I disagree that it's hype. Perhaps not 1:1 with pure scala
>>> performance-wise, but for python-based data scientists or others with a lot
>>> of python expertise it allows one to do things that would otherwise be
>>> infeasible at scale.
>>>
>>> For instance, I recently had to convert latitude / longitude pairs to
>>> MGRS strings (
>>> https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing
>>> a pandas UDF (and putting the mgrs python package into a conda environment)
>>> was _significantly_ easier than any alternative I found.
>>>
>>> @Rishi - depending on your network is constructed, some lag could come
>>> from just uploading the conda environment. If you load it from hdfs with
>>> --archives does it improve?
>>>
>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
 hi,

 Pandas UDF is a bit of hype. One of their blogs shows the used case of
 adding 1 to a field using Pandas UDF which is pretty much pointless. So you
 go beyond the blog and realise that your actual used case is more than
 adding one :) and the reality hits you

 Pandas UDF in certain scenarios is actually slow, try using apply for a
 custom or pandas function. In fact in certain scenarios I have found
 general UDF's work much faster and use much less memory. Therefore test out
 your used case (with at least 30 million records) before trying to use the
 Pandas UDF option.

 And when you start using GroupMap then you realise after reading
 https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
 that "Oh!! now I can run into random OOM errors and the maxrecords options
 does not help at all"

 Excerpt from the above link:
 Note that all data for a group will be loaded into memory before the
 function is applied. This can lead to out of memory exceptions, especially
 if the group sizes are skewed. The configuration for maxRecordsPerBatch is
 not applied on groups and it is up to the user to ensure that the grouped
 data will fit into the available memory.

 Let me know about your used case in case possible


 Regards,
 Gourav

 On Sun, May 5, 2019 at 3:59 AM Rishi Shah 
 wrote:

> Thanks Patrick! I tried to package it according to this instructions,
> it got distributed on the cluster however the same spark program that 
> takes
> 5 mins without pandas UDF has started to take 25mins...
>
> Have you experienced anything like this? Also is Pyarrow 0.12
> supported with Spark 2.3 (according to documentation, it should be fine)?
>
> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <
> pmccar...@dstillery.com> wrote:
>
>> Hi Rishi,
>>
>> I've had success using the approach outlined here:
>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>
>> Does this work for you?
>>
>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah 
>> wrote:
>>
>>> modified the subject & would like to clarify that I am looking to
>>> create an anaconda parcel with pyarrow and other libraries, so that I 
>>> can
>>> distribute it on the cloudera cluster..
>>>
>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <
>>> rishishah.s...@gmail.com> wrote:
>>>
 Hi All,

 I have been trying to 

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Patrick McCarthy
Human time is considerably more expensive than computer time, so in that
regard, yes :)

This took me one minute to write and ran fast enough for my needs. If
you're willing to provide a comparable scala implementation I'd be happy to
compare them.

from pyspark.sql import functions as F, types as T


@F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
def generate_mgrs_series(lat_lon_str, level):
    # Imported inside the UDF so the executor picks it up from the shipped
    # conda environment.
    import mgrs
    m = mgrs.MGRS()

    # Map the requested grid size to an MGRS precision level; level arrives
    # as a pandas Series, so use the first value of the batch.
    precision_level = 0
    levelval = level[0]
    if levelval == 1000:
        precision_level = 2
    if levelval == 100:
        precision_level = 3

    def convert(ll_str):
        lat, lon = ll_str.split('_')
        return m.toMGRS(lat, lon, MGRSPrecision=precision_level)

    # lat_lon_str is a pandas Series of "lat_lon" strings; convert element-wise.
    return lat_lon_str.apply(convert)
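
For context, a minimal sketch of how this UDF would be invoked; the column
names ("lat_lon", "level") and the sample rows are assumptions made for the
example, not taken from the thread:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: "lat_lon" holds "<lat>_<lon>" strings, "level" the grid size.
df = spark.createDataFrame(
    [("40.7128_-74.0060", 1000), ("34.0522_-118.2437", 1000)],
    ["lat_lon", "level"],
)

# Both columns are handed to the pandas UDF as whole Arrow batches.
df.withColumn("mgrs", generate_mgrs_series(F.col("lat_lon"), F.col("level"))).show()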

On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta 
wrote:

> And you found the PANDAS UDF more performant ? Can you share your code and
> prove it?
>
> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy 
> wrote:
>
>> I disagree that it's hype. Perhaps not 1:1 with pure scala
>> performance-wise, but for python-based data scientists or others with a lot
>> of python expertise it allows one to do things that would otherwise be
>> infeasible at scale.
>>
>> For instance, I recently had to convert latitude / longitude pairs to
>> MGRS strings (
>> https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing a
>> pandas UDF (and putting the mgrs python package into a conda environment)
>> was _significantly_ easier than any alternative I found.
>>
>> @Rishi - depending on your network is constructed, some lag could come
>> from just uploading the conda environment. If you load it from hdfs with
>> --archives does it improve?
>>
>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta 
>> wrote:
>>
>>> hi,
>>>
>>> Pandas UDF is a bit of hype. One of their blogs shows the used case of
>>> adding 1 to a field using Pandas UDF which is pretty much pointless. So you
>>> go beyond the blog and realise that your actual used case is more than
>>> adding one :) and the reality hits you
>>>
>>> Pandas UDF in certain scenarios is actually slow, try using apply for a
>>> custom or pandas function. In fact in certain scenarios I have found
>>> general UDF's work much faster and use much less memory. Therefore test out
>>> your used case (with at least 30 million records) before trying to use the
>>> Pandas UDF option.
>>>
>>> And when you start using GroupMap then you realise after reading
>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>>> that "Oh!! now I can run into random OOM errors and the maxrecords options
>>> does not help at all"
>>>
>>> Excerpt from the above link:
>>> Note that all data for a group will be loaded into memory before the
>>> function is applied. This can lead to out of memory exceptions, especially
>>> if the group sizes are skewed. The configuration for maxRecordsPerBatch is
>>> not applied on groups and it is up to the user to ensure that the grouped
>>> data will fit into the available memory.
>>>
>>> Let me know about your used case in case possible
>>>
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah 
>>> wrote:
>>>
 Thanks Patrick! I tried to package it according to this instructions,
 it got distributed on the cluster however the same spark program that takes
 5 mins without pandas UDF has started to take 25mins...

 Have you experienced anything like this? Also is Pyarrow 0.12 supported
 with Spark 2.3 (according to documentation, it should be fine)?

 On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <
 pmccar...@dstillery.com> wrote:

> Hi Rishi,
>
> I've had success using the approach outlined here:
> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>
> Does this work for you?
>
> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah 
> wrote:
>
>> modified the subject & would like to clarify that I am looking to
>> create an anaconda parcel with pyarrow and other libraries, so that I can
>> distribute it on the cloudera cluster..
>>
>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have been trying to figure out a way to build anaconda parcel with
>>> pyarrow included for my cloudera managed server for distribution but 
>>> this
>>> doesn't seem to work right. Could someone please help?
>>>
>>> I have tried to install anaconda on one of the management nodes on
>>> cloudera cluster... tarred the directory, but this directory doesn't
>>> include all the packages to form a proper parcel for distribution.
>>>
>>> Any help is much appreciated!
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>>
>>
>>
>> --
>> Regards,

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Gourav Sengupta
And you found the PANDAS UDF more performant? Can you share your code and
prove it?

On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy 
wrote:

> I disagree that it's hype. Perhaps not 1:1 with pure scala
> performance-wise, but for python-based data scientists or others with a lot
> of python expertise it allows one to do things that would otherwise be
> infeasible at scale.
>
> For instance, I recently had to convert latitude / longitude pairs to MGRS
> strings (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
> Writing a pandas UDF (and putting the mgrs python package into a conda
> environment) was _significantly_ easier than any alternative I found.
>
> @Rishi - depending on your network is constructed, some lag could come
> from just uploading the conda environment. If you load it from hdfs with
> --archives does it improve?
>
> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta 
> wrote:
>
>> hi,
>>
>> Pandas UDF is a bit of hype. One of their blogs shows the used case of
>> adding 1 to a field using Pandas UDF which is pretty much pointless. So you
>> go beyond the blog and realise that your actual used case is more than
>> adding one :) and the reality hits you
>>
>> Pandas UDF in certain scenarios is actually slow, try using apply for a
>> custom or pandas function. In fact in certain scenarios I have found
>> general UDF's work much faster and use much less memory. Therefore test out
>> your used case (with at least 30 million records) before trying to use the
>> Pandas UDF option.
>>
>> And when you start using GroupMap then you realise after reading
>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>> that "Oh!! now I can run into random OOM errors and the maxrecords options
>> does not help at all"
>>
>> Excerpt from the above link:
>> Note that all data for a group will be loaded into memory before the
>> function is applied. This can lead to out of memory exceptions, especially
>> if the group sizes are skewed. The configuration for maxRecordsPerBatch is
>> not applied on groups and it is up to the user to ensure that the grouped
>> data will fit into the available memory.
>>
>> Let me know about your used case in case possible
>>
>>
>> Regards,
>> Gourav
>>
>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah 
>> wrote:
>>
>>> Thanks Patrick! I tried to package it according to this instructions, it
>>> got distributed on the cluster however the same spark program that takes 5
>>> mins without pandas UDF has started to take 25mins...
>>>
>>> Have you experienced anything like this? Also is Pyarrow 0.12 supported
>>> with Spark 2.3 (according to documentation, it should be fine)?
>>>
>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <
>>> pmccar...@dstillery.com> wrote:
>>>
 Hi Rishi,

 I've had success using the approach outlined here:
 https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html

 Does this work for you?

 On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah 
 wrote:

> modified the subject & would like to clarify that I am looking to
> create an anaconda parcel with pyarrow and other libraries, so that I can
> distribute it on the cloudera cluster..
>
> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah 
> wrote:
>
>> Hi All,
>>
>> I have been trying to figure out a way to build anaconda parcel with
>> pyarrow included for my cloudera managed server for distribution but this
>> doesn't seem to work right. Could someone please help?
>>
>> I have tried to install anaconda on one of the management nodes on
>> cloudera cluster... tarred the directory, but this directory doesn't
>> include all the packages to form a proper parcel for distribution.
>>
>> Any help is much appreciated!
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
>
> --
> Regards,
>
> Rishi Shah
>


 --


 *Patrick McCarthy  *

 Senior Data Scientist, Machine Learning Engineering

 Dstillery

 470 Park Ave South, 17th Floor, NYC 10016

>>>
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>>
>>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-05 Thread Patrick McCarthy
I disagree that it's hype. Perhaps not 1:1 with pure scala
performance-wise, but for python-based data scientists or others with a lot
of python expertise it allows one to do things that would otherwise be
infeasible at scale.

For instance, I recently had to convert latitude / longitude pairs to MGRS
strings (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
Writing a pandas UDF (and putting the mgrs python package into a conda
environment) was _significantly_ easier than any alternative I found.

@Rishi - depending on how your network is constructed, some lag could come
from just uploading the conda environment. If you load it from hdfs with
--archives, does it improve?
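
For anyone trying this, a hedged sketch of the hdfs + --archives pattern via
session configuration; the archive path, alias, and interpreter path are
placeholders, and the exact config keys should be checked against your own
Spark/YARN setup:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mgrs-with-conda-env")
    # Intended as the equivalent of
    # `spark-submit --archives hdfs:///envs/anaconda_env.zip#ANACONDA`:
    # executors fetch the archive from hdfs instead of the client re-uploading it.
    .config("spark.yarn.dist.archives", "hdfs:///envs/anaconda_env.zip#ANACONDA")
    # Point the Python workers at the interpreter unpacked from that archive.
    .config("spark.pyspark.python", "./ANACONDA/anaconda_env/bin/python")
    .getOrCreate()
)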

On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta 
wrote:

> hi,
>
> Pandas UDF is a bit of hype. One of their blogs shows the used case of
> adding 1 to a field using Pandas UDF which is pretty much pointless. So you
> go beyond the blog and realise that your actual used case is more than
> adding one :) and the reality hits you
>
> Pandas UDF in certain scenarios is actually slow, try using apply for a
> custom or pandas function. In fact in certain scenarios I have found
> general UDF's work much faster and use much less memory. Therefore test out
> your used case (with at least 30 million records) before trying to use the
> Pandas UDF option.
>
> And when you start using GroupMap then you realise after reading
> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
> that "Oh!! now I can run into random OOM errors and the maxrecords options
> does not help at all"
>
> Excerpt from the above link:
> Note that all data for a group will be loaded into memory before the
> function is applied. This can lead to out of memory exceptions, especially
> if the group sizes are skewed. The configuration for maxRecordsPerBatch is
> not applied on groups and it is up to the user to ensure that the grouped
> data will fit into the available memory.
>
> Let me know about your used case in case possible
>
>
> Regards,
> Gourav
>
> On Sun, May 5, 2019 at 3:59 AM Rishi Shah 
> wrote:
>
>> Thanks Patrick! I tried to package it according to this instructions, it
>> got distributed on the cluster however the same spark program that takes 5
>> mins without pandas UDF has started to take 25mins...
>>
>> Have you experienced anything like this? Also is Pyarrow 0.12 supported
>> with Spark 2.3 (according to documentation, it should be fine)?
>>
>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy 
>> wrote:
>>
>>> Hi Rishi,
>>>
>>> I've had success using the approach outlined here:
>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>
>>> Does this work for you?
>>>
>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah 
>>> wrote:
>>>
 modified the subject & would like to clarify that I am looking to
 create an anaconda parcel with pyarrow and other libraries, so that I can
 distribute it on the cloudera cluster..

 On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah 
 wrote:

> Hi All,
>
> I have been trying to figure out a way to build anaconda parcel with
> pyarrow included for my cloudera managed server for distribution but this
> doesn't seem to work right. Could someone please help?
>
> I have tried to install anaconda on one of the management nodes on
> cloudera cluster... tarred the directory, but this directory doesn't
> include all the packages to form a proper parcel for distribution.
>
> Any help is much appreciated!
>
> --
> Regards,
>
> Rishi Shah
>


 --
 Regards,

 Rishi Shah

>>>
>>>
>>> --
>>>
>>>
>>> *Patrick McCarthy  *
>>>
>>> Senior Data Scientist, Machine Learning Engineering
>>>
>>> Dstillery
>>>
>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>
>>
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>

-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-05 Thread Gourav Sengupta
hi,

Pandas UDFs are a bit of a hype. One of the blogs about them shows the use
case of adding 1 to a field with a Pandas UDF, which is pretty much
pointless. So you go beyond the blog, realise that your actual use case is
more than adding one :), and the reality hits you.

Pandas UDFs are actually slow in certain scenarios; try using apply with a
custom or pandas function. In fact, in certain scenarios I have found that
general UDFs work much faster and use much less memory. Therefore, test out
your use case (with at least 30 million records) before trying to use the
Pandas UDF option.

And when you start using GroupMap you realise, after reading
https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
that "Oh!! now I can run into random OOM errors and the maxRecordsPerBatch
option does not help at all".

Excerpt from the above link:
Note that all data for a group will be loaded into memory before the
function is applied. This can lead to out of memory exceptions, especially
if the group sizes are skewed. The configuration for maxRecordsPerBatch is
not applied on groups and it is up to the user to ensure that the grouped
data will fit into the available memory.
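
To make the excerpt concrete, a hedged sketch of a grouped-map Pandas UDF;
the column names, schema, and batch size are illustrative, and note that
spark.sql.execution.arrow.maxRecordsPerBatch only bounds the batches fed to
scalar Pandas UDFs, not the per-group DataFrames used here:

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Bounds Arrow batches for scalar Pandas UDFs; it does NOT split up groups.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 5000)

schema = T.StructType([
    T.StructField("group_id", T.LongType()),
    T.StructField("value", T.DoubleType()),
])

@F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
def demean(pdf):
    # The whole group arrives as one pandas DataFrame, so a heavily skewed
    # group can exhaust executor memory regardless of the batch-size setting.
    return pdf.assign(value=pdf["value"] - pdf["value"].mean())

df = spark.range(1000).select(
    (F.col("id") % 10).alias("group_id"),
    F.rand(seed=42).alias("value"),
)
df.groupBy("group_id").apply(demean).show(5)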

Let me know about your use case if possible.


Regards,
Gourav

On Sun, May 5, 2019 at 3:59 AM Rishi Shah  wrote:

> Thanks Patrick! I tried to package it according to this instructions, it
> got distributed on the cluster however the same spark program that takes 5
> mins without pandas UDF has started to take 25mins...
>
> Have you experienced anything like this? Also is Pyarrow 0.12 supported
> with Spark 2.3 (according to documentation, it should be fine)?
>
> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy 
> wrote:
>
>> Hi Rishi,
>>
>> I've had success using the approach outlined here:
>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>
>> Does this work for you?
>>
>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah 
>> wrote:
>>
>>> modified the subject & would like to clarify that I am looking to create
>>> an anaconda parcel with pyarrow and other libraries, so that I can
>>> distribute it on the cloudera cluster..
>>>
>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah 
>>> wrote:
>>>
 Hi All,

 I have been trying to figure out a way to build anaconda parcel with
 pyarrow included for my cloudera managed server for distribution but this
 doesn't seem to work right. Could someone please help?

 I have tried to install anaconda on one of the management nodes on
 cloudera cluster... tarred the directory, but this directory doesn't
 include all the packages to form a proper parcel for distribution.

 Any help is much appreciated!

 --
 Regards,

 Rishi Shah

>>>
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>>
>>
>>
>> --
>>
>>
>> *Patrick McCarthy  *
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>
>
> --
> Regards,
>
> Rishi Shah
>


Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-04 Thread Rishi Shah
Thanks Patrick! I tried to package it according to these instructions, and it
got distributed on the cluster; however, the same spark program that takes 5
mins without the pandas UDF has started to take 25 mins...

Have you experienced anything like this? Also, is Pyarrow 0.12 supported
with Spark 2.3 (according to the documentation, it should be fine)?

On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy 
wrote:

> Hi Rishi,
>
> I've had success using the approach outlined here:
> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>
> Does this work for you?
>
> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah 
> wrote:
>
>> modified the subject & would like to clarify that I am looking to create
>> an anaconda parcel with pyarrow and other libraries, so that I can
>> distribute it on the cloudera cluster..
>>
>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have been trying to figure out a way to build anaconda parcel with
>>> pyarrow included for my cloudera managed server for distribution but this
>>> doesn't seem to work right. Could someone please help?
>>>
>>> I have tried to install anaconda on one of the management nodes on
>>> cloudera cluster... tarred the directory, but this directory doesn't
>>> include all the packages to form a proper parcel for distribution.
>>>
>>> Any help is much appreciated!
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>>
>>
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


-- 
Regards,

Rishi Shah


Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-04-30 Thread Patrick McCarthy
Hi Rishi,

I've had success using the approach outlined here:
https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html

Does this work for you?

On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah 
wrote:

> modified the subject & would like to clarify that I am looking to create
> an anaconda parcel with pyarrow and other libraries, so that I can
> distribute it on the cloudera cluster..
>
> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah 
> wrote:
>
>> Hi All,
>>
>> I have been trying to figure out a way to build anaconda parcel with
>> pyarrow included for my cloudera managed server for distribution but this
>> doesn't seem to work right. Could someone please help?
>>
>> I have tried to install anaconda on one of the management nodes on
>> cloudera cluster... tarred the directory, but this directory doesn't
>> include all the packages to form a proper parcel for distribution.
>>
>> Any help is much appreciated!
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
>
> --
> Regards,
>
> Rishi Shah
>


-- 


*Patrick McCarthy  *

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-04-29 Thread Rishi Shah
Modified the subject & would like to clarify that I am looking to create an
anaconda parcel with pyarrow and other libraries, so that I can distribute
it on the cloudera cluster.

On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah 
wrote:

> Hi All,
>
> I have been trying to figure out a way to build anaconda parcel with
> pyarrow included for my cloudera managed server for distribution but this
> doesn't seem to work right. Could someone please help?
>
> I have tried to install anaconda on one of the management nodes on
> cloudera cluster... tarred the directory, but this directory doesn't
> include all the packages to form a proper parcel for distribution.
>
> Any help is much appreciated!
>
> --
> Regards,
>
> Rishi Shah
>


-- 
Regards,

Rishi Shah