[ 
https://issues.apache.org/jira/browse/SPARK-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879975#comment-16879975
 ] 

Hyukjin Kwon commented on SPARK-28269:
--------------------------------------

Seems like I can't open the image. Could you specify the pandas, PyArrow, and
Python versions, and provide a full reproducer if possible?
It would also be better to paste the error message and stack trace as text rather than an image.
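
For reference, a small snippet like the following (only stdlib `importlib`/`sys`; assumed to be run in the affected environment) prints the requested versions:

```python
import importlib
import sys

# Print the interpreter version plus the versions of the libraries
# involved in the Arrow serialization path, if they are installed.
print("Python :", sys.version.split()[0])
for name in ("pandas", "pyarrow", "pyspark"):
    try:
        module = importlib.import_module(name)
        print(name.ljust(7), ":", module.__version__)
    except ImportError:
        print(name.ljust(7), ": not installed")
```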

> ArrowStreamPandasSerializer gets stuck
> --------------------------------------
>
>                 Key: SPARK-28269
>                 URL: https://issues.apache.org/jira/browse/SPARK-28269
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: Modi Tamam
>            Priority: Major
>         Attachments: Untitled.xcf
>
>
> I'm working with PySpark version 2.4.3.
> I have a big data frame:
>  * ~15M rows
>  * ~130 columns
>  * ~2.5 GB - converting it to a pandas data frame and pickling it
> (pandas_df.to_pickle() ) produced a file of about 2.5 GB.
> I have some code that groups this data frame and applies a pandas UDF:
>  
> {code:python}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql import functions as F
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow as pa
>
> non_issued_patch = "31.7996378000_35.2114362000"
> issued_patch = "31.7995787833_35.2121463045"
>
> @pandas_udf("patch_name string", PandasUDFType.GROUPED_MAP)
> def foo(pdf):
>     # Return a one-row frame carrying the group's patch_name.
>     return pd.DataFrame({'patch_name': [pdf['patch_name'].iloc[0]]})
>
> full_df = spark.read.parquet('debug-mega-patch')
> df = full_df.filter(F.col("grouping_column") == issued_patch).cache()
> df.groupBy("grouping_column").apply(foo).repartition(1).write.mode('overwrite').parquet('debug-df/')
> {code}
>  
> The above code gets stuck in the ArrowStreamPandasSerializer, on the first
> line, when reading a batch from the reader:
>  
> {code:python}
> for batch in reader:
>     yield [self.arrow_to_pandas(c) for c in
>            pa.Table.from_batches([batch]).itercolumns()]
> {code}
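
For what it's worth, the per-group logic of foo can be exercised with plain pandas, independent of Spark, to check whether the problem is in the UDF itself or in the Arrow serialization path. A minimal sketch with made-up two-group data:

```python
import pandas as pd

def foo(pdf):
    # Same per-group logic as the grouped-map UDF above: one output row
    # carrying the group's patch_name.
    return pd.DataFrame({'patch_name': [pdf['patch_name'].iloc[0]]})

# Made-up miniature of the real data: two groups.
df = pd.DataFrame({
    'grouping_column': ['a', 'a', 'b'],
    'patch_name': ['p1', 'p1', 'p2'],
})

# Emulate groupBy(...).apply(foo) locally.
result = pd.concat(
    [foo(group) for _, group in df.groupby('grouping_column')],
    ignore_index=True,
)
print(result)
```

If this completes instantly on a representative sample while the Spark job hangs, the UDF body can likely be ruled out.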



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
