[ https://issues.apache.org/jira/browse/HUDI-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311827#comment-17311827 ]

satish commented on HUDI-1690:
------------------------------

Yes, this is sort of a known issue. Is it possible to schedule multiple 
clustering operations (one clustering operation for every 50 partitions or 
something like that)? A rough sketch of that batching idea is below. We can 
also try using sparkContext.union, but IIRC even that has some limitations, so 
I don't think it would work for 3000 partitions (at least on Spark 2 it didn't 
scale well; I can test it again with Spark 3).
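
Something like this for the grouping (just a sketch; how each group actually 
gets submitted as a clustering operation depends on the job setup, so that 
part is left out):

{code:java}
import java.util.ArrayList;
import java.util.List;

class ClusteringBatches {

  // Split the table's partition paths into groups of batchSize (e.g. 50),
  // so each group can be scheduled as its own clustering operation instead
  // of one giant plan covering all ~3000 partitions.
  static List<List<String>> split(List<String> partitionPaths, int batchSize) {
    List<List<String>> groups = new ArrayList<>();
    for (int i = 0; i < partitionPaths.size(); i += batchSize) {
      groups.add(new ArrayList<>(
          partitionPaths.subList(i, Math.min(i + batchSize, partitionPaths.size()))));
    }
    return groups;
  }
}
{code}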

> Fix StackOverflowError while running clustering with large number of 
> partitions
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-1690
>                 URL: https://issues.apache.org/jira/browse/HUDI-1690
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>    Affects Versions: 0.9.0
>            Reporter: Rong Ma
>            Priority: Major
>              Labels: sev:high, user-support-issues
>             Fix For: 0.9.0
>
>
> We are testing clustering on a hudi table with about 3000 partitions. The 
> spark driver throws StackOverflowError before all the partitions are sorted:
> 21/03/11 19:51:20 ERROR [main] UtilHelpers: Cluster failed
>  java.lang.StackOverflowError
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at org.apache.spark.RangePartitioner.$anonfun$writeObject$1(Partitioner.scala:261)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
>  at org.apache.spark.RangePartitioner.writeObject(Partitioner.scala:254)
>  at sun.reflect.GeneratedMethodAccessor201.invoke(Unknown Source)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:477)
>  at sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> ...
>  
> I see a similar issue here:
> [https://stackoverflow.com/questions/30522564/spark-when-union-a-lot-of-rdd-throws-stack-overflow-error]
> Setting the driver's stack size to 100M still hits this error. So this is 
> probably because rdd.union has been called too many times and the resulting 
> RDD lineage is too large. I think we should use JavaSparkContext.union 
> instead of RDD.union here: 
> [https://github.com/apache/hudi/blob/e93c6a569310ce55c5a0fc0655328e7fd32a9da2/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/cluster/SparkExecuteClusteringCommitActionExecutor.java#L96]
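>
> For illustration, a minimal sketch of the two shapes (the String element 
> type and the rdds list are placeholders; the real call site unions the 
> JavaRDD<WriteStatus> results produced per clustering group):
> {code:java}
> import java.util.List;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
>
> class UnionSketch {
>
>   // Chained RDD.union: every call nests the previous result inside a new
>   // two-parent UnionRDD, so ~3000 inputs build a lineage ~3000 levels deep,
>   // and the driver overflows its stack serializing that nested structure.
>   static JavaRDD<String> chained(List<JavaRDD<String>> rdds) {
>     JavaRDD<String> result = rdds.get(0);
>     for (int i = 1; i < rdds.size(); i++) {
>       result = result.union(rdds.get(i)); // lineage grows one level per call
>     }
>     return result;
>   }
>
>   // JavaSparkContext.union: a single UnionRDD with all inputs as direct
>   // parents, so lineage depth stays flat regardless of input count.
>   // (The overload differs by Spark version: 2.x has union(first, rest),
>   // 3.x has varargs union(JavaRDD...); the array form below targets 3.x.)
>   @SuppressWarnings({"unchecked", "rawtypes"})
>   static JavaRDD<String> flat(JavaSparkContext jsc, List<JavaRDD<String>> rdds) {
>     return jsc.union(rdds.toArray(new JavaRDD[0]));
>   }
> }
> {code}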


