You need to increase the values of the following parameters to keep the number of dictionary bucket files from growing too large:

kylin.dictionary.globalV2-threshold-bucket-size=500000
kylin.dictionary.globalV2-init-load-factor=0.5
kylin.dictionary.globalV2-bucket-overhead-factor=1.5
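For illustration only (this number is not a recommendation from this thread, tune it for your own data volume), an override in kylin.properties might look like:

kylin.dictionary.globalV2-threshold-bucket-size=5000000

Allowing more entries per bucket should mean the dictionary is split into fewer, larger bucket files, so fewer files are written during the build.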
It is also recommended to sync the latest code and use the global dictionary V3; you will find that performance improves greatly.

On Wed, Dec 20, 2023 at 10:47 AM Li, Can <c...@ebay.com.invalid> wrote:

> When adding a count_distinct measure and building the global dictionary, is
> the size of each dictionary file fixed, and can the size of the generated
> files be changed? Looking at the generated files, each one seems to be
> around 8 MB. We now have a job with a fairly large data volume, on the
> order of hundreds of billions of rows, so a very large number of files is
> written while building the dictionary, and we keep getting Premature EOF
> errors.
>
> 2023-12-18T20:05:43,304 INFO [logger-thread-0] scheduler.DAGScheduler :
> ResultStage 24 (foreachPartition at DFDictionaryBuilder.scala:94) failed in
> 36.866 s due to Job aborted due to stage failure: Task 1560 in stage 24.0
> failed 4 times, most recent failure: Lost task 1560.3 in stage 24.0 (TID
> 1928) (hdc42-mcc10-01-0510-3303-067-tess0097.stratus.rno.ebay.com
> executor 25): java.io.IOException: Premature EOF from inputStream
>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:204)
>     at org.apache.spark.dict.NGlobalDictHDFSStore.getBucketDict(NGlobalDictHDFSStore.java:177)
>     at org.apache.spark.dict.NGlobalDictHDFSStore.getBucketDict(NGlobalDictHDFSStore.java:162)
>     at org.apache.spark.dict.NBucketDictionary.<init>(NBucketDictionary.java:50)
>     at org.apache.spark.dict.NGlobalDictionaryV2.loadBucketDictionary(NGlobalDictionaryV2.java:78)
>     at org.apache.kylin.engine.spark.builder.DFDictionaryBuilder.$anonfun$build$2(DFDictionaryBuilder.scala:98)
>     at org.apache.kylin.engine.spark.builder.DFDictionaryBuilder.$anonfun$build$2$adapted(DFDictionaryBuilder.scala:94)
>     at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
>     at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
>     at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2257)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1469)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
> From HDFS, each file is around 8 MB:
>
> [image: HDFS file listing]
>
> This job is roughly at the 200-billion-row scale. The same job at the
> tens-of-millions scale does not hit this problem, but with the large data
> volume this Premature EOF error keeps occurring. One explanation I found
> after googling is the following:
>
> Premature EOF can occur due to multiple reasons, one of which is the spawning
> of a huge number of threads to write to disk on one reducer node using
> FileOutputCommitter. The MultipleOutputs class allows you to write to files
> with custom names and, to accomplish that, it spawns one thread per file and
> binds a port to it to write to the disk. Now this puts a limitation on the
> number of files that can be written to at one reducer node. I encountered
> this error when the number of files crossed roughly 12000 on one reducer
> node, as the threads got killed and the _temporary folder got deleted,
> leading to a plethora of these exception messages. My guess is that this is
> not a memory overshoot issue, nor could it be solved by allowing the Hadoop
> engine to spawn more threads. Reducing the number of files being written at
> one time at one node solved my problem - either by reducing the actual
> number of files being written, or by increasing reducer nodes.
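(Not from the original thread.) As a general illustration of the remedy the quoted explanation describes, i.e. writing fewer files at one time on one node, a plain Spark job can coalesce its output before writing. In Kylin's dictionary build the equivalent lever is the bucket-size setting above rather than user code; the input and output paths below are hypothetical.

import org.apache.spark.sql.SparkSession

object CoalesceBeforeWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-example").getOrCreate()

    // Hypothetical input: a dataset that would otherwise be written as many small files.
    val df = spark.read.parquet("hdfs:///tmp/distinct_values")

    // Fewer partitions means fewer output files (and fewer concurrent writer
    // threads) per node, which is the general fix the quoted explanation suggests.
    df.coalesce(200)
      .write
      .mode("overwrite")
      .parquet("hdfs:///tmp/distinct_values_compacted")

    spark.stop()
  }
}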