When a count_distinct measure is added and the global dictionary is built, is the size of each dictionary file fixed, and can it be changed? Looking at the generated files, each one appears to be around 8 MB (a rough sketch of how I checked is below). We currently have a job with a very large data volume, on the order of hundreds of billions of rows, so the dictionary build writes a huge number of files and keeps failing with a Premature EOF error.
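For reference, this is roughly how I listed the bucket file sizes with the Hadoop FileSystem API; the directory path and object name are placeholders for illustration, not the actual Kylin working-directory layout:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ListDictBucketFiles {
      def main(args: Array[String]): Unit = {
        // Placeholder path: point this at the global dictionary directory of the column.
        val dictDir = new Path("hdfs:///path/to/global_dict/COLUMN_NAME")
        val fs = FileSystem.get(dictDir.toUri, new Configuration())
        // Print each bucket file with its size in MB.
        fs.listStatus(dictDir).foreach { status =>
          println(f"${status.getLen / 1024.0 / 1024.0}%8.2f MB  ${status.getPath.getName}")
        }
      }
    }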
The stage failure looks like this:

2023-12-18T20:05:43,304 INFO [logger-thread-0] scheduler.DAGScheduler : ResultStage 24 (foreachPartition at DFDictionaryBuilder.scala:94) failed in 36.866 s due to Job aborted due to stage failure: Task 1560 in stage 24.0 failed 4 times, most recent failure: Lost task 1560.3 in stage 24.0 (TID 1928) (hdc42-mcc10-01-0510-3303-067-tess0097.stratus.rno.ebay.com executor 25): java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:204)
    at org.apache.spark.dict.NGlobalDictHDFSStore.getBucketDict(NGlobalDictHDFSStore.java:177)
    at org.apache.spark.dict.NGlobalDictHDFSStore.getBucketDict(NGlobalDictHDFSStore.java:162)
    at org.apache.spark.dict.NBucketDictionary.<init>(NBucketDictionary.java:50)
    at org.apache.spark.dict.NGlobalDictionaryV2.loadBucketDictionary(NGlobalDictionaryV2.java:78)
    at org.apache.kylin.engine.spark.builder.DFDictionaryBuilder.$anonfun$build$2(DFDictionaryBuilder.scala:98)
    at org.apache.kylin.engine.spark.builder.DFDictionaryBuilder.$anonfun$build$2$adapted(DFDictionaryBuilder.scala:94)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2257)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1469)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Looking at HDFS, each dictionary file is around 8 MB.

[screenshot: HDFS listing of the dictionary bucket files, each roughly 8 MB]

This job is at roughly the 200-billion-row level. The same job at the tens-of-millions-row level does not hit this problem, but at the larger data volume the Premature EOF error keeps occurring. One explanation I found after googling is the following:

"Premature EOF can occur due to multiple reasons, one of which is spawning of a huge number of threads to write to disk on one reducer node using FileOutputCommitter. The MultipleOutputs class allows you to write to files with custom names and, to accomplish that, it spawns one thread per file and binds a port to it to write to the disk. Now this puts a limitation on the number of files that can be written to on one reducer node. I encountered this error when the number of files crossed roughly 12000 on one reducer node, as the threads got killed and the _temporary folder got deleted, leading to a plethora of these exception messages. My guess is that this is not a memory overshoot issue, nor could it be solved by allowing the Hadoop engine to spawn more threads. Reducing the number of files being written at one time on one node solved my problem, either by reducing the actual number of files being written, or by increasing reducer nodes."
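If the issue really is too many bucket files being written concurrently on the same nodes, would capping the number of partitions that write at once be the right direction? A minimal Spark sketch of that idea follows, with made-up data and an illustrative partition count; this is not the actual Kylin DFDictionaryBuilder code path:

    import org.apache.spark.sql.SparkSession

    object CoalesceBeforeDictWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("coalesce-before-dict-write")
          .getOrCreate()

        // Stand-in for the distinct-value set that feeds the global dictionary.
        val distinctValues = spark.range(0L, 200000000L)
          .selectExpr("cast(id as string) as value")

        // Bound the number of partitions so each executor node keeps only a
        // limited number of dictionary bucket files open at the same time.
        val maxConcurrentWriters = 2000 // illustrative value, tune to the cluster
        distinctValues.rdd
          .coalesce(maxConcurrentWriters, shuffle = false)
          .foreachPartition { iter =>
            // In the real build this is where a bucket dictionary would be
            // encoded and persisted; here we just drain the iterator.
            iter.foreach(_ => ())
          }

        spark.stop()
      }
    }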