When a count_distinct measure triggers the global dictionary build, is the size of each generated dictionary file fixed, and is there a way to change it? From what I can see, every file is around 8 MB. We currently have a job with a very large data volume, on the order of hundreds of billions of rows, so the dictionary build writes a very large number of files and keeps failing with a Premature EOF error (full stack trace below).
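
For context, here is a minimal sketch of the build pattern that the stack trace below points at (the foreachPartition in DFDictionaryBuilder.scala). This is not Kylin's actual code; the column name dict_value and the "write one bucket file" step are placeholders I made up, only to illustrate why the number of dictionary files tracks the bucket/partition count rather than a target file size:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Rough sketch only: distinct values are hash-partitioned into bucketCount
    // buckets and every partition persists its own dictionary file, so the
    // file count scales with bucketCount and the per-file size ends up being
    // roughly (total dictionary size / bucketCount).
    def buildBucketedDict(distinctValues: DataFrame, bucketCount: Int): Unit = {
      distinctValues
        .repartition(bucketCount, col("dict_value")) // one partition per bucket
        .rdd
        .foreachPartition { rows =>
          // In the real build this is roughly where loadBucketDictionary is
          // invoked (DFDictionaryBuilder.scala:94 in the trace); here only a
          // stand-in "write one bucket file" step:
          val bucketValues = rows.map(_.getString(0)).toArray
          println(s"would write one bucket file with ${bucketValues.length} values")
        }
    }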

2023-12-18T20:05:43,304 INFO  [logger-thread-0] scheduler.DAGScheduler : ResultStage 24 (foreachPartition at DFDictionaryBuilder.scala:94) failed in 36.866 s due to Job aborted due to stage failure: Task 1560 in stage 24.0 failed 4 times, most recent failure: Lost task 1560.3 in stage 24.0 (TID 1928) (hdc42-mcc10-01-0510-3303-067-tess0097.stratus.rno.ebay.com executor 25): java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:204)
        at org.apache.spark.dict.NGlobalDictHDFSStore.getBucketDict(NGlobalDictHDFSStore.java:177)
        at org.apache.spark.dict.NGlobalDictHDFSStore.getBucketDict(NGlobalDictHDFSStore.java:162)
        at org.apache.spark.dict.NBucketDictionary.<init>(NBucketDictionary.java:50)
        at org.apache.spark.dict.NGlobalDictionaryV2.loadBucketDictionary(NGlobalDictionaryV2.java:78)
        at org.apache.kylin.engine.spark.builder.DFDictionaryBuilder.$anonfun$build$2(DFDictionaryBuilder.scala:98)
        at org.apache.kylin.engine.spark.builder.DFDictionaryBuilder.$anonfun$build$2$adapted(DFDictionaryBuilder.scala:94)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2257)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1469)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Looking at HDFS, each file is around 8 MB:
[screenshot: HDFS listing of the dictionary files, each about 8 MB]
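
(For reference, an equivalent way to list the file sizes from code; the path below is a placeholder for the actual global dictionary directory under our working dir:)

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Placeholder path; substitute the real global dictionary directory.
    val dictDir = new Path("hdfs:///path/to/working_dir/dict/global_dict/DB.TABLE.COLUMN")
    val fs = FileSystem.get(new Configuration())

    // Print every dictionary file with its size in MB.
    fs.listStatus(dictDir).foreach { status =>
      val sizeMb = status.getLen / (1024.0 * 1024.0)
      println(f"${status.getPath.getName}%-50s $sizeMb%10.2f MB")
    }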

This job is at roughly the 200-billion-row scale. The same job at the tens-of-millions-of-rows scale does not hit this problem, but at this data volume the Premature EOF error keeps occurring. One explanation I found via Google is quoted below:

Premature EOF can occur due to multiple reasons, one of which is the spawning of a huge number of threads to write to disk on one reducer node using FileOutputCommitter. The MultipleOutputs class allows you to write to files with custom names and, to accomplish that, it spawns one thread per file and binds a port to it to write to the disk. Now this puts a limitation on the number of files that could be written to at one reducer node. I encountered this error when the number of files crossed roughly 12000 on one reducer node, as the threads got killed and the _temporary folder got deleted, leading to a plethora of these exception messages. My guess is - this is not a memory overshoot issue, nor could it be solved by allowing the hadoop engine to spawn more threads. Reducing the number of files being written at one time at one node solved my problem - either by reducing the actual number of files being written, or by increasing reducer nodes.
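
If that explanation applies here, the fix would amount to writing fewer files at once. Below is a generic sketch of the Spark-side levers; whether Kylin exposes the bucket count (and therefore the dictionary file count/size) as a configuration is exactly what I am asking about, so this is not a claim about Kylin's API, and the config values and output path are placeholders:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Generic Spark levers for "write fewer files at one time on one node";
    // the values below are examples, not recommendations.
    val spark = SparkSession.builder()
      .appName("dict-build")
      // Fewer concurrent tasks per executor => fewer files being written at
      // the same time on a single node.
      .config("spark.executor.cores", "2")
      .getOrCreate()

    def writeFewerLargerFiles(distinctValues: DataFrame, targetFiles: Int): Unit = {
      // Collapsing to fewer partitions before writing produces fewer, larger
      // files at the cost of bigger tasks.
      distinctValues
        .coalesce(targetFiles)
        .write
        .mode("overwrite")
        .parquet("hdfs:///tmp/dict_distinct_values") // placeholder output path
    }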
