You can refer to the following articles -->
https://cwiki.apache.org/confluence/display/KYLIN/Global+Dictionary+on+Spark


On Wed, Dec 20, 2023 at 1:51 PM Li, Can <c...@ebay.com.invalid> wrote:

> Thanks. I have a question about that why does kylin5 uses the ‘PREV_’ and
> ‘CURR_’ prefix to name global dictionary files. When does service write and
> read from the ‘PREV_’ suffix files?
>
> 发件人: JunQing Cai <caicai2...@gmail.com>
> 日期: 星期三, 2023年12月20日 11:28
> 收件人: dev@kylin.apache.org <dev@kylin.apache.org>
> 主题: Re: Premature EOF Build global dictionary
> External Email
>
> The configurations mentioned are all at the model-level. You can only
> modify the configuration of a certain model and do not need to restart
> kylin.
>
> On Wed, Dec 20, 2023 at 11:19 AM MINGMING GE <7mmi...@gmail.com> wrote:
>
> > Modifying these parameters can only reduce the number of files, but it
> > cannot avoid the current situation of a huge number of files. You may
> need
> > to delete the dictionary and modify the parameters before rebuilding.
> >
> > On Wed, Dec 20, 2023 at 11:16 AM MINGMING GE <7mmi...@gmail.com> wrote:
> >
> > > Need to increase the value of the following parameters prevents the
> > > dictionary bucket from becoming larger
> > > kylin.dictionary.globalV2-threshold-bucket-size=500000
> > > kylin.dictionary.globalV2-init-load-factor=0.5
> > > kylin.dictionary.globalV2-bucket-overhead-factor=1.5
> > >
> > > It is also recommended to synchronize the code and use the global
> > > dictionary V3 version. You will find that the performance will be
> greatly
> > > improved.
> > >
> > > On Wed, Dec 20, 2023 at 10:47 AM Li, Can <c...@ebay.com.invalid>
> wrote:
> > >
> > >> 在添加count_distinct measure生成global dictionary
> > >> 的时候,每个字典文件的大小是否固定,这一块能不能修改生成的文件大小,我看了生成的文件好像每个文件大小都在8M左右。我们现在有一个job
> > >> 数据量比较大千亿级别的数据,这样在生成字典的时候写的文件数量非常的多导致一直报错出现Premature EOF
> > >>
> > >>
> > >>
> > >> 2023-12-18T20:05:43,304 INFO  [logger-thread-0]
> scheduler.DAGScheduler :
> > >> ResultStage 24 (foreachPartition at DFDictionaryBuilder.scala:94)
> > failed in
> > >> 36.866 s due to Job aborted due to stage failure: Task 1560 in stage
> > 24.0
> > >> failed 4 times, most recent failure: Lost task 1560.3 in stage 24.0
> (TID
> > >> 1928) (hdc42-mcc10-01-0510-3303-067-tess0097.stratus.rno.ebay.com
> > >> executor 25): java.io.IOException: Premature EOF from inputStream
> > >>
> > >>        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:204)
> > >>
> > >>        at
> > >>
> >
> org.apache.spark.dict.NGlobalDictHDFSStore.getBucketDict(NGlobalDictHDFSStore.java:177)
> > >>
> > >>        at
> > >>
> >
> org.apache.spark.dict.NGlobalDictHDFSStore.getBucketDict(NGlobalDictHDFSStore.java:162)
> > >>
> > >>        at
> > >>
> > org.apache.spark.dict.NBucketDictionary.<init>(NBucketDictionary.java:50)
> > >>
> > >>        at
> > >>
> >
> org.apache.spark.dict.NGlobalDictionaryV2.loadBucketDictionary(NGlobalDictionaryV2.java:78)
> > >>
> > >>        at
> > >>
> >
> org.apache.kylin.engine.spark.builder.DFDictionaryBuilder.$anonfun$build$2(DFDictionaryBuilder.scala:98)
> > >>
> > >>        at
> > >>
> >
> org.apache.kylin.engine.spark.builder.DFDictionaryBuilder.$anonfun$build$2$adapted(DFDictionaryBuilder.scala:94)
> > >>
> > >>        at
> > >> org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
> > >>
> > >>        at
> > >>
> >
> org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
> > >>
> > >>        at
> > >>
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2257)
> > >>
> > >>        at
> > >> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > >>
> > >>        at org.apache.spark.scheduler.Task.run(Task.scala:131)
> > >>
> > >>        at
> > >>
> >
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
> > >>
> > >>        at
> > >> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1469)
> > >>
> > >>        at
> > >> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
> > >>
> > >>        at
> > >>
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >>
> > >>        at
> > >>
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >>
> > >>        at java.lang.Thread.run(Thread.java:748)
> > >>
> > >>
> > >>
> > >> 从hdfs上看每个文件在8M左右
> > >>
> > >> [image: 图片包含 淋浴, 绿色, 窗户, 大 描述已自动生成]
> > >>
> > >>
> > >>
> > >> 这个job数据量大概在2千亿行级别,同样的job千万级别的不会出现这个问题,但是数据量大的情况下一直出现这个Premature
> EOF错误,我在
> > >> google后给的一种解释如下:
> > >>
> > >>
> > >>
> > >> Premature EOF can occur due to multiple reasons, one of which is
> > spawning
> > >> of huge number of threads to write to disk on one reducer node using
> > >> FileOutputCommitter. MultipleOutputs class allows you to write to
> files
> > >> with custom names and to accomplish that, it spawns one thread per
> file
> > and
> > >> binds a port to it to write to the disk. Now this puts a limitation on
> > the
> > >> number of files that could be written to at one reducer node. I
> > encountered
> > >> this error when the number of files crossed 12000 roughly on one
> reducer
> > >> node, as the threads got killed and the _temporary folder got deleted
> > >> leading to plethora of these exception messages. My guess is - this is
> > not
> > >> a memory overshoot issue, nor it could be solved by allowing hadoop
> > engine
> > >> to spawn more threads. Reducing the number of files being written at
> one
> > >> time at one node solved my problem - either by reducing the actual
> > number
> > >> of files being written, or by increasing reducer nodes.
> > >>
> > >
> >
>

Reply via email to