zhao jintao created KYLIN-3845:
----------------------------------
Summary: Kylin build error if the Kafka data source lacks selected
dimensions or metrics in the Kylin streaming build.
Key: KYLIN-3845
URL: https://issues.apache.org/jira/browse/KYLIN-3845
Project: Kylin
Issue Type: Bug
Components: Job Engine
Affects Versions: v2.5.2
Environment: Fusion Insight
Reporter: zhao jintao
Fix For: Future
Hi dear team,
I'm developing an OLAP platform based on Kylin 2.5.2. During my work, I built a
streaming cube from a Kafka source using the Kafka demo.
In my streaming project, I set country and currency as dimensions and userId as
a metric. But the cube build failed in the 3rd step ("Extract Fact Table Distinct
Columns"). The exception is java.lang.ArrayIndexOutOfBoundsException.
Here are the logs:
2019-03-02 14:21:01,492 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do
cleanup, available memory: 1334m
2019-03-02 14:21:01,492 INFO [main] org.apache.kylin.engine.mr.KylinReducer:
Total rows: 127
2019-03-02 14:21:01,492 INFO [main] org.apache.hadoop.mapred.MapTask: Finished
spill 0
2019-03-02 14:21:01,492 INFO [main] org.apache.hadoop.mapred.YarnChild:
Exception running child: java.lang.ArrayIndexOutOfBoundsException:2
2019-03-02 14:21:01,492 INFO [main] org.apache.kylin.engine.mr.KylinReducer: Do
cleanup, available memory: 1334m
at org.apache.kylin.engine.mr.steps.FactDistinctColumnsMapper.doMap(FactDistinctColumnsMapper.java:177)
at org.apache.kylin.engine.mr.KylinMapper.map(KylinMapper.java:77)
at org.apache.hadoop.mapreduce.Mapper.run(MapperTask.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:187)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:180)
Then I found that in the Kafka data source, some streaming records lack the userId
column. Most of the streaming data (country, currency, userId) looks like
("China","CNY","843c4d"), but a small amount of data lacks userId and looks like
("China","CNY"). So when the 3rd step ("Extract Fact Table Distinct
Columns") runs, the MR engine throws the exception if a streaming record lacks userId.
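To make the failure easier to see outside of a cluster, here is a minimal sketch that imitates the lookup with plain arrays. The class name ShortRowDemo, the method failsOnRow, and the fixed index array {0, 1, 2} are assumptions made for illustration only; they are not Kylin code.

```java
final class ShortRowDemo {
    // Assumed column order for this demo: country, currency, userId.
    static final int[] COLUMN_INDEX = {0, 1, 2};

    // Returns true if reading every configured column from the row
    // throws ArrayIndexOutOfBoundsException.
    static boolean failsOnRow(String[] row) {
        try {
            for (int i = 0; i < COLUMN_INDEX.length; i++) {
                String fieldValue = row[COLUMN_INDEX[i]];
            }
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Complete record: all three columns are present.
        System.out.println(failsOnRow(new String[] {"China", "CNY", "843c4d"}));
        // Record without userId: index 2 is past the end of the array.
        System.out.println(failsOnRow(new String[] {"China", "CNY"}));
    }
}
```

With the complete record the lookup succeeds; with the two-field record the read at index 2 goes past the end of the array, which matches the "ArrayIndexOutOfBoundsException:2" in the log above.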
Then I checked the Kylin source, FactDistinctColumnsMapper.java:
public void doMap(KEYIN key, Object record, Context context) throws IOException, InterruptedException {
    Collection<String[]> rowCollection = flatTableInputFormat.parseMapperInput(record);
    for (String[] row : rowCollection) {
        context.getCounter(RawDataCounter.BYTES).increment(countSizeInBytes(row));
        for (int i = 0; i < allCols.size(); i++) {
            String fieldValue = row[columnIndex[i]];
            if (fieldValue == null)
                continue;
            final DataType type = allCols.get(i).getType();
            ...
I found that columnIndex[i] is equal to the length of row when the streaming record
lacks one column, so row[columnIndex[i]] throws the
ArrayIndexOutOfBoundsException. So I changed this code to compare columnIndex[i]
with the length of row. If columnIndex[i] is equal to or larger than the length of
row, I set fieldValue to an empty value. After this change, the 3rd
step ("Extract Fact Table Distinct Columns") runs successfully.
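The bounds check described above could look roughly like the following sketch. It is not the actual patch: the helper fieldOrEmpty and the class SafeFieldAccess are hypothetical names, and the arrays in main are sample data standing in for the mapper's row and columnIndex.

```java
final class SafeFieldAccess {
    // Hypothetical helper, not part of Kylin: returns the field at `index`,
    // or an empty string when the row is too short to contain it.
    static String fieldOrEmpty(String[] row, int index) {
        if (index < 0 || index >= row.length) {
            return "";
        }
        return row[index];
    }

    public static void main(String[] args) {
        int[] columnIndex = {0, 1, 2};        // assumed order: country, currency, userId
        String[] shortRow = {"China", "CNY"}; // record without userId
        for (int i = 0; i < columnIndex.length; i++) {
            // Replaces the direct read row[columnIndex[i]] from the mapper loop.
            String fieldValue = fieldOrEmpty(shortRow, columnIndex[i]);
            System.out.println(i + " -> [" + fieldValue + "]");
        }
    }
}
```

One design question for reviewers: returning null instead of an empty string would fall into the existing "fieldValue == null" check in doMap and skip the column for that row, which may or may not be the preferred behavior.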
This is what I found; it may cause problems for other developers.
What do you think?
Best regards
jintao
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)