Hi Gaspare,
Kylin assumes that a dimension table is small enough to fit in
memory, and therefore that the corresponding directory contains only one file.
As a workaround, you can merge these files into a single file so
that Kylin can read it.
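
For example, since your tables are produced by Spark, you can repartition
the RDD to a single partition before writing, so only one part file lands
under the table location. A minimal sketch, assuming Spark with Scala; the
input path, app name, and object name are placeholders, not from your setup:

import org.apache.spark.{SparkConf, SparkContext}

object MergeUsersTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merge-users-table"))
    // Placeholder input: wherever the users RDD currently comes from.
    val users = sc.textFile("/data/users_raw")
    users
      .coalesce(1)                    // collapse to one partition -> a single part-00000
      .saveAsTextFile("/data/users")  // point the external table's LOCATION here
    sc.stop()
  }
}

coalesce(1) funnels the write through a single task, which is acceptable here
because the dimension table has to fit in memory anyway. For data already on
HDFS, hdfs dfs -getmerge followed by re-uploading the concatenated file
achieves the same result. The zero-byte _SUCCESS marker can stay; as the
exception below shows, Kylin only counts non-zero files.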
<[email protected]>于2015年7月7日周二 下午6:42写道:
> Hi,
>
> I am trying to create a cube from a star schema built on Hive
> external tables (example below) stored as TEXTFILE (CSV).
>
> CREATE EXTERNAL TABLE IF NOT EXISTS USERS_TABLE (
> uid INT,
> name STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073' LINES TERMINATED BY '\012'
> STORED AS TEXTFILE
> LOCATION '/data/users';
>
>
> The CSV files are produced by Spark RDDs, so they are saved as part-xxxx files.
> Below is the HDFS listing:
>
> hdfs dfs -ls /data/users
> Found 12 items
> -rw-r--r-- 3 hdfs hdfs 0 2015-07-07 12:05 /data/users/_SUCCESS
> -rw-r--r-- 3 hdfs hdfs 3699360 2015-07-07 12:05 /data/users/part-00000
> -rw-r--r-- 3 hdfs hdfs 3694740 2015-07-07 12:05 /data/users/part-00001
> -rw-r--r-- 3 hdfs hdfs 3685374 2015-07-07 12:05 /data/users/part-00002
> -rw-r--r-- 3 hdfs hdfs 3719646 2015-07-07 12:05 /data/users/part-00003
> -rw-r--r-- 3 hdfs hdfs 3682476 2015-07-07 12:05 /data/users/part-00004
> -rw-r--r-- 3 hdfs hdfs 3679956 2015-07-07 12:05 /data/users/part-00005
> -rw-r--r-- 3 hdfs hdfs 3700242 2015-07-07 12:05 /data/users/part-00006
> -rw-r--r-- 3 hdfs hdfs 3672186 2015-07-07 12:05 /data/users/part-00007
> -rw-r--r-- 3 hdfs hdfs 3682350 2015-07-07 12:05 /data/users/part-00008
> -rw-r--r-- 3 hdfs hdfs 3680292 2015-07-07 12:05 /data/users/part-00009
> -rw-r--r-- 3 hdfs hdfs 3697722 2015-07-07 12:05 /data/users/part-00010
>
> The cube build job fails when trying to build the dimension dictionary,
> with the following exception (it seems that the Hive table's data directory
> MUST contain only one file):
>
> java.lang.IllegalStateException: Expect 1 and only 1 non-zero file under
> hdfs://gas.gfmintegration.it:8020/data/cdr/bb/dimensions/users, but find
> 11
> at
> org.apache.kylin.dict.lookup.HiveTable.findOnlyFile(HiveTable.java:123)
> at
> org.apache.kylin.dict.lookup.HiveTable.computeHDFSLocation(HiveTable.java:107)
> at
> org.apache.kylin.dict.lookup.HiveTable.getHDFSLocation(HiveTable.java:83)
> at
> org.apache.kylin.dict.lookup.HiveTable.getFileTable(HiveTable.java:76)
> at
> org.apache.kylin.dict.lookup.HiveTable.getSignature(HiveTable.java:71)
> at
> org.apache.kylin.dict.DictionaryManager.buildDictionary(DictionaryManager.java:164)
> at
> org.apache.kylin.cube.CubeManager.buildDictionary(CubeManager.java:154)
> at
> org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:53)
> at
> org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:42)
> at
> org.apache.kylin.job.hadoop.dict.CreateDictionaryJob.run(CreateDictionaryJob.java:53)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at
> org.apache.kylin.job.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:63)
> at
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
> at
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:50)
> at
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
> at
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:132)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> result code:2
>
>
> Do you have any guidance on how to create a proper Hive star schema for
> Kylin?
>
> I would like to use external tables (stored as CSV, Parquet files, or
> HBase) because I also need to process the same data from Spark.
>
> Thanks in advance.
>
> BR,
>
> -- gas
>