Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-26 Thread JiaTao Tao
You are welcome, ShaoFeng! Storage and query engines are inseparable and
should be designed together so that each can fully exploit the other's
abilities. I'm very excited about the upcoming columnar storage and query
engine!


-- 


Regards!

Aron Tao


ShaoFeng Shi wrote on Fri, Oct 26, 2018 at 10:28 PM:

> Exactly; thank you, JiaTao, for the comments!
>
> [The rest of the quoted thread is identical to ShaoFeng Shi's message
> "Re: [DISCUSS] Columnar storage engine for Apache Kylin" later in this
> digest and is trimmed here.]

Re: Apache Kylin

2018-10-26 Thread ShaoFeng Shi
Hi Goran,

I do know some users are using Kylin to replace Cognos Cubes, getting big
improvements in terms of performance, capability, flexibility, and others.
Usually, that is a total solution with a set of tools and professional
services from vendors. If you're interested, I can find someone to connect
with you.

For the second question, what do you mean by "deploy Kylin cube to end
user"? Export the cube to the desktop for offline analysis? The cube lives
in Hadoop/HBase; Kylin exposes query interfaces (JDBC, ODBC, REST API) to
the customer, so they can connect from their analytics tools remotely. Hope
this helps.

Goran Čekol wrote on Fri, Oct 26, 2018 at 9:00 PM:

> Hello,
>
> At the moment I'm working in a Business Intelligence (BI) department and
> we are working with IBM Cognos products (reporting; MOLAP, ROLAP, and
> dynamic cubes). We also have a big data department that is based on
> Cloudera.
> I am interested in the Apache Kylin product and I'd like to know if there
> is a possibility to integrate Apache Kylin with IBM Cognos.
> Can you tell me what type of support you are offering for the open source
> product, if any?
> Also, is there a way to deploy a Kylin cube to end customers so that they
> can use a drag-and-drop technique?
>
> Thank you in advance!
>
> Best regards,
> Goran Cekol
>
>
>
>
> - Disclaimer -
> This e-mail message and its attachments may contain privileged and/or
> confidential information. Please do not read the message if You are not its
> designated recipient. If You have received this message by mistake, please
> inform its sender and destroy the original message and its attachments
> without reading or storing of any kind. Any unauthorized use, distribution,
> reproduction or publication of this message is forbidden. PBZ d.d. is
> neither responsible for the contents of this message, nor for the
> consequences arising from actions based on the forwarded information, nor
> do opinions contained within this message necessarily reflect the official
> opinions of   PBZ d.d.. Considering the lack of complete security of e-mail
> communication, PBZ d.d. is not responsible for the potential damage created
> due to infection of an e-mail message with a virus or other malicious
> program, unauthorized interference, erroneous or delayed delivery of the
> message due to technical problems. PBZ d.d. reserves the right to supervise
> and store both incoming and outgoing  e-mail messages.
>
>
>
>

-- 
Best regards,

Shaofeng Shi 史少锋


Re: does Apache Kylin need a Apache Derby or Mysql for run the sample cube

2018-10-26 Thread BOT
Hi ebrahim zare:
Derby or MySQL is needed, but not for Kylin directly. They are actually used
to store the metadata of Hadoop's components (for example, Hive). For Kylin,
both metadata and cube data are stored in HBase, so you should configure
HBase and HDFS correctly.


Besides, you should use apache-kylin-2.5.0-bin-hadoop3.tar.gz or
apache-kylin-2.5.0-bin-cdh60.tar.gz for Hadoop 3, and HBase 2.0 is required.
I recommend using HDP 3.0 or CDH 6.0 instead of community Hadoop 3, because
they make it easy to set up a Hadoop 3 cluster with fewer environment
problems.


Best Regards


Lijun Cao
-- Original --
From: ebrahim zare 
Date: Fri,Oct 26,2018 9:00 PM
To: dev 
Subject: Re: does Apache Kylin need a Apache Derby or Mysql for run the sample 
cube



I installed Java and Hadoop and Hbase and Hive and Spark and Kylin.
hadoop-3.0.3

hbase-1.2.6

apache-hive-2.3.3-bin

spark-2.2.2-bin-without-hadoop

apache-kylin-2.3.1-bin

I would be grateful if someone could help me with the installation and
configuration of Kylin.

Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-26 Thread ShaoFeng Shi
Exactly; thank you, JiaTao, for the comments!

JiaTao Tao wrote on Thu, Oct 25, 2018 at 6:12 PM:

> As far as I'm concerned, using Parquet as Kylin's storage format is quite
> appropriate. From the perspective of Spark integration, Spark has made
> many optimizations for Parquet; for example, we can benefit from Spark's
> vectorized reading and lazy dictionary decoding.
>
>
> And here are my thoughts about integrating Spark and our query engine. As
> Shaofeng mentioned, a cuboid is a Parquet file; you can think of it as a
> small table and read the cuboid as a DataFrame directly, which Spark can
> then query, a bit like this:
>
> ss.read.parquet("path/to/CuboidFile").filter("xxx").agg("xxx").select("xxx")
> (We need to implement some of Kylin's advanced aggregations; for Kylin's
> basic aggregations like sum/min/max, we can use Spark's directly.)
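To make the filter → aggregate → project chain concrete without needing a
Spark installation, here is a minimal pure-Python sketch of the same
pipeline over an in-memory "cuboid" (a pre-aggregated table of dimension
and measure columns). The column names and data are invented for
illustration; the real code would be the `ss.read.parquet(...)` DataFrame
chain quoted above.

```python
from collections import defaultdict

# A hypothetical cuboid on dimensions (country, year) with one measure,
# "price" -- in reality this would be the rows of a cuboid Parquet file.
cuboid = [
    {"country": "CN", "year": 2018, "price": 10.0},
    {"country": "CN", "year": 2018, "price": 5.0},
    {"country": "US", "year": 2018, "price": 7.0},
    {"country": "US", "year": 2017, "price": 3.0},
]

def query(rows, predicate, group_key, measure):
    """Mirror the DataFrame chain: filter -> aggregate(sum) -> project."""
    groups = defaultdict(float)
    for row in rows:
        if predicate(row):                          # .filter("year = 2018")
            groups[row[group_key]] += row[measure]  # .agg(sum("price"))
    return dict(groups)                             # .select(group_key, sum)

result = query(cuboid, lambda r: r["year"] == 2018, "country", "price")
print(result)  # {'CN': 15.0, 'US': 7.0}
```

Kylin's advanced measures (e.g. count-distinct or top-N) would plug in as
custom aggregation functions at the `groups[...]` step.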
>
>
>
> *Compared to our old query engine, the advantages are as follows:*
>
> 1. It is distributed! Our old query engine pulls all the data onto a
> single query node and then computes there; that node is a single point of
> failure and often hits OOM on large data volumes.
>
> 2. It is simple and easy to debug (every step is clear and transparent):
> you can collect data after every single phase (filter/aggregation/
> projection, etc.), so you can easily see which operation or phase went
> wrong. Our old query engine uses Calcite for post-calculation, which makes
> pinpointing problems difficult, especially when code generation is
> involved, and you cannot insert your own logic during computation.
>
> 3. We can fully benefit from all the optimization work done in Spark,
> e.g. Catalyst and Tungsten.
>
> 4. It is easy to unit test: you can test every step separately, which
> reduces the testing granularity of Kylin's query engine.
>
> 5. Thanks to Spark's DataSource API, we can switch from Parquet to other
> data formats easily.
>
> 6. Many tools built on top of Spark, such as machine learning libraries,
> can be integrated with us directly.
>
> ==
>
> ==
>
>  Hi Kylin developers,
>
> HBase has been Kylin's storage engine since the first day; Kylin on HBase
> has proven a success, supporting low-latency and high-concurrency queries
> at a very large data scale. Thanks to HBase, most Kylin users get query
> responses of less than one second on average.
>
> But we also see some limitations when putting Cubes into HBase; I shared
> some of them at HBaseCon Asia 2018[1] this August. The typical limitations
> include:
>
>- The row key is the primary index, and there is no secondary index so far.
>
> Filtering by the row key's prefix versus its suffix can give very
> different performance. The user therefore needs a good row key design;
> otherwise, queries will be slow. This is sometimes difficult because the
> user might not be able to predict the filtering patterns ahead of cube
> design.
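The prefix-versus-suffix point can be illustrated with a small sketch (not
Kylin or HBase code; the keys are invented): because a region stores keys
sorted, a prefix filter can seek straight to the matching range, while a
suffix filter has to examine every key.

```python
import bisect

# Sorted "row keys", the way an HBase region stores them (invented data).
keys = ["CN-2017", "CN-2018", "DE-2018", "US-2017", "US-2018"]

def prefix_scan(keys, prefix):
    """Seek to the first key >= prefix, then read while the prefix matches:
    cost is proportional to the size of the result range."""
    i = bisect.bisect_left(keys, prefix)
    out = []
    while i < len(keys) and keys[i].startswith(prefix):
        out.append(keys[i])
        i += 1
    return out

def suffix_scan(keys, suffix):
    """The sort order is useless for suffixes: a full scan of all keys."""
    return [k for k in keys if k.endswith(suffix)]

print(prefix_scan(keys, "US"))    # touches only the 'US-*' range
print(suffix_scan(keys, "2018"))  # touches every key in the table
```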
>
>
>
>- HBase is a key-value store, not a columnar storage.
>
> Kylin combines multiple measures (columns) into fewer column families to
> keep the data size small (the row key overhead is significant). As a
> result, HBase often needs to read more data than was requested.
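As a toy illustration of that cost (invented data, not HBase internals):
when measures are packed together per row, answering a query over one
measure still touches every measure, whereas a columnar layout touches only
the requested column.

```python
# Row/key-value layout: measures packed together, so reading one measure
# still deserializes all three values in every row.
rows = [
    {"sum_price": 10.0, "cnt": 3, "max_price": 6.0},
    {"sum_price": 7.0,  "cnt": 2, "max_price": 4.0},
]

# Columnar layout: one contiguous array per measure.
columns = {
    "sum_price": [10.0, 7.0],
    "cnt":       [3, 2],
    "max_price": [6.0, 4.0],
}

# Values touched to answer: SELECT sum(sum_price)
row_store_touched = sum(len(r) for r in rows)   # all 6 stored values
columnar_touched  = len(columns["sum_price"])   # only the 2 needed values
print(row_store_touched, columnar_touched)      # 6 2
```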
>
>
>
>- HBase can't run on YARN.
>
> This makes deployment and auto-scaling a little complicated, especially
> in the cloud.
>
> In short, HBase is a complicated storage for Kylin, and its maintenance
> and debugging are hard for ordinary developers. Now we're planning to
> seek a simple, lightweight, read-only storage engine for Kylin. The new
> solution should have the following characteristics:
>
>
>
>- Columnar layout with compression for efficient I/O;
>
>- An index on each column for quick filtering and seeking;
>
>- A MapReduce / Spark API for parallel processing;
>
>- HDFS compliance for scalability and availability;
>
>- Maturity, stability, and extensibility.
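The "index on each column" requirement can be met with something as simple
as per-block min/max statistics, which is roughly what Parquet row-group
statistics provide: blocks whose [min, max] range cannot contain the filter
value are skipped without being decoded. The block contents below are
invented for illustration.

```python
# Each block: (min, max, values). A reader checks the statistics first and
# decodes only the blocks that might contain matching values.
blocks = [
    (1, 10,  [3, 7, 10, 1]),
    (12, 20, [12, 15, 20]),
    (25, 40, [30, 25, 40]),
]

def scan_eq(blocks, target):
    """Point lookup with block skipping via min/max statistics."""
    hits, decoded = [], 0
    for lo, hi, values in blocks:
        if lo <= target <= hi:      # statistics say "maybe in this block"
            decoded += 1
            hits += [v for v in values if v == target]
        # otherwise: the block is skipped without being read
    return hits, decoded

hits, decoded = scan_eq(blocks, 15)
print(hits, decoded)  # [15] 1  -- only 1 of 3 blocks decoded
```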
>
>
>
> With the plugin architecture[2] introduced in Kylin 1.5, adding multiple
> storages to Kylin is possible. Some companies, such as Kyligence Inc and
> Meituan.com, have developed their own customized storage engines for
> Kylin in their products or platforms. In their experience, columnar
> storage is a good supplement to the HBase engine. Kaisen Kang from
> Meituan.com shared their KOD (Kylin on Druid) solution[3] at this
> August's Kylin meetup in Beijing.
>
> We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
> Parquet is a standard columnar file format and has been widely supported
> by many projects such as Hive, Impala, and Drill. Parquet is adding the
> page-level column index to support 

Re: does Apache Kylin need a Apache Derby or Mysql for run the sample cube

2018-10-26 Thread ShaoFeng Shi
No, Derby/MySQL is not needed; by default, Kylin uses HBase for metadata
persistence. You only need to make sure Hive/HBase/Hadoop are working fine.

I see you're using Hadoop 3; please note that only from 2.5 onward does
Kylin provide a binary package for Hadoop 3. Besides, we have only tested
it on HDP 3 and CDH 6, whose HBase is version 2.0. Please try to align your
Hadoop component versions with these releases.

ebrahim zare wrote on Fri, Oct 26, 2018 at 9:00 PM:

> I installed Java and Hadoop and Hbase and Hive and Spark and Kylin.
> hadoop-3.0.3
>
> hbase-1.2.6
>
> apache-hive-2.3.3-bin
>
> spark-2.2.2-bin-without-hadoop
>
> apache-kylin-2.3.1-bin
>
> I would be grateful if someone could help me with the installation and
> configuration of Kylin.
>


-- 
Best regards,

Shaofeng Shi 史少锋


Apache Kylin

2018-10-26 Thread Goran Čekol
Hello,

At the moment I'm working in a Business Intelligence (BI) department and we
are working with IBM Cognos products (reporting; MOLAP, ROLAP, and dynamic
cubes). We also have a big data department that is based on Cloudera.
I am interested in the Apache Kylin product and I'd like to know if there
is a possibility to integrate Apache Kylin with IBM Cognos.
Can you tell me what type of support you are offering for the open source
product, if any?
Also, is there a way to deploy a Kylin cube to end customers so that they
can use a drag-and-drop technique?

Thank you in advance!

Best regards,
Goran Cekol









does Apache Kylin need a Apache Derby or Mysql for run the sample cube

2018-10-26 Thread ebrahim zare
I installed Java and Hadoop and Hbase and Hive and Spark and Kylin.
hadoop-3.0.3

hbase-1.2.6

apache-hive-2.3.3-bin

spark-2.2.2-bin-without-hadoop

apache-kylin-2.3.1-bin

I would be grateful if someone could help me with the installation and
configuration of Kylin.


Re: How to assign namespace of hbase in kylin

2018-10-26 Thread Lijun Cao
Hi Scott Fan:

You can find the namespace configuration in $KYLIN_HOME/conf/kylin.properties.

(The original message included a screenshot of the relevant
kylin.properties entry; the image is not preserved in this archive.)
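Since the screenshot did not survive, here is a sketch of what the entry in
$KYLIN_HOME/conf/kylin.properties looks like. The property name
`kylin.storage.hbase.namespace` and the namespace value are given from
memory as an assumption; please verify the exact key against the
kylin.properties template shipped with your Kylin release.

```properties
# Assumed property name -- verify against your Kylin version's template.
# Store Kylin's HTables under a non-default HBase namespace:
kylin.storage.hbase.namespace=KYLIN_NS
```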


Best Regards

Lijun Cao

> On Oct 26, 2018, at 16:36, Scott Fan wrote:
> 
> Hi,
> 
> How can I assign the HBase namespace? I don't want to use HBase's default
> namespace.
> 
> Thanks



How to assign namespace of hbase in kylin

2018-10-26 Thread Scott Fan
Hi,

How can I assign the HBase namespace? I don't want to use HBase's default
namespace.

Thanks

[jira] [Created] (KYLIN-3650) support for hive table partitioned on separate columns year,month and day

2018-10-26 Thread dipesh (JIRA)
dipesh created KYLIN-3650:
-

 Summary: support for hive table partitioned on separate columns 
year,month and day
 Key: KYLIN-3650
 URL: https://issues.apache.org/jira/browse/KYLIN-3650
 Project: Kylin
  Issue Type: Improvement
  Components: Metadata
Reporter: dipesh


Add partition support for Hive tables that use separate partition columns
for year, month, and day.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KYLIN-3649) segment region count and size are not correct when using mysql as Kylin metadata storage

2018-10-26 Thread Lingang Deng (JIRA)
Lingang Deng created KYLIN-3649:
---

 Summary: segment region count and size are not correct when using 
mysql  as Kylin metadata storage
 Key: KYLIN-3649
 URL: https://issues.apache.org/jira/browse/KYLIN-3649
 Project: Kylin
  Issue Type: Bug
  Components: Metadata
Affects Versions: v2.5.0
Reporter: Lingang Deng


As titled, the segment region count and size are not correct.
{code:java}
if ("hbase".equals(getConfig().getMetadataUrl().getScheme())) {
    try {
        logger.debug("Loading HTable info " + cubeName + ", " + tableName);

        // use reflection to isolate NoClassDef errors when HBase is not available
        hr = (HBaseResponse) Class.forName("org.apache.kylin.rest.service.HBaseInfoUtil")//
                .getMethod("getHBaseInfo", new Class[] { String.class, KylinConfig.class })//
                .invoke(null, tableName, this.getConfig());
    } catch (Throwable e) {
        throw new IOException(e);
    }
}
{code}
The check above is not valid when MySQL is used as Kylin's metadata
storage: the cube data still resides in HBase, but the branch is skipped
because the metadata URL scheme is not "hbase", so the HBase region
information is never loaded.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Unable to connect to Kylin Web UI

2018-10-26 Thread jenkinsliu
Thanks, I solved it.
We must choose the reserved port in VirtualBox, e.g. 6

--
Sent from: http://apache-kylin.74782.x6.nabble.com/