Hi Jiang,

I'm afraid our investigation has moved on for the moment, so I don't have the resources to set that up. If we end up coming back around, however, I will try it.
Thanks,
sam

On Fri, Feb 27, 2015 at 7:31 PM, 蒋旭 <[email protected]> wrote:
> Hi Sam,
>
> Could you try the pre-join on your test data set first? You can verify
> whether Kylin can meet your requirements on the test data set or not.
>
> If the pre-join solution works, we can add a "pre-join" option to the cube
> definition and automate it in the cube build engine. Then you can change
> the dimension data easily without impacting the cube build.
>
> Thanks
> Jiang Xu
>
> ------------------ Original Message ------------------
> *From:* Samuel Bock <[email protected]>
> *Sent:* 2015-02-28 02:26
> *To:* 蒋旭 <[email protected]>
> *Cc:* dev <[email protected]>
> *Subject:* Re: OutOfMemoryError on step #3 of Cube build
>
> While that might be possible when putting together a test dataset, the
> actual system will need to retain the ability to change dimension data
> easily. A prejoined table would make that significantly harder (among
> other things).
>
> thanks,
> Sam
>
>
> On Wed, Feb 25, 2015 at 4:38 PM, 蒋旭 <[email protected]> wrote:
>
> > As a workaround, could you prejoin the big dimension table with the
> > fact table in Hive? Then you can run Kylin on the prejoined table.
> >
> > We will do the optimization on the big dimension table later.
> >
> > Thanks
> > Jiang Xu
> >
> > ------------------ Original Message ------------------
> > *From:* Samuel Bock <[email protected]>
> > *Sent:* 2015-02-26 03:28
> > *To:* dev <[email protected]>
> > *Subject:* Re: OutOfMemoryError on step #3 of Cube build
> >
> > Thank you for the follow-up,
> >
> > Our dimension table is 25 million rows for our test data set, and would
> > be far larger in production. Given that, it sounds like our data doesn't
> > fit the Kylin use case.
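For context, the pre-join workaround suggested above could look roughly like this in Hive. This is only a sketch: the table and column names are taken from the cube JSON later in the thread, but the output table name and the exact column list are assumptions, not something proposed in the thread itself.

```sql
-- Hypothetical sketch of the suggested workaround: materialize the join in
-- Hive so Kylin sees a single flat table with no large lookup table.
-- FACTS/KEYWORDS and the key columns follow the cube definition in this
-- thread; facts_prejoined and the selected columns are assumptions.
CREATE TABLE facts_prejoined AS
SELECT f.*,
       k.PUBLISHER_GROUP_ID,
       k.PUBLISHER_CAMPAIGN_ID,
       k.PUBLISHER_ID
FROM   FACTS f
LEFT JOIN KEYWORDS k
       ON f.KEYWORD_DIM_ID = k.DIM_ID;
```

As the thread notes, this trades away easy dimension updates: any change to KEYWORDS requires rebuilding the prejoined table.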
> > I appreciate the assistance in tracking down the source of this issue,
> >
> > cheers,
> > sam
> >
> > On Tue, Feb 24, 2015 at 7:28 PM, Shi, Shaofeng <[email protected]> wrote:
> >
> > > Hi Samuel,
> > >
> > > Kylin only supports the star schema: only one fact table joined with
> > > multiple lookup tables. The lookup tables need to be small so that
> > > Kylin can read them into memory for the join and cube build. Also, as
> > > you found, Kylin will take a snapshot of the lookup tables and persist
> > > them in HBase; that should be the problem. In your case, how many rows
> > > are there in the KEYWORDS table?
> > >
> > > On 2/21/15, 2:12 AM, "Samuel Bock" <[email protected]> wrote:
> > >
> > > > Thank you for your response,
> > > >
> > > > I went into the code, and I'm fairly confident that I've isolated the
> > > > problem. The OutOfMemoryError is part of the dimension dictionary
> > > > step, but is not actually related to the dictionary itself (since, as
> > > > you mentioned, that is skipped when dictionary=false). The problem
> > > > arises from the second half of that step, in which it builds the
> > > > dimension table snapshot. Looking at the code, the process of
> > > > building the snapshot table loads the entire table into memory as
> > > > strings (SnapshotTable.takeSnapshot), then serializes that to an
> > > > in-memory ByteArrayOutputStream (ResourceStore.putResource), then
> > > > finally creates a copy of the internal byte array from the stream in
> > > > order to store it in HBase (HBaseResourceStore.checkAndPutResourceImpl).
> > > > That means there needs to be space for three in-memory copies of the
> > > > full dimension table. Given that even our test subset dimension table
> > > > is 25 million rows by 14 columns, that becomes problematic. From
> > > > experimentation, it breaks even with a 95 GB heap.
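The three-copy pattern described above can be sketched as follows. This is a simplified illustration, not Kylin's actual code: the class name is invented, and the per-string overhead (~40 bytes of object header plus 2 bytes per char, pre-compact-strings) and average cell width (20 chars) in the estimate are rough assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Simplified sketch of the snapshot path described above (illustrative only,
// not Kylin's code). Three copies of the table coexist on the heap:
//   1. the row data held as Strings,
//   2. the serialized bytes inside the ByteArrayOutputStream's buffer,
//   3. the defensive copy returned by toByteArray() that is handed to HBase.
public class SnapshotSketch {

    static byte[] snapshotToBytes(String[][] rows) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream(); // copy #2 grows here
        DataOutputStream out = new DataOutputStream(buf);
        for (String[] row : rows) {          // copy #1: the in-memory String table
            for (String cell : row) {
                out.writeUTF(cell);
            }
        }
        out.flush();
        return buf.toByteArray();            // copy #3: a fresh byte[] of the whole buffer
    }

    // Rough heap estimate for one copy of the reported table:
    // 25,399,061 rows x 14 cols x (~40 B String overhead + 2 B/char x 20 chars)
    // ~= 28 GB for the Strings alone; with three concurrent copies that lands
    // in the same ballpark as the ~95 GB heap reported to survive the build.
    static long estimateBytes(long rows, int cols, int avgChars) {
        long perString = 40 + 2L * avgChars;
        return rows * (long) cols * perString;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = snapshotToBytes(new String[][] {{"a", "b"}, {"c", "d"}});
        System.out.println(bytes.length);
        System.out.println(estimateBytes(25_399_061L, 14, 20) / (1L << 30)); // GiB, one copy
    }
}
```

The numbers are order-of-magnitude only, but they show why the failure is independent of the dictionary setting: the snapshot alone triples the table's in-memory footprint.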
> > > > For completeness, the log leading up to the crash (minus the
> > > > pointless zk messages) is:
> > > >
> > > > - Start to execute command:
> > > >   -cubename foo -segmentname FULL_BUILD -input
> > > >   /tmp/kylin-7d2b7588-17c0-4d80-9962-14ca63929186/foo/fact_distinct_columns
> > > > [QuartzScheduler_Worker-1]:[2015-02-19 22:59:01,284][INFO][com.kylinolap.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:57)]
> > > > - Building snapshot of KEYWORDS
> > > > [QuartzScheduler_Worker-2]:[2015-02-19 22:59:53,241][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
> > > > - 0 pending jobs
> > > > [QuartzScheduler_Worker-3]:[2015-02-19 23:00:53,252][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
> > > > - 0 pending jobs
> > > > [QuartzScheduler_Worker-1]:[2015-02-19 23:01:01,278][INFO][com.kylinolap.dict.lookup.FileTableReader.autoDetectDelim(FileTableReader.java:156)]
> > > > - Auto detect delim to be ' ', split line to 14 columns --
> > > >   1020_18768_4_127200_4647593_group_341686994 group 19510703 0 18768 1020
> > > >   341686994 4647593 371981 4 127200 CONTENT 2015-01-21 22:16:36.227246
> > > > [http-bio-7070-exec-8]:[2015-02-19 23:02:07,980][DEBUG][com.kylinolap.rest.service.AdminService.getConfigAsString(AdminService.java:91)]
> > > > - Get Kylin Runtime Config
> > > > [QuartzScheduler_Worker-4]:[2015-02-19 23:02:53,934][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
> > > > - 0 pending jobs
> > > > [QuartzScheduler_Worker-1]:[2015-02-19 23:03:10,216][DEBUG][com.kylinolap.common.persistence.ResourceStore.putResource(ResourceStore.java:166)]
> > > > - Saving resource
> > > >   /table_snapshot/part-00000.csv/f87954d5-fdfa-4903-9f82-771d85df6367.snapshot
> > > >   (Store kylin_metadata_qa@hbase)
> > > > [QuartzScheduler_Worker-6]:[2015-02-19 23:04:53,230][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
> > > > - 0 pending jobs
> > > > java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> > > > Dumping heap to java_pid3705.hprof ...
> > > >
> > > > The cube JSON is:
> > > >
> > > > {
> > > >   "uuid": "ba6105ca-a18d-4839-bed0-c89b86817110",
> > > >   "name": "foo",
> > > >   "description": "",
> > > >   "dimensions": [
> > > >     {
> > > >       "id": 1,
> > > >       "name": "KEYWORDS_DERIVED",
> > > >       "join": {
> > > >         "type": "left",
> > > >         "primary_key": ["DIM_ID"],
> > > >         "foreign_key": ["KEYWORD_DIM_ID"]
> > > >       },
> > > >       "hierarchy": null,
> > > >       "table": "KEYWORDS",
> > > >       "column": "{FK}",
> > > >       "datatype": null,
> > > >       "derived": [
> > > >         "PUBLISHER_GROUP_ID",
> > > >         "PUBLISHER_CAMPAIGN_ID",
> > > >         "PUBLISHER_ID"
> > > >       ]
> > > >     }
> > > >   ],
> > > >   "measures": [
> > > >     {
> > > >       "id": 1,
> > > >       "name": "_COUNT_",
> > > >       "function": {
> > > >         "expression": "COUNT",
> > > >         "parameter": { "type": "constant", "value": "1" },
> > > >         "returntype": "bigint"
> > > >       },
> > > >       "dependent_measure_ref": null
> > > >     },
> > > >     {
> > > >       "id": 2,
> > > >       "name": "CONVERSIONS",
> > > >       "function": {
> > > >         "expression": "SUM",
> > > >         "parameter": { "type": "column", "value": "CONVERSIONS" },
> > > >         "returntype": "bigint"
> > > >       },
> > > >       "dependent_measure_ref": null
> > > >     }
> > > >   ],
> > > >   "rowkey": {
> > > >     "rowkey_columns": [
> > > >       {
> > > >         "column": "KEYWORD_DIM_ID",
> > > >         "length": 0,
> > > >         "dictionary": "false",
> > > >         "mandatory": false
> > > >       }
> > > >     ],
> > > >     "aggregation_groups": [
> > > >       ["KEYWORD_DIM_ID"]
> > > >     ]
> > > >   },
> > > >   "signature": "T+aYTH/KlCwwmVAGRQR3hQ==",
> > > >   "capacity": "LARGE",
> > > >   "last_modified": 1424367558297,
> > > >   "fact_table": "FACTS",
> > > "null_string": null, > > > > "filter_condition": "KEYWORDS.PUBLISHER_GROUP_ID=386784931", > > > > "cube_partition_desc": { > > > > "partition_date_column": null, > > > > "partition_date_start": 0, > > > > "cube_partition_type": "APPEND" > > > > }, > > > > "hbase_mapping": { > > > > "column_family": [ > > > > { > > > > "name": "F1", > > > > "columns": [ > > > > { > > > > "qualifier": "M", > > > > "measure_refs": [ > > > > "_COUNT_", > > > > "CONVERSIONS" > > > > ] > > > > } > > > > ] > > > > } > > > > ] > > > > }, > > > > "notify_list": [ > > > > "sam" > > > > ] > > > >} > > > > > > > > > > > >Cheers, > > > >sam > > > > > > > >On Thu, Feb 19, 2015 at 9:49 PM, 周千昊 <[email protected]> wrote: > > > > > > > >> Also since you set the dictionary to false, there should not be any > > > >>memory > > > >> consuming while building dictionary. > > > >> So can you also give us the json description of the cube?(in the > cube > > > >>tab, > > > >> click the corresponding cube, click the json button) > > > >> > > > >> > > > >> On Fri Feb 20 2015 at 1:39:15 PM 周千昊 <[email protected]> wrote: > > > >> > > > >> > Hi, Samuel > > > >> > Can you give us some detail log, so we can dig into the root > > > >>cause > > > >> > > > > >> > On Fri Feb 20 2015 at 2:44:32 AM Samuel Bock < > > [email protected] > > > > > > > >> > wrote: > > > >> > > > > >> >> Hello all, > > > >> >> > > > >> >> We are in the process of evaluating Kylin for use as an OLAP > > engine. > > > >>To > > > >> >> that end, we are trying to get a minimum viable setup with a > > > >> >> representative > > > >> >> sample of our data in order to gather performance metrics. We > have > > > >>kylin > > > >> >> running against a 10 node cluster, the provided cubes build > > > >>successfully > > > >> >> and the system seems functional. 
> > > > > > > Attempting to build a simple cube against our data results in
> > > > > > > an OutOfMemoryError in the Kylin server process (so far we have
> > > > > > > given it up to a 46 GB heap). I was wondering if you could give
> > > > > > > me some guidance as to likely causes, and any configurations
> > > > > > > I'm likely to have missed, before I start diving into the
> > > > > > > source. I have changed the "dictionary" setting to false, as
> > > > > > > recommended for high-cardinality dimensions, but have not
> > > > > > > changed the configuration significantly apart from that.
> > > > > > >
> > > > > > > For reference, the sizes of the Hive tables we're building the
> > > > > > > cubes from:
> > > > > > > dimension table: 25,399,061 rows
> > > > > > > fact table: 270,940,921 rows
> > > > > > >
> > > > > > > (And as a note, there are no pertinent log messages except to
> > > > > > > indicate that it is in the Build Dimension Dictionary step.)
> > > > > > >
> > > > > > > Thank you,
> > > > > > > sam bock
