Rana's work shows a clear user requirement (@Xikui pay attention :-)) --
we need two forms of parallelism hint: one that does what we currently
do, which is to widen the parallelism AFTER reading from storage at the
first opportunity to do so, and another that widens it IMMEDIATELY
(somehow :-)). The latter is clearly what Rana would ideally have been
able to make use of, so she wouldn't have to change the data layout to
get more parallelism. Food for thought.
On 1/28/18 8:29 PM, Murtadha Hubail wrote:
If reloading the data isn’t too much trouble, the first thing I would
do is recreate the instance with more partitions (e.g. a partition per
core or per 2 cores) and check the core utilization. If this is the
same dataset as the one in your previous email, you mentioned that it
was about 10GB per partition. In that case, you might want to allocate
at least 40GB for the buffer cache, and you can reduce
storage.memorycomponent.globalbudget to get enough memory to execute
the job (depending on the number of partitions you create). After
recreating with a higher number of partitions, don’t use “SET
`compiler.parallelism` "39"”; the query will automatically use the
number of partitions you create.
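If I recall the cc.conf layout correctly, the partition count follows
the iodevices list in each NC section, so a sketch like the following
(the node name and paths here are hypothetical) would give 8 partitions
on one node:

    [nc/asterix_nc1]
    iodevices=/data/p0,/data/p1,/data/p2,/data/p3,/data/p4,/data/p5,/data/p6,/data/p7

Each listed directory becomes one storage partition, ideally one per
core (or per 2 cores) as suggested above.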
Regarding the metrics time: it includes the result printing time, so
if you want to see whether printing has any impact, try adding “limit
1” at the end of your query or changing it to select count(*) instead
of subject_id.
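For example, the count(*) variant of your query (the alias name is
just illustrative) would be:

    USE mimiciii;
    SELECT COUNT(*) AS cnt
    FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
    WHERE E.ITEMID/*+bcast*/=I.ITEMID AND
          E.FLAG = 'abnormal' AND
          I.FLUID='Blood' AND
          I.LABEL='Haptoglobin';

That returns a single row, so the result transfer and printing cost is
negligible and what remains is essentially the query execution time.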
Cheers,
Murtadha
*From: *Rana Alotaibi <[email protected]>
*Date: *Monday, 29 January 2018 at 6:48 AM
*To: *<[email protected]>
*Cc: *<[email protected]>, <[email protected]>
*Subject: *Re: Hyracks Job Requirement Configuration
*- Do you see all cores being fully utilized during the query execution? *
I have noticed only 6 cores were utilized.
*- How much time does the query take right now and how do you measure
the query execution time? Do you wait for the result to be printed
somewhere (e.g. in the browser)?*
I'm using the HTTP APIs. The response is a JSON object that includes
the query execution time:
{ "status": "success",
"metrics": {
*"elapsedTime": "434.627299814s",
"executionTime": "434.626137977s",*
"resultCount": 4943,
"resultSize": 132293,-
"processedObjects": 46875
}
}
I ran the query 10 times and took the average, which is ~6 mins.
*- You mentioned that you have 4 partitions, how many physical hard
drives are they mapped to?*
One physical hard drive.
*- Also, increasing the sort/join memory doesn’t necessarily lead to
better performance. Have you tried changing these values to something
smaller and seeing the effects?*
Yes, I tried the following numbers:
1) sort-memory: 32MB, join-memory: 64MB
2) sort-memory: 64MB, join-memory: 128MB
3) sort-memory: 128MB, join-memory: 265MB
The execution time remained ~6-6.5 mins on average; I didn't see any
improvement. The configuration that I have now:
- compiler.parallelism: 39 // Only 6 cores were utilized
- storage.buffercache.size: 20GB
- storage.buffercache.pagesize: 1MB
Thanks,
Rana
On Sun, Jan 28, 2018 at 6:41 PM, Murtadha Hubail
<[email protected]> wrote:
I have a few questions, if you don’t mind:
Do you see all cores being fully utilized during the query execution?
How much time does the query take right now and how do you measure
the query execution time? Do you wait for the result to be printed
somewhere (e.g. in the browser)?
You mentioned that you have 4 partitions, how many physical hard
drives are they mapped to?
Also, increasing the sort/join memory doesn’t necessarily lead to
better performance. Have you tried changing these values to
something smaller and seeing the effects?
Cheers,
Murtadha
*From: *Rana Alotaibi <[email protected]>
*Date: *Monday, 29 January 2018 at 5:21 AM
*To: *<[email protected]>
*Cc: *<[email protected]>, <[email protected]>
*Subject: *Re: Hyracks Job Requirement Configuration
Thanks Murtadha! The problem is solved. However, increasing the
number of cores didn't help improve the performance of that query.
On Sun, Jan 28, 2018 at 5:05 PM, Murtadha Hubail
<[email protected]> wrote:
Hi Rana,
The memory used for query processing is automatically
calculated as follows:

    query memory = JVM Max Memory - storage.buffercache.size
                   - storage.memorycomponent.globalbudget
The documentation defaults for these parameters are outdated.
The default value for storage.buffercache.size is (JVM Max
Memory / 4) and it's the same for
storage.memorycomponent.globalbudget. Since your dataset is
already loaded, you could reduce the budget of
storage.memorycomponent.globalbudget. In addition, if I recall
correctly, your dataset size is way smaller than what's
allocated for the buffer cache, so you might want to reduce
the buffer cache budget. That should give you more than enough
memory to execute on 39 cores.
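To make that concrete (the numbers below are illustrative, not taken
from your instance): with a JVM max memory of 125GB and both
parameters left at their (JVM Max Memory / 4) defaults, query
processing would get 125 - 31.25 - 31.25 = 62.5GB. Setting, say, a
20GB buffer cache and a 5GB global budget in the configuration file
(under [common], if the ncservice docs table still matches) would
leave roughly 100GB for queries:

    [common]
    storage.buffercache.size=20GB
    storage.memorycomponent.globalbudget=5GB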
Cheers,
Murtadha
On 01/29/2018, 3:30 AM, "Mike Carey" <[email protected]> wrote:
+ dev
On 1/28/18 3:37 PM, Rana Alotaibi wrote:
> Hi all,
>
> I would like to make AsterixDB utilize all available CPU cores (39)
> that I have for the following query:
>
> USE mimiciii;
> SET `compiler.parallelism` "39";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT P.SUBJECT_ID
> FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
> WHERE E.ITEMID/*+bcast*/=I.ITEMID AND
> E.FLAG = 'abnormal' AND
> I.FLUID='Blood' AND
> I.LABEL='Haptoglobin'
>
>
> The total memory size that I have is 125GB (57GB for the AsterixDB
> buffer cache). By running the above query, I got the following error:
>
> "msg": "HYR0009: Job requirement (memory: 10705403904 bytes, CPU
> cores: 39) exceeds capacity (memory: 3258744832 bytes, CPU cores: 39)"
>
> How can I change this default capacity configuration? I'm looking
> into this page: https://asterixdb.apache.org/docs/0.9.2/ncservice.html.
> Could you please point me to the appropriate configuration parameter?
>
> Thanks
> -- Rana