Guys,

Reminder: There's also an option to give a hint in the query to ignore a given index (you associate the hint with the predicate), which might be simpler for query re-execution sans indices. (Can someone on the dev list add a pointer to the place where this is documented, albeit poorly in all likelihood? I couldn't find it in a quick search just now, but I know it's hiding there...)
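From memory the usage looks roughly like the sketch below; the dataset and field names are taken from Magnus's DDL, and the hint name and placement ("skip-index", attached to the predicate) are my best recollection, so treat them as an assumption until the doc page turns up:

    use dataverse bigd;

    /* Ask the optimizer to ignore any secondary index for this predicate.
       The hint goes between the field access and the comparison operator,
       like the indexnl join hint; the literal 10 is arbitrary. */
    for $r in dataset posdata
    where $r.hourOfDay /*+ skip-index */ = 10
    return $r;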

Cheers,
Mike

On 5/30/16 9:25 PM, Pouria Pirzadeh wrote:
Hi Magnus,

Thanks for taking the time to gather this useful information and share it with us. I suspect the first performance pitfall you may be hitting is that the filters you are using on indexed fields match a large fraction of the records, as Sattam also pointed out.

1) Regarding the plans attached to the previous email, did you generate them before or after removing the indices? In the plan for Q2, I can still see the hour index (which you mentioned you dropped) being used at the bottom.

2) If it is not too painful for you, can you also try two additional sets of queries (their response times could be helpful):

  * Running a simple count query (without any filter) on the dataset
    (I just want to figure out the total time for a pure full scan).
  * Running those queries that do not benefit from secondary indices
    (those with similar times across different versions) without *any*
    index. This can be done by simply creating a new dataset without
    any secondary index, loading your data into it (you can either use
    bulk load to load from the raw data file(s), OR an "insert into"
    query to load from your existing dataset into the new one), and
    then running those queries against the new dataset; a rough sketch
    follows below. I just want to find out whether ignoring all the
    indices makes a difference or not.
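
Something along these lines should do it (just a rough sketch: "posdata_noindex" and "table1" are placeholder names, so adjust the type, dataset, and primary-key names to match your actual DDL, and run the statements as separate requests if your client requires it):

    use dataverse bigd;

    /* 1) Total time for a pure full scan of the existing dataset */
    count(for $r in dataset posdata return $r);

    /* 2) A copy of the dataset with no secondary indexes at all */
    create dataset posdata_noindex(table1)
    primary key uid;

    /* Populate it from the existing dataset ... */
    insert into dataset posdata_noindex (
        for $r in dataset posdata return $r
    );

    /* ... OR bulk load it from the raw ADM file(s) instead */
    load dataset posdata_noindex using localfs
    (("path"="path/to/file/all.adm"),("format"="adm"));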


Thanks.
Pouria


On Mon, May 30, 2016 at 6:08 AM, Magnus Kongshem <[email protected]> wrote:

    Attached you will find:

    - My DDL
    - My queries and their optimized logical plans

    From what I can interpret, the queries do not use any of the
    combined indexes. Queries two, three, and four use the "hour"
    index, which is based on the hourOfDay field. As Sattam states,
    indexes based on this field may not be optimal. I deleted that
    index as well as the "day" index and re-ran the queries. The
    result was that the queries used the "stamp" index instead, with
    the same poor execution time.

    I'm having some trouble creating a dataset with a filter. I keep
    getting:

    type mismatch: missing a required closed field uid:uuid [IllegalStateException]

    use dataverse bigd;

    create type table2 as closed {
        uid: uuid,
        campus: string,
        building: string,
        floor: string,
        timestamp: int32,
        dayOfWeek: int32,
        hourOfDay: int32,
        latitude: double,
        salt_timestamp: int32,
        longitude: double,
        id: string,
        accuracy: double
    };

    create dataset posdata3(table2)
    primary key uid autogenerated with filter on timestamp;

    load dataset posdata3 using localfs
    (("path"="path/to/file/all.adm"),("format"="adm"));

    BG,
    Magnus

    On Mon, May 30, 2016 at 10:01 AM, Magnus Kongshem
    <[email protected]> wrote:

        Thank you for your replies. I will follow up on your
        suggestions by dropping the hourOfDay and dayOfWeek indexes,
        sending you the optimized query plans, and trying out the
        AsterixDB filters.

        I'll be back.

        BG,
        Magnus

        On Sun, May 29, 2016 at 8:32 AM, Sattam Alsubaiee
        <[email protected]> wrote:

            Creating indexes on fields with high selectivities (such
            as hourOfDay and dayOfWeek), where each value matches a
            large fraction of the records, is not encouraged at all.
            Each secondary index lookup has to probe the primary
            index to fetch the other fields of the record, so it is
            much more efficient to just perform scans rather than
            access secondary indexes when querying such fields.

            I would recommend that you drop at least the following
            indexes:
            drop index posdata.hour;
            drop index posdata.day;

            Also, I would highly recommend that you utilize AsterixDB
            filters, which are a very good optimization (they could
            save up to 99% of query time) when you deal with
            time-correlated fields such as timestamps:
            https://asterixdb.apache.org/docs/0.8.8-incubating/aql/filters.html
            http://dl.acm.org/citation.cfm?id=2786007
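
            For instance, a filtered version of your dataset could be
            declared along these lines (just a sketch: "table1" is a
            guess at your record type name, and "uid"/"timestamp" come
            from the thread, so adjust to your actual DDL):

            create dataset posdata(table1)
            primary key uid with filter on timestamp;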

            Cheers,
            Sattam

            On Sun, May 29, 2016 at 8:58 AM, Michael Carey
            <[email protected]> wrote:

                @Pouria: Please share your findings here when you
                check this out - this is quite strange, since none of
                the other performance results that have been obtained
                on the system have looked anything like this.  (I will
                try to look at this too at some point, but will
                unfortunately be MIA from June 1-15 first.) Weird....


                On 5/26/16 9:20 AM, Pouria Pirzadeh wrote:
                Hi Magnus,

                Thanks for your email and for sharing the information.
                If it is OK with you, would you please share with us
                the exact DDL (including type definitions, dataset
                and index definitions) and the exact AQL queries that
                you ran against AsterixDB?
                I am just interested in checking the query plans and
                seeing what ended up being run as jobs.

                Thanks.
                Pouria

                On Thu, May 26, 2016 at 4:59 AM, Magnus Kongshem
                <[email protected]> wrote:

                    Hi,

                    There have been a lot of questions from me
                    regarding AsterixDB, and I thank all of you who
                    have answered me. So it is time for me to
                    contribute with some observations. I am writing
                    my master's thesis, in which I test multiple
                    databases on a large data set. I should also
                    mention that I have installed AsterixDB on a
                    single machine.

                    What I have observed is that AsterixDB has
                    "poorer" read performance when I specify indexes
                    on the data set compared to not having any
                    indexes. See the attachment for details; it is an
                    excerpt from my thesis explaining and describing
                    the queries, the indexes, and the test results.
                    Any thoughts on these test results?

                    I also cannot help but notice that the read
                    performance for queries touching a small,
                    medium, and large portion of the data set is
                    very similar. The largest query finds 75
                    million records and the smallest query finds 3.5
                    million records, yet they have almost the same
                    read performance. How can this be?

                    Perhaps you can use these test results in the
                    future development of AsterixDB.

                    If you would like, I can send you my final thesis
                    when it is done.

                    --
                    Best regards,

                    Magnus Alderslyst Kongshem
                    +47 415 65 906







        --
        Best regards,

        Magnus Alderslyst Kongshem
        Seniorkomiteen
        Online, linjeforeningen for informatikk
        +47 415 65 906




    --
    Best regards,

    Magnus Alderslyst Kongshem
    Seniorkomiteen
    Online, linjeforeningen for informatikk
    +47 415 65 906


