For those potentially not subscribed to users.....
-------- Forwarded Message --------
Subject: Re: Trio: AsterixDB, Spark and Zeppelin.
Date: Thu, 11 Aug 2016 14:42:48 -0700
From: Mike Carey <[email protected]>
Reply-To: [email protected]
To: [email protected]

Amarnath,

1. Interesting problem! That tempts me to suggest that your target dataset of Tweets should make heavier use of AsterixDB's open typing capabilities. You could make the schema for it only have the "sure thing" fields, and let the variable parts be self-describing. Have you experimented with that?
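[As a sketch of what that could look like - the type and field names below are illustrative, not the actual SDSC schema - an open datatype in AsterixDB's DDL declares only the guaranteed fields and lets each record carry extra, self-describing ones:]

```aql
-- Hypothetical DDL sketch: only the "sure thing" fields are declared.
create type TweetType as open {
  id: int64,
  text: string,
  created_at: string
};

-- Because the type is "open", loaded records may carry any additional
-- fields (e.g. place, entities) beyond the three declared above, so
-- schema drift in the raw Tweets does not cause load failures.
create dataset Tweets(TweetType) primary key id;
```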
2. This seems like two issues. First, there's an algorithm that needs to be scale tested. That seems like it could be done on Spark without forming an (unnecessary?) immediate dependency on AsterixDB's connector. Second, there's the desire/need to feed that algorithm (when it's working) data from AsterixDB so that social/health data scientists can explore how it works on different data subsets. That indeed has a dependency. How about two steps, first step first for the student? (I'm assuming, possibly wrongly, that neither step has been taken yet.)
3a. Could we get a concise list of those queries, the times, and the expectations in a little shared document somewhere (or in a JIRA issue)?

As it turns out, AsterixDB's "worst aggregate query" is the AQL equivalent of SELECT COUNT(*) FROM Tweets - because it can only run that query by scanning the data. While it does that in parallel, it's still very slow compared to what you might want. (The reason is that the Tweets are stored in a primary-keyed B-tree in AsterixDB, and in order to know how many there are, you have to inspect the leaves.) A scan would be necessary anyway for any query other than a COUNT(*) without a predicate - but in that (lone) case you end up getting a "worst case" time compared to what you might hope for.

There are engineering solutions to make unfiltered counting run faster, but it's not obvious how much time one would want to invest in that. (If you could unleash Kevin on that we could discuss the work he'd need to do - that would be one model - but it feels like that's not the "normal use case" query, so I'm not sure that'd be the right investment for anyone.)
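[For concreteness - dataset name as in this thread, filter field just a placeholder - the unfiltered count and a predicated variant would look roughly like this in AQL:]

```aql
-- Worst case: no predicate, so the entire primary B-tree is scanned
-- (in parallel) just to count the leaves.
count(for $t in dataset Tweets return $t)

-- A hypothetical filtered count; the predicate field "lang" is made up
-- for illustration. The same scan happens unless a secondary index on
-- the filtered field can be used.
count(for $t in dataset Tweets
      where $t.lang = "en"
      return $t)
```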
3b. Could we see some of the common ML queries here, to prepare accordingly? Thanks!

Cheers,
Mike

PS - Any chance you could participate in the Friday 10-11 status calls? Or perhaps we should set up a separate weekly 1/2 hour for SDSC status calls? I think it would be really good for us to view SDSC (and transitively, UCLA) as the "premier showcase customer", to make sure we have a better (less lossy/noisy) connection in place to make sure it's a big success, and I think weekly would be the right frequency to operate the connection at, if that works for you all.
On 8/10/16 5:02 PM, Gupta, Amarnath wrote:
Mike:

The timetable actually comes from Sean and Wei, whom we are trying to serve through our efforts at SDSC. There are three issues we are trying to handle right now.

1. The 2015 Twitter data set, which UCLA needs access to, is large, and we often discover that the actual schema shifts over time, causing a cascade of failures that Kevin and Ian are sorting out.

2. While we originally put Spark integration later in our timetable, Wei's group needs access to it now. I spoke to the student, and she has an algorithm that is waiting to be tested on larger-scale data.

3. *Some* aggregate queries curiously take a really long time. Ian is also aware of this. Since aggregate queries are very common for machine learning explorations, I would feel better if these queries executed within a reasonable time.

I appreciate your comment about getting higher visibility for SDSC as an early adopter of AsterixDB. Kevin has been spending a lot of time with AsterixDB and fielding the requirements we all dump on him. I am hoping that we can actually achieve something concrete through this effort.

Thanks,
Amarnath

------------------------------------------------------------------------
*From:* Michael Carey [[email protected]]
*Sent:* Wednesday, August 10, 2016 4:41 PM
*To:* [email protected]; [email protected]
*Cc:* Gupta, Amarnath; Sean Young
*Subject:* Re: Trio: AsterixDB, Spark and Zeppelin.

Kevin,

Thanks! That helps a lot. Now what we need to know (possibly above your pay grade :-)) is what the timetable is for UCLA (i) wanting to get the results of your assessment of how well what's there works and meets their needs, and (ii) wanting to put stuff into production (and at what scale).
I don't anticipate the review and merging taking forever, but this will be Wail's first AsterixDB code contribution - last I knew he was addressing initial review comments (and I'm not sure all reviews are done yet) - but I think we next need to ask UCLA/Sean/Amarnath for the timetable info.

Cheers,
Mike

On 8/10/16 1:33 PM, Coakley, Kevin wrote:

Mike,

UCLA wanted a way to use Spark's machine learning packages with data stored in AsterixDB. We started looking at the Spark connector as a way to access the data in AsterixDB directly, instead of having to export the data from AsterixDB to a file and import the file into Spark. I don't know how this fits into Amarnath's projects; I was just following up on a request from UCLA to see what would be involved in providing this Spark connector to others.

The current status is: I have the Spark connector working in a test environment with the queries provided by Wail. I was planning on loading a small amount of data into the test AsterixDB server with the Schema Inferencer code and running my own queries, but I have not had time yet.

The issue with providing others with access to the Spark connector is that the version of AsterixDB we are running that contains the Twitter data does not have the Schema Inferencer code and therefore will not work with the Spark connector. I don't believe SDSC would want to update the AsterixDB servers that contain the Twitter data with the Schema Inferencer code until after it has been approved by you and merged into the master branch. However, even after the Schema Inferencer code has been merged into the master branch, we wouldn't have it ready for people to use right away. I offered to load a small subset of the data from our main servers into my test environment that has a working Spark connector for UCLA to test, but it sounds like they misunderstood my offer.
I would be happy to help you test the Schema Inferencer and Spark connector if you have specific items that you want me to check; I can also give others that you select access to the test environment so they can run tests themselves. Otherwise, I will respond here if I discover any issues. My current test environment is Zeppelin with the Spark connector on server A, AsterixDB with the Schema Inferencer code on server B, and a Spark 1.6.0 cluster running on servers C, D and E.

-Kevin

On 8/10/16, 9:36 AM, "Mike Carey" <[email protected]> wrote:

Kevin,

Q: Could you chime back in here - please 'cc' the user list - with a brief (maybe one-paragraph) summary of what you are actually trying to do at the moment and what its current status is? (And your timeframe, etc.?)

My impression until yesterday was that you were slowly/leisurely exploring the new Spark connector to AsterixDB that Wail worked on - essentially as his first "beta" user - and that things were moving at the pace you wanted (and were setting). As an early adopter, I was also under the impression that you were using his branch for your explorations while he was addressing code review comments, etc. However, when I arrived back home in OC after a trip yesterday, I was the recipient of a message (via a back channel) warning me that there was a blocking issue at SDSC that UCI wasn't being attentive to, one that had AsterixDB on the brink of being given up on by the UCLA folks, and that we'd better get on it or.... (Meanwhile I had not heard any such thing from UCLA directly; I was not aware of any blocking Spark issues for SDSC, nor of any transitively blocking implications for UCLA, and it still doesn't look from what I see below like there was one.)

I think that we need to have SDSC's activities be much more visible here - likewise for UCLA's - so that the Apache AsterixDB community has much better visibility into the goals, activities, progress, and problems of our early adopters.
The community wants users to be successful! It will be much more effective (and healthy and productive) if we all know what's going on and it is clear to all how each of those things is going.

Thanks!
Mike

On 8/10/16 8:36 AM, Wail Alkowaileet wrote:
> Hi Kevin,
>
> Cool!
> Please let me know if you need any assistance.
>
> On Aug 8, 2016 1:42 PM, "Coakley, Kevin" <[email protected]> wrote:
>
>> Hi Wail,
>>
>> I figured out the problem: AsterixDB was configured for 127.0.0.1. The
>> notebook at https://github.com/Nullification/asterixdb-spark-
>> connector/blob/master/zeppelin-notebook/asterixdb-spark-example/note.json
>> ran successfully once I recreated the AsterixDB instance to use the
>> external IP.
>>
>> I have not run any of my own queries, but I did get both of the examples
>> at https://github.com/Nullification/asterixdb-spark-connector to run
>> successfully.
>>
>> Thank you!
>>
>> -Kevin
>>
>>
>>
>> On 8/3/16, 10:23 AM, "Wail Alkowaileet" <[email protected]> wrote:
>>
>> One more thing:
>> Can you paste your cluster configuration as well?
>>
>> Thanks
>>
>> (ETC ETC ETC deleted)
