Re: My Hadoop Summit Talk: NASA+BigData

Bruce Barkstrom Wed, 20 Mar 2013 07:57:03 -0700

I'll subside after one minor note on the "sky is the archive."

I once had a course from W. W. Morgan, the U. Chicago prof who
developed the atlas of stellar types (A, O, B, etc.).  He had
the spectrum of a "standard type R".  As I recall, two weeks
after he published his atlas with the spectra, the star defining
the type became a variable.


Also, I note that on this very Google Mail page, I can get
a "Free Guide to Big Data", as well as the "IBM Big Data
Free eBook".  I suppose I don't need to go to a conference
to become informed.

Bruce B.

On Wed, Mar 20, 2013 at 10:21 AM, Mattmann, Chris A (388J) <
[email protected]> wrote:

> Hey Bruce,
>
> A couple points:
>
> On 3/20/13 5:46 AM, "Bruce Barkstrom" <[email protected]> wrote:
>
> >That may be a bit better.
> >
> >However, it still isn't clear to me how the physics of the instruments
> >and of the data processing gets into what users understand they
> >can do with the data.
>
> Yeah agreed. At the same time, this is kind of difficult to throw into
> a 45 min with 15 mins "techie talk" that I haven't even prepared yet,
> and even harder to throw in to a 100 word (what you see on the website)
> and 200 word (longer, what I sent you) abstract that they requested.
>
> >
> >As I understand Big Data and analytics, it usually appears to be using
> >a lot of statistics to find unexpected correlations in the data, but
> >the techniques aren't looking for causation.  If you're dealing with
> >scientific data, you're usually trying to get to physical causation.
> >That means, I think, that users need to understand how the
> >physics and math constrain what they can do.
>
> ++50 agreed.
>
> >
> >Let me see if I can identify a more concrete example of a
> >concern.  Usually, when we want to deal with physically
> >connected phenomena, we want disparate data to be
> >observing the same chunk of space at the same time.
> >If the Big Data user picks up one piece of data from region
> >X_1 and t_1 and then develops a correlation with observations
> >with data from X_2 and t_2, where X_1 /= X_2 and t_1 /= t_2,
> >it isn't clear why that correlation has anything to do with
> >physical causation.  Or, to put it another way, Big Data
> >may just give more examples of the "cherry picking"
> >climate deniers do when they select data without
> >paying attention to the statistical and physical significance
> >of their "results".
>
> Totally agree. A lot of the time, this is the big difference
> between card-carrying statisticians and *computer science*
> oriented *machine learning* people.
>
> >
> >So, even though the data rates are large by today's
> >standards, I'm not sure that, by itself, is impressive.
>
> Well I have to say it is impressive. Can you show me a disk
> that can today write 700 TB of data per second? Or the filesystem
> drivers and parallel I/O necessary to software them? Imagine in
> astronomy, where they are moving into the time domain, and
> away from the "sky is the archive" "so just reobserve next
> time" mentality, and thus triage, which is super important,
> isn't the main driver and archival is now becoming important,
> and necessary in these eventually 700TB/sec producing systems.
>
> There are all sorts of IO, hardware, computer science, and
> other advances that we don't have that are needed, and that
> these types of examples like the SKA will drive.
>
> OTOH, the sheer infrastructure, domestic and international policy,
> investment, and excitement and sense of nationality that many of
> these new Big Data systems (especially the SKA) are creating in
> their respective countries (e.g., in South Africa), is enough
> to at least suggest to my evidence based mind that there is
> something impressive here.
>
> >Maybe the relevant example would be all those statistics
> >on dams built or tons of steel produced by the Soviet
> >Union.  The hype would be more interesting if it could
> >talk about what new phenomena or understanding
> >these techniques will produce - not just the data rate
> >or the total amount of data being produced.
>
> Agreed, lots of data has been generated for a while. However,
> the volume (total and discrete), velocity, and variety (in
> data types, metadata, etc.) are certainly such that they are
> worthy of current study, at least in the area of data management.
>
> >
> >Maybe it's just a glorified popularity contest; if so,
> >it would seem to be at about the level of interest
> >of the new season of "Dancing with the Stars".
>
> Perhaps, but I know you guys are interested in that show :)
> Who's not?
>
> >I suppose the hype is necessary to generate the
> >funding (which has its uses), but I'm not sure it
> >will do as much as a few million sent to appropriate
> >super PACs to move the politics of climate change
> >along.
>
> Think of this as an IT super PAC for next generation data management
> techniques and systems to deal with data volumes and varieties that
> we don't have hardware or CS tools to manage yet. I'm not talking
> about writing to tape and letting it die in the morgue. I'm talking about
> even simple things like making it available after you write it to spinning
> disk.
>
> Cheers,
> Chris
>
> >
> >Bruce B.
> >
> >On Wed, Mar 20, 2013 at 1:16 AM, Mattmann, Chris A (388J) <
> >[email protected]> wrote:
> >
> >> Hey Bruce,
> >>
> >> Hah!
> >>
> >> Unfortunately all you get is the short summary through
> >> the website which does make it scientifically hard to
> >> judge, however, then again this isn't science, it's a
> >> glorified popularity contest.
> >>
> >> I have a little bit more detailed abstract that I wrote up,
> >> pasted below (of course the part that they don't use to solicit votes):
> >>
> >> ---longer abstract
> >> The NASA Jet Propulsion Laboratory, California Institute of
> >> Technology contributes to many Big Data projects for Earth science
> >> such as the U.S. National Climate Assessment (NCA) and for astronomy
> >> such as next generation astronomical instruments like the Square
> >> Kilometre Array (SKA) that will generate unprecedented volumes of
> >> data (700TB/sec!).
> >>
> >> Through these projects, we are addressing four key challenges
> >> critical for the Hadoop community and broader open source Big Data
> >> community to consider: (1) unobtrusively integrating science
> >> algorithms into large scale processing systems; (2) selecting and
> >> deploying high powered data movement technologies for data staging
> >> and remote data acquisition, processing, and delivery to our
> >> customers and users; (3) better leveraging of cloud computing
> >> (storage and processing) technologies in NASA missions; and (4)
> >> technologies for automatically and rapidly extracting text and
> >> metadata from the file formats, by some estimates ranging from a few
> >> thousand to over fifty thousand in total.
> >>
> >> This talk will focus on those Big Data challenges, how NASA JPL is
> >> addressing them both technologically (Hadoop, OODT, Tika, Nutch,
> >> Solr) and from a community standpoint (Apache, interacting with open
> >> source, etc.). I'll also discuss the future of Big Data at JPL and
> >> NASA and how others can get involved.
> >> -----
> >>
> >> You can think of that as the longer version of what I submitted. *grin*
> >>
> >> Cheers,
> >> Chris
> >>
> >>
> >>
> >> On 3/19/13 7:20 PM, "Bruce Barkstrom" <[email protected]> wrote:
> >>
> >> >OK, so you've got a three-word summary of some
> >> >hyperbole with Dumbo, the Flying Elephant.
> >> >How are you going to deal with the real
> >> >scientific constraints on the physics of combining real
> >> >measurement technologies and "mashing stuff together"?
> >> >
> >> >You need to remember that imaging instruments integrate
> >> >radiances with spectral responses and Point Spread Function
> >> >weighted averages over the FOV of whatever the instrument
> >> >was looking at - and that's just the instantaneous (L1 measurement).
> >> >If you do orthorectification, you've got variations in the
> >> >uncertainties
> >> >across the image where the parts of the image where you've
> >> >increased the resolving power (by putting interpolated points
> >> >closer together) and have also increased the noise from the
> >> >orthorectification process that acts as a noise multiplier.
> >> >
> >> >Next, you've got stuff like cloud identification (and rejection or
> >> >acceptance) - which depends on spectral response, solar illumination
> >> >(during the day) and temperature and cloud property stuff during
> >> >the night - and finally, you've got temporal interpolation (not just
> >> >creating an average through emission driven by solar illumination
> >> >during the day and IR cooling at night).  Where (the hel)l is
> >> >the physics that deals with this stuff?  If you do get some
> >> >statistical stuff, why should anyone believe it contributes to
> >> >our understanding of climate change?
> >> >
> >> >I won't vote, but you can think of this as my input to your
> >> >scientific conscience.
> >> >
> >> >Bruce B.
> >> >
> >> >On Tue, Mar 19, 2013 at 7:51 PM, Mattmann, Chris A (388J) <
> >> >[email protected]> wrote:
> >> >
> >> >> Hey Guys,
> >> >>
> >> >> I proposed a talk for NASA and Big Data at the Hadoop Summit:
> >> >>
> >> >>
> >> >> http://hadoopsummit2013.uservoice.com/forums/196822-future-of-apache-hadoop/suggestions/3733470-nasa-science-and-technology-for-big-data-junkies-
> >> >>
> >> >> If you still have votes, and would like to support my talk, I'd
> >> >> certainly appreciate it!
> >> >>
> >> >> Thank you for considering.
> >> >>
> >> >> Cheers,
> >> >> Chris Mattmann
> >> >> Vote Herder
> >> >>
> >> >>
> >>
> >>
>
>
